pyspark.pandas.DataFrame.spark.apply

spark.apply(func: Callable[[pyspark.sql.dataframe.DataFrame], pyspark.sql.dataframe.DataFrame], index_col: Union[str, List[str], None] = None) → ps.DataFrame

Applies a function that takes and returns a Spark DataFrame. This lets you natively apply Spark functions and DataFrame APIs to the Spark DataFrame that pandas-on-Spark uses internally.

Note

Set index_col and keep a column of the same name in the output Spark DataFrame to avoid using the default index and its performance penalty. If index_col is omitted, the default index is used, which is potentially expensive in general.

Note

The output loses the original column labels. This is equivalent to func(psdf.to_spark(index_col)).to_pandas_on_spark(index_col).

Parameters
func : function

Function to apply against the data, taking and returning a Spark DataFrame.

Returns
DataFrame
Raises
ValueError : If the output from the function is not a Spark DataFrame.

Examples

>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, columns=["a", "b"])
>>> psdf
   a  b
0  1  4
1  2  5
2  3  6
>>> psdf.spark.apply(
...     lambda sdf: sdf.selectExpr("a + b as c", "index"), index_col="index")
       c
index
0      5
1      7
2      9
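
As the note above states, this is equivalent to converting to a Spark DataFrame, applying the function, and converting back:

>>> psdf.to_spark(index_col="index").selectExpr(
...     "a + b as c", "index").to_pandas_on_spark(index_col="index")
       c
index
0      5
1      7
2      9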

The case below ends up using the default index, which should be avoided if possible.

>>> psdf.spark.apply(lambda sdf: sdf.groupby("a").count().sort("a"))
   a  count
0  1      1
1  2      1
2  3      1
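
One way to avoid the default index in such a case is to expose the grouped key under the index_col name yourself. A sketch, reusing the frame above with the grouped column a serving as the index:

>>> psdf.spark.apply(
...     lambda sdf: sdf.groupby("a").count().sort("a").selectExpr(
...         "a as index", "count"),
...     index_col="index")
       count
index
1          1
2          1
3          1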