pyspark.pandas.DataFrame.assign

DataFrame.assign(**kwargs)

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters

**kwargs : dict of {str: callable, Series or Index}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change the input DataFrame (though pandas-on-Spark doesn't check it). If the values are not callable (e.g. a Series or a literal), they are simply assigned.

Returns

DataFrame: A new DataFrame with the new columns in addition to all the existing columns.

Notes

Multiple columns can be assigned within the same assign call, but a value cannot refer to a column that is created or modified elsewhere in that call. pandas supports such references (for Python 3.6 and later), but pandas-on-Spark does not: in pandas-on-Spark, all items are computed first, and then assigned.

Examples

>>> df = ps.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same result can be achieved by directly referencing an existing Series or Index. Multiple columns can also be created within the same assign call:

>>> assigned = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32,
...                      temp_k=df['temp_c'] + 273.15,
...                      temp_idx=df.index)
>>> assigned[['temp_c', 'temp_f', 'temp_k', 'temp_idx']]
          temp_c  temp_f  temp_k  temp_idx
Portland    17.0    62.6  290.15  Portland
Berkeley    25.0    77.0  298.15  Berkeley