pyspark.pandas.DataFrame.groupby
DataFrame.groupby(by: Union[Any, Tuple[Any, …], Series, List[Union[Any, Tuple[Any, …], Series]]], axis: Union[int, str] = 0, as_index: bool = True, dropna: bool = True) → DataFrameGroupBy

Group DataFrame or Series using one or more columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
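The split, apply, and combine steps described above can be sketched in plain Python. This is only a conceptual illustration, not how pandas-on-Spark executes a groupby; the helper name `split_apply_combine` and the `rows` list are hypothetical and exist only for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Toy (key, value) rows mirroring the Animal / Max Speed data in the Examples.
rows = [
    ("Falcon", 380.0),
    ("Falcon", 370.0),
    ("Parrot", 24.0),
    ("Parrot", 26.0),
]


def split_apply_combine(rows, func):
    """Hypothetical helper: group rows by key and reduce each group with func."""
    # Split: collect the values that belong to each key.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Apply and combine: reduce each group, then gather results by key.
    return {key: func(values) for key, values in sorted(groups.items())}


mean_speed = split_apply_combine(rows, mean)
print(mean_speed)  # {'Falcon': 375.0, 'Parrot': 25.0}
```

The dictionary keys play the role of the group labels that groupby places in the index of an aggregated result.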
Parameters
    by : Series, label, or list of labels
        Used to determine the groups for the groupby. If a Series is passed, its values are used to determine the groups. A label or list of labels may be passed to group by the columns in self.
    axis : int, default 0 or ‘index’
        Can only be set to 0 at the moment.
    as_index : bool, default True
        For aggregated output, return an object with the group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
    dropna : bool, default True
        If True and the group keys contain NA values, those NA values are dropped together with their row/column. If False, NA values are also treated as a key in groups.
Returns
    DataFrameGroupBy or SeriesGroupBy
        Depends on the calling object. The returned groupby object contains information about the groups.
See also
pyspark.pandas.groupby.GroupBy
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]},
...                   columns=['Animal', 'Max Speed'])
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean().sort_index()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
>>> df.groupby(['Animal'], as_index=False).mean().sort_values('Animal')
...
   Animal  Max Speed
...Falcon      375.0
...Parrot       25.0
The dropna parameter controls whether NA values are kept as group keys. With the default, dropna=True, rows whose key is NA are excluded:
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = ps.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum().sort_index()
     a  c
b
1.0  2  3
2.0  2  5
>>> df.groupby(by=["b"], dropna=False).sum().sort_index()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4