pyspark.sql.functions.histogram_numeric

pyspark.sql.functions.histogram_numeric(col, nBins)

Computes a histogram on numeric ‘col’ using ‘nBins’ bins. The return value is an array of (x, y) pairs, where x is the center of a histogram bin and y is its height. As the value of ‘nBins’ is increased, the histogram approximation gets finer-grained, but may yield artifacts around outliers. In practice, 20-40 histogram bins appear to work well, with more bins being required for skewed or smaller datasets.

Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by the R/S-Plus statistical computing packages.

Note: the output type of the ‘x’ field in the return value is propagated from the input value consumed in the aggregate function.
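To make the idea of non-uniform bins concrete, here is a minimal pure-Python sketch of one common streaming-histogram technique: keep a sorted list of (center, count) bins and, whenever the bin limit is exceeded, merge the two closest centers into their weighted mean. This is an illustration under stated assumptions, not Spark's implementation; the name `streaming_histogram` and its parameters are hypothetical, and the bins Spark returns for the same data may differ.

```python
from bisect import bisect_left


def streaming_histogram(values, n_bins):
    """Return (x, y) pairs: x is a bin center, y the count of values merged into it.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")

    bins = []  # sorted list of [center, count]
    for value in values:
        v = float(value)
        centers = [b[0] for b in bins]
        i = bisect_left(centers, v)

        # An existing bin with exactly this center just gains one observation.
        if i < len(bins) and bins[i][0] == v:
            bins[i][1] += 1
            continue

        # Otherwise start a new single-value bin at the sorted position.
        bins.insert(i, [v, 1.0])

        if len(bins) > n_bins:
            # Find the adjacent pair whose centers are closest together.
            j = min(range(len(bins) - 1),
                    key=lambda k: bins[k + 1][0] - bins[k][0])
            (x1, y1), (x2, y2) = bins[j], bins[j + 1]
            # Replace the pair with one bin at the count-weighted mean center.
            merged = [(x1 * y1 + x2 * y2) / (y1 + y2), y1 + y2]
            bins[j:j + 2] = [merged]

    return [(x, y) for x, y in bins]


print(streaming_histogram([1, 2, 3, 4, 100], 3))
# [(1.5, 2.0), (3.5, 2.0), (100.0, 1.0)]
```

In this sketch the outlier 100 keeps a bin of its own while the dense values are merged around 1.5 and 3.5, so the bin widths end up non-uniform and a single outlier consumes one of the available bins.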

New in version 3.5.0.

Parameters

col : Column or str
    target column to work on.
nBins : Column or str
    number of histogram bins.

Returns

Column: a histogram on numeric ‘col’ using ‘nBins’ bins.

Examples

>>> from pyspark.sql.functions import histogram_numeric, lit
>>> df = spark.createDataFrame([("a", 1),
...                             ("a", 2),
...                             ("a", 3),
...                             ("b", 8),
...                             ("b", 2)], ["c1", "c2"])
>>> df.select(histogram_numeric('c2', lit(5))).show()
+------------------------+
|histogram_numeric(c2, 5)|
+------------------------+
|    [{1, 1.0}, {2, 1....|
+------------------------+