Computes a histogram on numeric 'col' using nBins bins. The return value is an array of (x, y) pairs representing the centers of the histogram's bins. As the value of 'nBins' is increased, the histogram approximation gets finer-grained, but may yield artifacts around outliers. In practice, 20-40 histogram bins appear to work well, with more bins being required for skewed or smaller datasets. Note that this function creates a histogram with non-uniform bin widths. It offers no guarantees in terms of the mean-squared error of the histogram, but in practice it is comparable to the histograms produced by the R/S-Plus statistical computing packages. Note: the output type of the 'x' field in the return value is propagated from the input value consumed in the aggregate function.
Syntax
from pyspark.sql import functions as sf
sf.histogram_numeric(col, nBins)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or str | Target column to work on. |
| nBins | pyspark.sql.Column | Number of histogram bins. |
Returns
pyspark.sql.Column: a histogram on numeric 'col' using nBins bins.
Examples
Example 1: Compute histogram with 5 bins
from pyspark.sql import functions as sf
df = spark.range(100, numPartitions=1)
df.select(sf.histogram_numeric('id', sf.lit(5))).show(truncate=False)
+-----------------------------------------------------------+
|histogram_numeric(id, 5) |
+-----------------------------------------------------------+
|[{11, 25.0}, {36, 24.0}, {59, 23.0}, {84, 25.0}, {98, 3.0}]|
+-----------------------------------------------------------+
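Example 2: Vary the number of bins (illustrative sketch)
The following sketch is not part of the original reference output; it assumes the same spark session as Example 1 and illustrates two points from the description above: a larger nBins yields a finer-grained approximation, and the type of the 'x' field follows the input column's type. The exact bin centers and counts depend on the internal sketch, so no output is shown.
from pyspark.sql import functions as sf
df = spark.range(100, numPartitions=1)
# Request a coarse (3-bin) and a fine (10-bin) histogram of the same column.
df.select(
    sf.histogram_numeric('id', sf.lit(3)).alias('coarse'),
    sf.histogram_numeric('id', sf.lit(10)).alias('fine')
).show(truncate=False)
# The 'x' field type is propagated from the input: casting 'id' to double
# produces double-typed bin centers, visible in the result schema.
df.select(sf.histogram_numeric(sf.col('id').cast('double'), sf.lit(5))).printSchema()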