Lưu ý
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử đăng nhập hoặc thay đổi thư mục.
Cần có ủy quyền mới truy nhập được vào trang này. Bạn có thể thử thay đổi thư mục.
Applies to:
Databricks SQL
Databricks Runtime 10.2 and above
Computes a histogram on expr using numBins bins.
Syntax
histogram_numeric ( [ALL | DISTINCT ] expr, numBins ) [ FILTER ( WHERE cond ) ]
Arguments
expr: A numeric,TIMESTAMP,DATE, orINTERVALexpression that the function consumes and computes the histogram on.numBins: AnINTEGERliteral that must be greater than 1, specifying the number of bins for the histogram computation.cond: An optionalBOOLEANexpression that filters the rows for aggregation.
Returns
The return value is an ARRAY of STRUCTS with fields x and y representing the centers of the histogram's bins.
The type of x is the same as the type of expr, and the type of y is DOUBLE.
Increasing the value of numBins refines the histogram approximation, making it finer-grained. However, it can introduce artifacts around outliers.
Typically, 20-40 bins are effective for histograms, though skewed or smaller datasets might require more bins.
Note that this function creates a histogram with non-uniform bin widths.
It offers no guarantees in terms of the mean-squared-error of the histogram, but in practice is comparable to the histograms produced by other computing packages.
Specifying DISTINCT makes the function operate only on a unique set of expr values.
Examples
> SELECT histogram_numeric(col, 5)
FROM VALUES (0), (1), (2), (10) AS tab(col);
[{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (0L), (1L), (2L), (10L) AS tab(col);
[{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (0F), (1F), (2F), (10F) AS tab(col);
[{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (0D), (1D), (2D), (10D) AS tab(col);
[{"x":0.0,"y":1.0},{"x":1.0,"y":1.0},{"x":2.0,"y":1.0},{"x":10.0,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (INTERVAL 0 YEAR), (INTERVAL 1 YEAR), (INTERVAL 2 YEAR),
(INTERVAL 3 YEAR) AS tab(col);
[{"x":0-0,"y":1.0},{"x":1-0,"y":1.0},{"x":2-0,"y":1.0},{"x":3-0,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (INTERVAL 0 DAY), (INTERVAL 1 DAY), (INTERVAL 2 DAY),
(INTERVAL 3 DAY) AS tab(col);
[{"x":0 00:00:00.000000000,"y":1.0},{"x":1 00:00:00.000000000,"y":1.0},{"x":2 00:00:00.000000000,"y":1.0},{"x":3 00:00:00.000000000,"y":1.0}]
> SELECT histogram_numeric(col, 5)
FROM VALUES (TIMESTAMP '2020-01-01'), (TIMESTAMP'2020-02-01'),
(TIMESTAMP'2020-03-01'), (TIMESTAMP'2020-10-01') AS tab(col)
[{"x":2020-01-01 00:00:00,"y":1.0},{"x":2020-02-01 00:00:00,"y":1.0},{"x":2020-03-01 00:00:00,"y":1.0},{"x":2020-10-01 00:00:00,"y":1.0}]