Returns the top `k` most frequently occurring item values in a string, boolean, date, timestamp, or numeric column `col`, along with their approximate counts. The error in each count may be up to `2.0 * numRows / maxItemsTracked`, where `numRows` is the total number of rows. `k` (default: 5) and `maxItemsTracked` (default: 10000) are both integer parameters. Higher values of `maxItemsTracked` provide better accuracy at the cost of increased memory usage. Columns that have fewer than `maxItemsTracked` distinct items yield exact item counts. `NULL` values are counted as their own item value in the results.
Results are returned as an array of structs, each containing an item value (with its original input type) and its occurrence count (`long` type), sorted by count in descending order.
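For a concrete sense of the error bound above, here is a small arithmetic sketch; the row count below is a made-up example, not a value from this page:

```python
# Worst-case error per reported count: 2.0 * numRows / maxItemsTracked.
num_rows = 1_000_000            # hypothetical table size
max_items_tracked = 10_000      # the default
max_count_error = 2.0 * num_rows / max_items_tracked
print(max_count_error)          # 200.0 -- each count may be off by up to 200 rows
```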
Syntax
```python
from pyspark.databricks.sql import functions as dbsf

dbsf.approx_top_k(col, k=5, maxItemsTracked=10000)
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `col` | `pyspark.sql.Column` or column name | Column to find the top k items from. |
| `k` | `pyspark.sql.Column` or `int`, optional | Number of top items to return. Default is 5. |
| `maxItemsTracked` | `pyspark.sql.Column` or `int`, optional | Maximum number of distinct items to track. Default is 10000. Higher values provide better accuracy at the cost of increased memory usage. |
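Because the table above lists `pyspark.sql.Column` as an accepted type for `k` and `maxItemsTracked`, a literal Column built with `lit()` should be interchangeable with a plain `int`. The sketch below assumes that form; the `lit()` usage is an inference from the parameter types, not something shown elsewhere on this page:

```python
from pyspark.sql.functions import col, lit
from pyspark.databricks.sql.functions import approx_top_k

df = spark.range(0, 1000).select((col("id") % 3).alias("item"))

# The plain-int form and the literal-Column form of the same call.
by_ints = df.select(approx_top_k("item", 3, 10000).alias("top_k"))
by_cols = df.select(approx_top_k("item", lit(3), lit(10000)).alias("top_k"))
```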
Examples
```python
from pyspark.sql.functions import col
from pyspark.databricks.sql.functions import approx_top_k

# Single-partition DataFrame whose "item" column cycles through 0, 1, 2.
item = (col("id") % 3).alias("item")
df = spark.range(0, 1000, 1, 1).select(item)

df.select(
    approx_top_k("item", 5).alias("top_k")
).printSchema()
```

```
root
 |-- top_k: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item: long (nullable = true)
 |    |    |-- count: long (nullable = false)
```
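To inspect the values rather than the schema, one option is to explode the returned array into one row per struct. This is a sketch continuing the example above; since the input has only 3 distinct items, far fewer than `maxItemsTracked`, the counts should be exact (334 for item 0, 333 each for items 1 and 2), though the ordering of tied counts is not guaranteed:

```python
from pyspark.sql.functions import explode

top = df.select(approx_top_k("item", 5).alias("top_k"))
top.select(explode("top_k").alias("entry")) \
   .select("entry.item", "entry.count") \
   .show()
```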