Returns the top `k` most frequently occurring item values in a string, boolean, date, timestamp, or numeric column `col`, along with their approximate counts. The error in each count may be up to `2.0 * numRows / maxItemsTracked`, where `numRows` is the total number of rows. `k` (default: 5) and `maxItemsTracked` (default: 10000) are both integer parameters. Higher values of `maxItemsTracked` provide better accuracy at the cost of increased memory usage. Columns that have fewer than `maxItemsTracked` distinct items yield exact item counts. `NULL` values are counted as their own item value in the results.
Results are returned as an array of structs, each containing an item value (with its original input type) and its occurrence count (`long` type), sorted by count in descending order.
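For a concrete sense of the error bound above, here is a small arithmetic sketch; the row count below is a made-up example, not a value from this page:

```python
# Worst-case error per reported count: 2.0 * numRows / maxItemsTracked.
num_rows = 1_000_000            # hypothetical table size
max_items_tracked = 10_000      # the default
max_count_error = 2.0 * num_rows / max_items_tracked
print(max_count_error)          # 200.0 -- each count may be off by up to 200 rows
```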
Syntax
```python
from pyspark.databricks.sql import functions as dbsf

dbsf.approx_top_k(col, k=5, maxItemsTracked=10000)
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `col` | `pyspark.sql.Column` or column name | Column to find the top k items from. |
| `k` | `pyspark.sql.Column` or `int`, optional | Number of top items to return. Default is 5. |
| `maxItemsTracked` | `pyspark.sql.Column` or `int`, optional | Maximum number of distinct items to track. Default is 10000. Higher values provide better accuracy at the cost of increased memory usage. |
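Because the table above lists `pyspark.sql.Column` as an accepted type for `k` and `maxItemsTracked`, a literal Column built with `lit()` should be interchangeable with a plain `int`. The sketch below assumes that form; the `lit()` usage is an inference from the parameter types, not something shown elsewhere on this page:

```python
from pyspark.sql.functions import col, lit
from pyspark.databricks.sql.functions import approx_top_k

df = spark.range(0, 1000).select((col("id") % 3).alias("item"))

# The plain-int form and the literal-Column form of the same call.
by_ints = df.select(approx_top_k("item", 3, 10000).alias("top_k"))
by_cols = df.select(approx_top_k("item", lit(3), lit(10000)).alias("top_k"))
```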
Examples
```python
from pyspark.sql.functions import col
from pyspark.databricks.sql.functions import approx_top_k

# Single-partition DataFrame whose "item" column cycles through 0, 1, 2.
item = (col("id") % 3).alias("item")
df = spark.range(0, 1000, 1, 1).select(item)

df.select(
    approx_top_k("item", 5).alias("top_k")
).printSchema()
```

```
root
 |-- top_k: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item: long (nullable = true)
 |    |    |-- count: long (nullable = false)
```
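To inspect the values rather than the schema, one option is to explode the returned array into one row per struct. This is a sketch continuing the example above; since the input has only 3 distinct items, far fewer than `maxItemsTracked`, the counts should be exact (334 for item 0, 333 each for items 1 and 2), though the ordering of tied counts is not guaranteed:

```python
from pyspark.sql.functions import explode

top = df.select(approx_top_k("item", 5).alias("top_k"))
top.select(explode("top_k").alias("entry")) \
   .select("entry.item", "entry.count") \
   .show()
```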