Returns the top k most frequently occurring item values in a string, boolean, date, timestamp, or numeric column col along with their approximate counts. The error in each count may be up to 2.0 * numRows / maxItemsTracked where numRows is the total number of rows. k (default: 5) and maxItemsTracked (default: 10000) are both integer parameters. Higher values of maxItemsTracked provide better accuracy at the cost of increased memory usage. Columns that have fewer than maxItemsTracked distinct items will yield exact item counts. NULL values are included as their own value in the results.
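To get a feel for the error bound above, the arithmetic can be sketched in plain Python. The helper name `max_count_error` is hypothetical and not part of the Databricks API; it only evaluates the stated formula 2.0 * numRows / maxItemsTracked.

```python
def max_count_error(num_rows: int, max_items_tracked: int = 10000) -> float:
    """Worst-case error in a single count, using the bound quoted in the description.

    Hypothetical helper for illustration only; not a Databricks function.
    """
    if max_items_tracked <= 0:
        raise ValueError("max_items_tracked must be positive")
    return 2.0 * num_rows / max_items_tracked


# With 1,000,000 rows and the default maxItemsTracked of 10000,
# each reported count may be off by up to 200.
print(max_count_error(1_000_000))            # 200.0

# Tracking ten times as many items tightens the bound tenfold.
print(max_count_error(1_000_000, 100_000))   # 20.0
```

Under this bound, raising maxItemsTracked is the only lever for accuracy at a fixed row count, which is why it trades directly against memory usage.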
Results are returned as an array of structs containing item values (with their original input type) and their occurrence count (long type), sorted by count descending.
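The result shape described above can be mimicked in plain Python. This is only a rough analogue under stated assumptions: it counts exactly with `collections.Counter` rather than using the approximate algorithm behind approx_top_k, and the name `exact_top_k` is hypothetical.

```python
from collections import Counter


def exact_top_k(values, k=5):
    """Return up to k items as {"item", "count"} records, highest count first.

    Illustration of the result shape only: counts here are exact,
    whereas approx_top_k returns approximate counts.
    """
    counts = Counter(values)  # None is hashable, so it is counted as its own value
    return [{"item": item, "count": count} for item, count in counts.most_common(k)]


values = ["a", "b", "a", None, "a", None, "c"]
print(exact_top_k(values, k=2))
# [{'item': 'a', 'count': 3}, {'item': None, 'count': 2}]
```

Note that `None` appears in the output as an ordinary item, mirroring how NULL values are included as their own value in the results.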
Syntax
from pyspark.databricks.sql import functions as dbsf
dbsf.approx_top_k(col, k=5, maxItemsTracked=10000)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or column name | Column to find top k items from. |
| k | pyspark.sql.Column or int, optional | Number of top items to return. Default is 5. |
| maxItemsTracked | pyspark.sql.Column or int, optional | Maximum number of distinct items to track. Default is 10000. Higher values provide better accuracy at the cost of increased memory usage. |
Examples
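from pyspark.sql.functions import col
from pyspark.databricks.sql.functions import approx_top_k

# Build a single-partition DataFrame of 1000 rows whose "item" column cycles through 0, 1, 2.
item = (col("id") % 3).alias("item")
df = spark.range(0, 1000, 1, 1).select(item)

# Request the top 5 items and inspect the schema of the resulting array of structs.
df.select(
    approx_top_k("item", 5).alias("top_k")
).printSchema()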
root
 |-- top_k: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item: long (nullable = true)
 |    |    |-- count: long (nullable = false)