count_min_sketch

Returns a count-min sketch of a column with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure that estimates item frequencies in sub-linear space.

Syntax

from pyspark.sql import functions as sf

sf.count_min_sketch(col, eps, confidence, seed=None)

Parameters

| Parameter  | Type                                  | Description                                     |
|------------|---------------------------------------|-------------------------------------------------|
| col        | pyspark.sql.Column or str             | Target column to compute on.                    |
| eps        | pyspark.sql.Column or float           | Relative error, must be positive.               |
| confidence | pyspark.sql.Column or float           | Confidence, must be positive and less than 1.0. |
| seed       | pyspark.sql.Column or int, optional   | Random seed.                                    |
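
To make the effect of eps and confidence more concrete, the sketch below estimates the table dimensions they produce. The formulas (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)) are an assumption based on the open-source Spark sketch implementation, not something this page states, and the helper name sketch_dimensions is hypothetical. The results agree with the depth and width fields visible in the example outputs on this page.

```python
import math


def sketch_dimensions(eps: float, confidence: float) -> tuple:
    """Return (depth, width) of a count-min sketch under the assumed formulas."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be positive and less than 1.0")
    # Assumed sizing rule: a smaller eps gives a wider table (more counters per row).
    width = math.ceil(2 / eps)
    # Assumed sizing rule: a higher confidence gives more rows (more hash functions).
    depth = math.ceil(-math.log1p(-confidence) / math.log(2))
    return depth, width


# Argument pairs taken from the three examples on this page.
print(sketch_dimensions(3.0, 0.1))  # (1, 1)
print(sketch_dimensions(1.0, 0.3))  # (1, 2)
print(sketch_dimensions(1.5, 0.2))  # (1, 2)
```

The example arguments on this page are deliberately loose, which keeps the serialized output short; tighter settings would likely produce a much larger byte array.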

Returns

pyspark.sql.Column: count-min sketch of the column, serialized as an array of bytes.

Examples

Example 1: Using columns as arguments

from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch(sf.col("id"), sf.lit(3.0), sf.lit(0.1), sf.lit(1)))
).show(truncate=False)
+------------------------------------------------------------------------+
|hex(count_min_sketch(id, 3.0, 0.1, 1))                                  |
+------------------------------------------------------------------------+
|0000000100000000000000640000000100000001000000005D8D6AB90000000000000064|
+------------------------------------------------------------------------+
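
The hex string above can be inspected without a Spark session. The following is a minimal sketch that assumes the serialized layout is big-endian and consists of a 4-byte version, an 8-byte total count, a 4-byte depth, a 4-byte width, then depth 8-byte hash coefficients, then depth x width 8-byte counters. That layout is an assumption drawn from the open-source Spark sketch format, and decode_sketch is a hypothetical helper, not part of the PySpark API.

```python
import struct


def decode_sketch(hex_str: str) -> dict:
    """Decode a hex-encoded count-min sketch under the assumed byte layout."""
    raw = bytes.fromhex(hex_str)
    # Header: version (int32), total count (int64), depth (int32), width (int32).
    version, total_count, depth, width = struct.unpack_from(">iqii", raw, 0)
    offset = struct.calcsize(">iqii")
    # One 64-bit hash coefficient per row.
    hash_a = list(struct.unpack_from(f">{depth}q", raw, offset))
    offset += 8 * depth
    # The counter table, stored row by row.
    flat = struct.unpack_from(f">{depth * width}q", raw, offset)
    table = [list(flat[r * width:(r + 1) * width]) for r in range(depth)]
    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_a": hash_a,
        "table": table,
    }


# Hex output copied from Example 1.
example_1 = "0000000100000000000000640000000100000001000000005D8D6AB90000000000000064"
decoded = decode_sketch(example_1)
print(decoded["total_count"], decoded["depth"], decoded["width"], decoded["table"])
# 100 1 1 [[100]]
```

With a 1 x 1 table, all 100 rows fall into a single counter, so the sketch can only report the total. This helper is for inspection only; to query estimates, deserialize the bytes to a CountMinSketch as described at the top of this page.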

Example 2: Using numbers as arguments

from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", 1.0, 0.3, 2))
).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.0, 0.3, 2))                                                  |
+----------------------------------------------------------------------------------------+
|0000000100000000000000640000000100000002000000005D96391C00000000000000320000000000000032|
+----------------------------------------------------------------------------------------+

Example 3: Using a long seed

from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.2, 1111111111111111111))
).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.5, 0.2, 1111111111111111111))                                |
+----------------------------------------------------------------------------------------+
|00000001000000000000006400000001000000020000000044078BA100000000000000320000000000000032|
+----------------------------------------------------------------------------------------+