Returns a count-min sketch of a column with the given eps, confidence, and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure used for frequency estimation (approximate per-item counts) in sub-linear space.
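To make the data structure concrete, the following is a minimal pure-Python sketch of the count-min idea: a small grid of counters where each row hashes an item to one bucket, and the estimate is the minimum over the rows. `ToyCountMinSketch` and its salted SHA-256 bucketing are illustrative assumptions, not Spark's implementation, and the toy is not compatible with the bytes this function returns.

```python
import hashlib


class ToyCountMinSketch:
    """Toy count-min sketch: `depth` rows of `width` counters (not Spark's implementation)."""

    def __init__(self, depth: int, width: int):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row: int, item) -> int:
        # Assumption: a salted SHA-256 stands in for the per-row hash functions.
        digest = hashlib.sha256(f"{row}:{item!r}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count: int = 1) -> None:
        # Every row increments exactly one counter for the item.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item) -> int:
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.depth))


sketch = ToyCountMinSketch(depth=3, width=16)
for value in ["a"] * 5 + ["b"] * 2:
    sketch.add(value)
print(sketch.estimate("a"))  # at least 5; a count-min sketch never underestimates
```

Because counters are only ever incremented, an estimate can be too high when items collide but never too low; more rows and wider rows make collisions less likely.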
Syntax
from pyspark.sql import functions as sf
sf.count_min_sketch(col, eps, confidence, seed=None)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or str | Target column to compute on. |
| eps | pyspark.sql.Column or float | Relative error, must be positive. |
| confidence | pyspark.sql.Column or float | Confidence, must be positive and less than 1.0. |
| seed | pyspark.sql.Column or int, optional | Random seed. |
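As a rough guide to how eps and confidence affect the size of the sketch, the snippet below computes a width and depth from them. The formulas (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)) are an assumption about how Spark sizes the sketch and are not stated on this page; `sketch_dimensions` is a hypothetical helper, not part of PySpark.

```python
import math


def sketch_dimensions(eps: float, confidence: float) -> tuple[int, int]:
    """Return (width, depth) for the given parameters, under the assumed formulas."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be positive and less than 1.0")
    width = math.ceil(2 / eps)  # counters per row
    depth = math.ceil(-math.log1p(-confidence) / math.log(2))  # number of rows
    return width, depth


print(sketch_dimensions(3.0, 0.1))  # (1, 1)
print(sketch_dimensions(1.0, 0.3))  # (2, 1)
```

Under these formulas, a smaller eps widens each row and a confidence closer to 1.0 adds rows, so tighter guarantees cost more memory.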
Returns
pyspark.sql.Column: the count-min sketch of the column, serialized as an array of bytes.
Examples
Example 1: Using columns as arguments
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch(sf.col("id"), sf.lit(3.0), sf.lit(0.1), sf.lit(1)))
).show(truncate=False)
+------------------------------------------------------------------------+
|hex(count_min_sketch(id, 3.0, 0.1, 1))                                  |
+------------------------------------------------------------------------+
|0000000100000000000000640000000100000001000000005D8D6AB90000000000000064|
+------------------------------------------------------------------------+
Example 2: Using numbers as arguments
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", 1.0, 0.3, 2))
).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.0, 0.3, 2))                                                  |
+----------------------------------------------------------------------------------------+
|0000000100000000000000640000000100000002000000005D96391C00000000000000320000000000000032|
+----------------------------------------------------------------------------------------+
Example 3: Using a long seed
from pyspark.sql import functions as sf
spark.range(100).select(
    sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.2, 1111111111111111111))
).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.5, 0.2, 1111111111111111111))                                |
+----------------------------------------------------------------------------------------+
|00000001000000000000006400000001000000020000000044078BA100000000000000320000000000000032|
+----------------------------------------------------------------------------------------+
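The hex strings in the examples can be inspected without Spark. The sketch below decodes one with Python's `struct` module, assuming a big-endian layout of version (int), total count (long), depth (int), width (int), one long per row of hash coefficients, then depth × width long counters. That layout is inferred from the example outputs above rather than documented on this page, and `parse_count_min_sketch` is a hypothetical helper, not a PySpark API.

```python
import struct


def parse_count_min_sketch(data: bytes) -> dict:
    """Decode a serialized sketch under the assumed layout described above."""
    header = ">iqii"  # version, total count, depth, width (big-endian, no padding)
    version, total_count, depth, width = struct.unpack_from(header, data, 0)
    offset = struct.calcsize(header)
    # Assumption: one 8-byte hash coefficient per row follows the header.
    hash_a = list(struct.unpack_from(f">{depth}q", data, offset))
    offset += 8 * depth
    # Assumption: the counters follow row by row as 8-byte integers.
    flat = struct.unpack_from(f">{depth * width}q", data, offset)
    table = [list(flat[row * width:(row + 1) * width]) for row in range(depth)]
    return {
        "version": version,
        "total_count": total_count,
        "depth": depth,
        "width": width,
        "hash_a": hash_a,
        "table": table,
    }


# Hex output copied from Example 2.
example_2_hex = (
    "0000000100000000000000640000000100000002"
    "000000005D96391C00000000000000320000000000000032"
)
parsed = parse_count_min_sketch(bytes.fromhex(example_2_hex))
print(parsed["total_count"], parsed["depth"], parsed["width"])  # 100 1 2
print(parsed["table"])  # [[50, 50]]
```

Read this way, Example 2 holds a single row of two counters that split the 100 input rows evenly. For real use, deserialize the bytes to a CountMinSketch as described at the top of this page rather than relying on this layout.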