hll_union

Merges the binary representations of two DataSketches HllSketch objects, using a DataSketches Union object. Throws an exception if the sketches have different lgConfigK values and allowDifferentLgConfigK is unset or set to false.

Syntax

from pyspark.sql import functions as sf

sf.hll_union(col1, col2, allowDifferentLgConfigK=None)

Parameters

Parameter                Type                       Description
col1                     pyspark.sql.Column or str  The first HLL sketch.
col2                     pyspark.sql.Column or str  The second HLL sketch.
allowDifferentLgConfigK  bool, optional             Whether to allow merging sketches with different lgConfigK values. Defaults to None, which is treated as false.

Returns

pyspark.sql.Column: The binary representation of the merged HllSketch.

Examples

Example 1: Union two HLL sketches

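from pyspark.sql import functions as sf
# v1 holds the distinct values {1, 2, 3}; v2 holds {4, 5, 6}.
df = spark.createDataFrame([(1,4),(2,5),(2,5),(3,6)], "struct<v1:int,v2:int>")
# Build one HLL sketch per column.
df = df.agg(
    sf.hll_sketch_agg("v1").alias("sketch1"),
    sf.hll_sketch_agg("v2").alias("sketch2")
)
# Merge the two sketches and estimate the number of distinct values in the union.
df.select(sf.hll_sketch_estimate(sf.hll_union(df.sketch1, "sketch2"))).show()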
+-------------------------------------------------------+
|hll_sketch_estimate(hll_union(sketch1, sketch2, false))|
+-------------------------------------------------------+
|                                                      6|
+-------------------------------------------------------+