theta_union_agg

Aggregate function: returns the compact binary representation of the Apache DataSketches Theta sketch formed by taking the union of all Theta sketches in the input column.
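To give some intuition for what "union of Theta sketches" means, here is a toy, pure-Python sketch of the underlying idea (keep the k smallest hash values, then merge). It is only an illustration under simplifying assumptions: the helper names (`make_sketch`, `union_sketches`, `estimate`), the SHA-256 hashing, and the in-memory list format are all hypothetical and are not the DataSketches implementation or its binary format.

```python
import hashlib

K = 16  # hypothetical nominal entry count for this toy sketch (2**4)


def _hash01(item):
    """Map an item to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def make_sketch(items, k=K):
    """Toy sketch: the k smallest distinct hash values, sorted ascending."""
    return sorted({_hash01(x) for x in items})[:k]


def union_sketches(sketches, k=K):
    """Toy union: pool the retained hashes, dedupe, keep the k smallest."""
    merged = set()
    for sketch in sketches:
        merged.update(sketch)
    return sorted(merged)[:k]


def estimate(sketch, k=K):
    """Exact count below capacity, otherwise the (k - 1) / kth-smallest estimator."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]


# Two small inputs with no overlap; both stay below capacity, so the
# union retains one hash per distinct value.
s1 = make_sketch([1, 2, 2, 3])
s2 = make_sketch([4, 5, 5, 6])
print(estimate(union_sketches([s1, s2])))
```

The property this is meant to show is that unioning two sketches gives the same result as sketching the combined data directly, which is why sketches can be built per partition or per table and merged later.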

Syntax

from pyspark.databricks.sql import functions as dbf

dbf.theta_union_agg(col=<col>, lgNomEntries=<lgNomEntries>)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| col | pyspark.sql.Column or column name | The column containing Theta sketches to union. |
| lgNomEntries | pyspark.sql.Column or int, optional | The log-base-2 of nominal entries for the union operation (must be between 4 and 26, defaults to 12). |
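Because lgNomEntries is a base-2 logarithm, the nominal entry count is 2 raised to that value, so the default of 12 corresponds to 4096 entries. The small helper below makes that arithmetic concrete. The function names are hypothetical, and the 1/sqrt(k) error figure is a rough rule of thumb assumed here for illustration, not a guarantee stated on this page.

```python
import math


def nominal_entries(lg_nom_entries=12):
    """Return k = 2**lgNomEntries after checking the documented 4..26 range."""
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 1 << lg_nom_entries


def approx_relative_error(lg_nom_entries=12):
    """Rule-of-thumb relative standard error, assumed to be about 1/sqrt(k)."""
    return 1.0 / math.sqrt(nominal_entries(lg_nom_entries))


# Compare a few settings: larger values retain more entries (more memory)
# in exchange for a smaller expected estimation error.
for lg in (4, 12, 16):
    print(lg, nominal_entries(lg), approx_relative_error(lg))
```

In practice the value is a trade-off: each step up doubles the number of retained entries.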

Returns

pyspark.sql.Column: The binary representation of the merged ThetaSketch.

Examples

from pyspark.databricks.sql import functions as dbf

# Build one Theta sketch per input DataFrame.
df1 = spark.createDataFrame([1, 2, 2, 3], "INT")
sketch1 = df1.agg(dbf.theta_sketch_agg("value").alias("sketch"))
df2 = spark.createDataFrame([4, 5, 5, 6], "INT")
sketch2 = df2.agg(dbf.theta_sketch_agg("value").alias("sketch"))

# Stack the two single-row sketch DataFrames, union the sketches, and estimate the distinct count.
sketches = sketch1.union(sketch2)
sketches.agg(dbf.theta_sketch_estimate(dbf.theta_union_agg("sketch"))).show()
+--------------------------------------------------+
|theta_sketch_estimate(theta_union_agg(sketch, 12))|
+--------------------------------------------------+
|                                                 6|
+--------------------------------------------------+