Returns a new Column that estimates the number of distinct elements in a specified column or group of columns.
Syntax
from pyspark.sql import functions as sf
sf.approx_count_distinct(col, rsd=None)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or column name | The label of the column to count distinct values in. |
| rsd | float, optional | The maximum allowed relative standard deviation (default = 0.05). If rsd < 0.01, it would be more efficient to use count_distinct. |
Returns
pyspark.sql.Column: A new Column object representing the approximate unique count.
Examples
Example 1: Counting distinct values in a single-column DataFrame of integers
from pyspark.sql import functions as sf
df = spark.createDataFrame([1,2,2,3], "int")
df.agg(sf.approx_count_distinct("value")).show()
+----------------------------+
|approx_count_distinct(value)|
+----------------------------+
|                           3|
+----------------------------+
Example 2: Counting distinct values in a single-column DataFrame of strings
from pyspark.sql import functions as sf
df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], ['fruit'])
df.agg(sf.approx_count_distinct("fruit")).show()
+----------------------------+
|approx_count_distinct(fruit)|
+----------------------------+
|                           3|
+----------------------------+
Example 3: Counting distinct values in a DataFrame with multiple columns
from pyspark.sql import functions as sf
df = spark.createDataFrame(
    [("Alice", 1), ("Alice", 2), ("Bob", 3), ("Bob", 3)], ["name", "value"])
df = df.withColumn("combined", sf.struct("name", "value"))
df.agg(sf.approx_count_distinct(df.combined)).show()
+-------------------------------+
|approx_count_distinct(combined)|
+-------------------------------+
|                              3|
+-------------------------------+
Example 4: Counting distinct values with a specified relative standard deviation
from pyspark.sql import functions as sf
spark.range(100000).agg(
    sf.approx_count_distinct("id").alias('with_default_rsd'),
    sf.approx_count_distinct("id", 0.1).alias('with_rsd_0.1')
).show()
+----------------+------------+
|with_default_rsd|with_rsd_0.1|
+----------------+------------+
|           95546|      102065|
+----------------+------------+
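To put the two estimates in Example 4 in perspective, the following plain-Python sketch computes how far each one is from the true count. It assumes the true distinct count is 100000, since spark.range(100000) produces one row per id. The helper relative_error is a hypothetical name introduced here for illustration and is not part of PySpark.

```python
# Relative error of the two estimates printed in Example 4.
# Assumption: spark.range(100000) yields exactly 100000 distinct ids.
true_count = 100_000

estimates = {
    "with_default_rsd": 95546,  # rsd left at its default of 0.05
    "with_rsd_0.1": 102065,     # rsd explicitly set to 0.1
}


def relative_error(estimate, truth):
    """Absolute deviation from the true count, as a fraction of the true count."""
    return abs(estimate - truth) / truth


errors = {
    name: relative_error(value, true_count)
    for name, value in estimates.items()
}

for name, err in errors.items():
    print(f"{name}: relative error = {err:.4f}")
```

In this particular run both errors are below the corresponding rsd value. Because rsd describes a standard deviation rather than a hard bound, a single estimate is not guaranteed to stay within it, and a looser rsd does not necessarily give a worse result on any one dataset.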