Collects the values from a column into a set, eliminating duplicates, and returns that set. The function is non-deterministic because the order of the collected results depends on the order of the rows, which may itself be non-deterministic after a shuffle operation.
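To make the ordering caveat concrete, here is a minimal plain-Python sketch of the semantics. It is not Spark's implementation: `collect_set_model` is a hypothetical helper, and it assumes first-seen order purely for illustration, whereas the order Spark actually returns is an implementation detail. It shows that the same values arriving in a different row order can produce a differently ordered result, and that sorting removes the difference.

```python
def collect_set_model(rows):
    """Hypothetical stand-in for collect_set: drop duplicates,
    keeping values in the order they are first seen (an assumption
    made for illustration, not Spark's actual ordering)."""
    seen = set()
    out = []
    for value in rows:
        if value not in seen:
            seen.add(value)
            out.append(value)
    return out


# Same values, two different row orders (as could happen after a shuffle).
order_a = collect_set_model([1, 2, 2])
order_b = collect_set_model([2, 2, 1])

print(order_a)  # [1, 2]
print(order_b)  # [2, 1]

# Sorting makes the two results comparable regardless of row order.
print(sorted(order_a) == sorted(order_b))  # True
```

This is why the examples on this page wrap `collect_set` in `sort_array` before displaying the result.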
Syntax
from pyspark.sql import functions as sf
sf.collect_set(col)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or column name | The target column on which the function is computed. |
Returns
pyspark.sql.Column: A new Column object representing a set of collected values, duplicates excluded.
Examples
Example 1: Collect values from a DataFrame and sort the result in ascending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
df.select(sf.sort_array(sf.collect_set('value')).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [1, 2]|
+----------+
Example 2: Collect values from a DataFrame and sort the result in descending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
df.select(sf.sort_array(sf.collect_set('age'), asc=False).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
|    [5, 2]|
+----------+
Example 3: Collect values from a DataFrame with multiple columns and sort the result
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set'))
df.orderBy(sf.desc("name")).show()
+----+----------+
|name|sorted_set|
+----+----------+
|John|    [1, 2]|
| Ana|       [3]|
+----+----------+