Σημείωση
Η πρόσβαση σε αυτή τη σελίδα απαιτεί εξουσιοδότηση. Μπορείτε να δοκιμάσετε να συνδεθείτε ή να αλλάξετε καταλόγους.
Η πρόσβαση σε αυτή τη σελίδα απαιτεί εξουσιοδότηση. Μπορείτε να δοκιμάσετε να αλλάξετε καταλόγους.
Collects the values from a column into a set, eliminating duplicates, and returns this set of objects. This function is non-deterministic as the order of collected results depends on the order of the rows, which may be non-deterministic after any shuffle operations.
Syntax
from pyspark.sql import functions as sf
sf.collect_set(col)
Parameters
| Parameter | Type | Description |
|---|---|---|
col |
pyspark.sql.Column or column name |
The target column on which the function is computed. |
Returns
pyspark.sql.Column: A new Column object representing a set of collected values, duplicates excluded.
Examples
Example 1: Collect values from a DataFrame and sort the result in ascending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
df.select(sf.sort_array(sf.collect_set('value')).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
| [1, 2]|
+----------+
Example 2: Collect values from a DataFrame and sort the result in descending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
df.select(sf.sort_array(sf.collect_set('age'), asc=False).alias('sorted_set')).show()
+----------+
|sorted_set|
+----------+
| [5, 2]|
+----------+
Example 3: Collect values from a DataFrame with multiple columns and sort the result
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
df = df.groupBy("name").agg(sf.sort_array(sf.collect_set('id')).alias('sorted_set'))
df.orderBy(sf.desc("name")).show()
+----+----------+
|name|sorted_set|
+----+----------+
|John| [1, 2]|
| Ana| [3]|
+----+----------+