Collects the values from a column into a list, keeping duplicates, and returns that list of objects. The function is non-deterministic because the order of the collected results depends on the order of the rows, which may itself be non-deterministic after a shuffle operation.
Syntax
from pyspark.sql import functions as sf
sf.collect_list(col)
Parameters
| Parameter | Type | Description |
|---|---|---|
| col | pyspark.sql.Column or column name | The target column on which the function is computed. |
Returns
pyspark.sql.Column: A new Column object representing a list of collected values, with duplicate values included.
Examples
Example 1: Collect values from a DataFrame and sort the result in ascending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1,), (2,), (2,)], ('value',))
df.select(sf.sort_array(sf.collect_list('value')).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [1, 2, 2]|
+-----------+
Example 2: Collect values from a DataFrame and sort the result in descending order
from pyspark.sql import functions as sf
df = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
df.select(sf.sort_array(sf.collect_list('age'), asc=False).alias('sorted_list')).show()
+-----------+
|sorted_list|
+-----------+
|  [5, 5, 2]|
+-----------+
Example 3: Collect values from a DataFrame with multiple columns and sort the result
from pyspark.sql import functions as sf
df = spark.createDataFrame([(1, "John"), (2, "John"), (3, "Ana")], ("id", "name"))
df = df.groupBy("name").agg(sf.sort_array(sf.collect_list('id')).alias('sorted_list'))
df.orderBy(sf.desc("name")).show()
+----+-----------+
|name|sorted_list|
+----+-----------+
|John|     [1, 2]|
| Ana|        [3]|
+----+-----------+