shuffle

Returns a random permutation of the given array. The function is non-deterministic: the order of the output array can differ from one execution to the next.

Syntax

from pyspark.sql import functions as sf

sf.shuffle(col, seed=None)

Parameters

Parameter  Type                                 Description
col        pyspark.sql.Column or str            The name of the column or expression to be shuffled.
seed       pyspark.sql.Column or int, optional  Seed value for the random generator.
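
To illustrate what a seed does for a random permutation in general, here is a minimal plain-Python sketch. It is not Spark's implementation, and the hypothetical helper `shuffle_sketch` will not reproduce the element order that `sf.shuffle` returns for the same seed; it only shows that fixing the seed makes the permutation repeatable while the elements themselves are unchanged.

```python
import random

def shuffle_sketch(values, seed=None):
    """Return a new list with the elements of `values` in random order.

    Hypothetical illustration only; not how Spark implements shuffle.
    """
    rng = random.Random(seed)   # seed=None lets Python pick a fresh seed
    result = list(values)       # copy so the input list is left untouched
    rng.shuffle(result)         # shuffles the copy in place
    return result

data = [1, 20, 3, 5]
first = shuffle_sketch(data, seed=123)
second = shuffle_sketch(data, seed=123)
print(first == second)                # True: same seed, same order
print(sorted(first) == sorted(data))  # True: same elements, only the order differs
```

Calling `shuffle_sketch(data)` without a seed may return a different order on each call, which mirrors the non-deterministic behavior described above.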

Returns

pyspark.sql.Column: A new column that contains an array of elements in random order.
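
Because the returned array holds the same elements as the input, including nulls and duplicates, a result can be checked by comparing element counts rather than order. The sketch below assumes nothing beyond the standard library for the comparison. `is_permutation` and `check_shuffle` are hypothetical helper names, and `check_shuffle` assumes an active `SparkSession` is passed in.

```python
from collections import Counter

def is_permutation(original, shuffled):
    """True when both lists hold the same elements with the same multiplicities."""
    return Counter(original) == Counter(shuffled)

def check_shuffle(spark):
    """Hypothetical helper: shuffle one array with Spark and verify the result."""
    # Imported lazily so is_permutation can be used without Spark installed.
    import pyspark.sql.functions as sf
    df = spark.sql("SELECT ARRAY(1, 2, 2, 3, 3, 3) AS data")
    row = df.select("data", sf.shuffle("data").alias("shuffled")).first()
    return is_permutation(row["data"], row["shuffled"])

# Input/output pairs taken from Examples 2 and 3 on this page
print(is_permutation([1, 20, None, 5], [None, 5, 20, 1]))      # True
print(is_permutation([1, 2, 2, 3, 3, 3], [2, 3, 3, 1, 2, 3]))  # True
```

Comparing counts instead of sorting keeps the check valid for arrays that contain `None`, which Python cannot order against integers.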

Examples

Example 1: Shuffling a simple array

import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 20, 3, 5) AS data")
df.select("*", sf.shuffle(df.data, sf.lit(123))).show()
+-------------+-------------+
|         data|shuffle(data)|
+-------------+-------------+
|[1, 20, 3, 5]|[5, 1, 20, 3]|
+-------------+-------------+

Example 2: Shuffling an array with null values

import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 20, NULL, 5) AS data")
df.select("*", sf.shuffle(sf.col("data"), 234)).show()
+----------------+----------------+
|            data|   shuffle(data)|
+----------------+----------------+
|[1, 20, NULL, 5]|[NULL, 5, 20, 1]|
+----------------+----------------+

Example 3: Shuffling an array with duplicate values

import pyspark.sql.functions as sf
df = spark.sql("SELECT ARRAY(1, 2, 2, 3, 3, 3) AS data")
df.select("*", sf.shuffle("data", 345)).show()
+------------------+------------------+
|              data|     shuffle(data)|
+------------------+------------------+
|[1, 2, 2, 3, 3, 3]|[2, 3, 3, 1, 2, 3]|
+------------------+------------------+