percentile

Aggregate function: returns the exact percentile(s) of the numeric column `col` at the given percentage(s). Each percentage must be a value in the range [0.0, 1.0].

Syntax

from pyspark.sql import functions as sf

sf.percentile(col, percentage, frequency=1)

Parameters

col : pyspark.sql.Column or str
    The numeric column to compute percentiles of.
percentage : pyspark.sql.Column, float, or list/tuple of floats
    Percentage(s) in decimal form; each must be between 0.0 and 1.0.
frequency : pyspark.sql.Column or int, optional
    A positive integral literal that controls the frequency (weight) of each row (default: 1).

Returns

pyspark.sql.Column: the exact percentile of the numeric column.
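To make the "exact percentile" semantics concrete, here is a minimal pure-Python sketch of the computation: values are weighted by `frequency` and the percentile is linearly interpolated at rank `p * (n - 1)` over the sorted data. This mirrors the documented behavior as understood here, it is an illustration rather than Spark's actual implementation.

```python
def exact_percentile(values, p, frequency=1):
    """Illustrative sketch of exact-percentile semantics (not Spark's code).

    Each value is repeated `frequency` times (a positive integer weight),
    then the percentile is linearly interpolated at rank p * (n - 1).
    """
    data = sorted(v for v in values for _ in range(frequency))
    n = len(data)
    pos = p * (n - 1)          # fractional rank in the sorted data
    lo = int(pos)              # lower neighboring rank
    hi = min(lo + 1, n - 1)    # upper neighboring rank (clamped)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

exact_percentile(range(10), 0.5)   # 4.5, the exact median of 0..9
```

With this interpolation, the median of an even-sized range falls between the two middle values, which is why exact percentiles can return non-integer results on integer columns.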

Examples

Example 1: Calculate multiple percentiles

from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.select(
    sf.percentile("value", [0.25, 0.5, 0.75], sf.lit(1))
).show(truncate=False)
+--------------------------------------------------------+
|percentile(value, array(0.25, 0.5, 0.75), 1)            |
+--------------------------------------------------------+
|[0.7441991494121..., 9.9900713756..., 19.33740203080...]|
+--------------------------------------------------------+

Example 2: Calculate percentile by group

from pyspark.sql import functions as sf
key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.groupBy("key").agg(
    sf.percentile("value", sf.lit(0.5), sf.lit(1))
).sort("key").show()
+---+-------------------------+
|key|percentile(value, 0.5, 1)|
+---+-------------------------+
|  0|     -0.03449962216667...|
|  1|        9.990389751837...|
|  2|       19.967859769284...|
+---+-------------------------+