Returns the exact percentile(s) of the numeric column `col` at the given percentage(s); each percentage must be in the range [0.0, 1.0].
Syntax
```python
from pyspark.sql import functions as sf

sf.percentile(col, percentage, frequency=1)
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `col` | `pyspark.sql.Column` or str | The numeric column. |
| `percentage` | `pyspark.sql.Column`, float, list of floats, or tuple of floats | Percentage(s) in decimal; each value must be between 0.0 and 1.0. |
| `frequency` | `pyspark.sql.Column` or int | A positive numeric literal which controls frequency (default: 1). |
Returns
pyspark.sql.Column: the exact percentile of the numeric column.
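The exact percentile is computed by sorting the values and linearly interpolating between adjacent ranks. A minimal pure-Python sketch of these semantics follows; the helper name `exact_percentile` and the uniform-frequency simplification are illustrative assumptions, not part of the PySpark API:

```python
def exact_percentile(values, percentage, frequency=1):
    """Exact percentile with linear interpolation between adjacent ranks.

    A hypothetical helper mirroring the semantics of sf.percentile;
    `frequency` here is a single uniform repeat count (a simplification --
    in Spark, frequency can be a per-row column).
    """
    data = sorted(values)
    if frequency > 1:
        # Repeat each value `frequency` times before ranking.
        data = [v for v in data for _ in range(frequency)]
    n = len(data)
    pos = percentage * (n - 1)   # fractional rank in [0, n - 1]
    lo = int(pos)                # lower neighboring rank
    hi = min(lo + 1, n - 1)      # upper neighboring rank
    frac = pos - lo              # interpolation weight
    return data[lo] + frac * (data[hi] - data[lo])

print(exact_percentile([1, 2, 3, 4], 0.5))  # 2.5: midway between ranks 1 and 2
```

With `percentage = 0.5` this is the exact median; Spark's `frequency` column weights each row individually, which the single uniform `frequency` above only approximates.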
Examples
Example 1: Calculate multiple percentiles
```python
from pyspark.sql import functions as sf

key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.select(
    sf.percentile("value", [0.25, 0.5, 0.75], sf.lit(1))
).show(truncate=False)
```

```
+--------------------------------------------------------+
|percentile(value, array(0.25, 0.5, 0.75), 1)            |
+--------------------------------------------------------+
|[0.7441991494121..., 9.9900713756..., 19.33740203080...]|
+--------------------------------------------------------+
```
Example 2: Calculate percentile by group
```python
from pyspark.sql import functions as sf

key = (sf.col("id") % 3).alias("key")
value = (sf.randn(42) + key * 10).alias("value")
df = spark.range(0, 1000, 1, 1).select(key, value)
df.groupBy("key").agg(
    sf.percentile("value", sf.lit(0.5), sf.lit(1))
).sort("key").show()
```

```
+---+-------------------------+
|key|percentile(value, 0.5, 1)|
+---+-------------------------+
|  0|     -0.03449962216667...|
|  1|        9.990389751837...|
|  2|       19.967859769284...|
+---+-------------------------+
```