Utility functions for defining a window in DataFrames.
Supports Spark Connect.
Class attributes
| Attribute | Description |
|---|---|
| `unboundedPreceding` | Boundary value representing the start of an unbounded window frame. |
| `unboundedFollowing` | Boundary value representing the end of an unbounded window frame. |
| `currentRow` | Boundary value representing the current row in a window frame. |
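These attributes are plain integer sentinels; in current PySpark they correspond to Java's `Long.MIN_VALUE`, `Long.MAX_VALUE`, and `0` (worth verifying against your installed version). A minimal plain-Python sketch of the assumed values:

```python
# Assumed sentinel values mirroring PySpark's Window class attributes
# (Java Long.MIN_VALUE / Long.MAX_VALUE / 0); verify against your version.
unboundedPreceding = -(1 << 63)
unboundedFollowing = (1 << 63) - 1
currentRow = 0

# Any concrete offset a user passes falls strictly between the two unbounded
# sentinels, so the boundaries compose with rowsBetween/rangeBetween.
assert unboundedPreceding < currentRow < unboundedFollowing
```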
Methods
| Method | Description |
|---|---|
| `orderBy(*cols)` | Creates a WindowSpec with the ordering defined. |
| `partitionBy(*cols)` | Creates a WindowSpec with the partitioning defined. |
| `rangeBetween(start, end)` | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive), using range-based offsets from the current row's ORDER BY value. |
| `rowsBetween(start, end)` | Creates a WindowSpec with the frame boundaries defined, from start (inclusive) to end (inclusive), using row-based offsets from the current row. |
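The difference between `rowsBetween` and `rangeBetween` is easiest to see on data with duplicate ORDER BY values. A plain-Python emulation of the two frame types (an illustrative sketch, not PySpark code) over a sorted list:

```python
def frame_sum_rows(values, start, end):
    """Row-based frame: start/end are physical row offsets from the current row."""
    n = len(values)
    return [
        sum(values[max(0, i + start): min(n, i + end + 1)])
        for i in range(n)
    ]

def frame_sum_range(values, start, end):
    """Range-based frame: start/end are added to the current row's ORDER BY value."""
    return [
        sum(x for x in values if v + start <= x <= v + end)
        for v in values
    ]

vals = [1, 1, 2]                    # note the duplicate ORDER BY value
print(frame_sum_rows(vals, 0, 1))   # current row + next physical row: [2, 3, 2]
print(frame_sum_range(vals, 0, 1))  # all rows with value in [v, v + 1]: [4, 4, 2]
```

With row-based offsets the duplicate values get different sums; with range-based offsets they are peers and share the same frame, hence the same sum.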
Notes
When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
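To illustrate the note above, a plain-Python sketch (not PySpark code) of the two default frames: without ordering, every row aggregates over the whole partition; with ordering, each row sees a growing range frame from the partition start up to and including its peers (rows with the same ORDER BY value).

```python
def unordered_default(values):
    # rowFrame, unboundedPreceding -> unboundedFollowing:
    # every row aggregates over the entire partition.
    total = sum(values)
    return [total] * len(values)

def ordered_default(values):
    # rangeFrame, unboundedPreceding -> currentRow: a growing frame;
    # range semantics mean peers (equal ORDER BY values) share a result.
    return [sum(x for x in values if x <= v) for v in values]

vals = sorted([1, 1, 2])
print(unordered_default(vals))  # [4, 4, 4]
print(ordered_default(vals))    # [2, 2, 4]
```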
Examples
Basic window with ordering and row frame

```python
from pyspark.sql import Window

# ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
window = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
```
Partitioned window with range frame

```python
from pyspark.sql import Window

# PARTITION BY country ORDER BY date RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING
window = Window.partitionBy("country").orderBy("date").rangeBetween(-3, 3)
```
Row number within partition

```python
from pyspark.sql import SparkSession, Window, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Show row number ordered by id within each category partition
window = Window.partitionBy("category").orderBy("id")
df.withColumn("row_number", sf.row_number().over(window)).show()
```
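For reference, `row_number` simply numbers rows 1..n within each partition in ORDER BY order. A plain-Python sketch of the same logic on the data above (an illustration, not the PySpark implementation):

```python
from itertools import groupby

rows = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]

numbered = []
# Partition by category, order by id, then number rows 1..n per partition.
for _, group in groupby(sorted(rows, key=lambda r: (r[1], r[0])), key=lambda r: r[1]):
    for i, (id_, cat) in enumerate(group, start=1):
        numbered.append((id_, cat, i))

print(numbered)
# [(1, 'a', 1), (1, 'a', 2), (2, 'a', 3), (1, 'b', 1), (2, 'b', 2), (3, 'b', 3)]
```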
Running sum with row-based frame

```python
from pyspark.sql import SparkSession, Window, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Sum id values from the current row to the next row within each partition
window = Window.partitionBy("category").orderBy("id").rowsBetween(Window.currentRow, 1)
df.withColumn("sum", sf.sum("id").over(window)).sort("id", "category", "sum").show()
```
Running sum with range-based frame

```python
from pyspark.sql import SparkSession, Window, functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")], ["id", "category"]
)

# Sum id values from the current id value to id + 1 within each partition
window = Window.partitionBy("category").orderBy("id").rangeBetween(Window.currentRow, 1)
df.withColumn("sum", sf.sum("id").over(window)).sort("id", "category").show()
```
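As a sanity check, the range frame above can be replayed in plain Python (an illustration, not PySpark): for each row, sum every id in the same category whose value lies in [id, id + 1].

```python
rows = [(1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b")]

def range_sum(row):
    id_, cat = row
    # Range frame [currentRow, 1]: ids in [id_, id_ + 1] within the partition.
    return sum(i for i, c in rows if c == cat and id_ <= i <= id_ + 1)

result = sorted((i, c, range_sum((i, c))) for i, c in rows)
print(result)
# [(1, 'a', 4), (1, 'a', 4), (1, 'b', 3), (2, 'a', 2), (2, 'b', 5), (3, 'b', 3)]
```

Note that the two duplicate rows (1, "a") are range-frame peers and therefore get the same sum, 4.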