repartitionById

Returns a new DataFrame partitioned by the given partitioning expressions. Rows are assigned to partitions according to the values of the specified columns.

Syntax

repartitionById(numPartitions: int, *cols: "ColumnOrName")

Parameters

Parameter	Type	Description
`numPartitions`	int	The target number of partitions.
`cols`	str or Column	The partitioning columns.

Returns

DataFrame: Repartitioned DataFrame.

Notes

At least one partition-by expression must be specified. Rows are distributed across partitions in the same way as with `repartition`, but the ordering of the rows within each partition is preserved.

This is an experimental API.
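
To make the ordering note above concrete, here is a minimal pure-Python sketch of what "same distribution, order preserved within each partition" means. It does not call Spark. The helpers `assign_partition` and `partition_rows` are hypothetical names introduced only for illustration, and the CRC32-based assignment is an assumption chosen to be deterministic; it is not how Spark picks partitions.

```python
import zlib
from collections import defaultdict


def assign_partition(key, num_partitions):
    """Hypothetical stand-in for a partitioner: map a key to a partition index.

    CRC32 is used only so the result is stable across runs. Spark's own
    assignment logic is different and is not modeled here.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_rows(rows, key_index, num_partitions):
    """Group rows into partitions by key, keeping input order within each one."""
    partitions = defaultdict(list)
    # A single pass in input order means rows are appended to their
    # partition in the same relative order they arrived in.
    for row in rows:
        pid = assign_partition(row[key_index], num_partitions)
        partitions[pid].append(row)
    return dict(partitions)


rows = [(14, "Tom"), (23, "Alice"), (16, "Bob"), (18, "Alice"), (21, "Alice")]
parts = partition_rows(rows, key_index=1, num_partitions=2)

for pid in sorted(parts):
    print(pid, parts[pid])
```

In this model, all rows that share a key land in the same partition, and the three "Alice" rows keep their input order (23, 18, 21) inside it. Which partition index each key receives depends on the assumed hash and should not be read as Spark's behavior.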

Examples

# Assumes an active SparkSession bound to the name `spark`.
from pyspark.sql import functions as sf
spark.createDataFrame(
    [(14, "Tom"), (23, "Alice"), (16, "Bob"), (18, "Alice"), (21, "Alice")],
    ["age", "name"]
).repartitionById(2, "name").select(
    "age", "name", sf.spark_partition_id()
).show()
# +---+-----+--------------------+
# |age| name|SPARK_PARTITION_ID()|
# +---+-----+--------------------+
# | 14|  Tom|                   0|
# | 23|Alice|                   1|
# | 18|Alice|                   1|
# | 21|Alice|                   1|
# | 16|  Bob|                   0|
# +---+-----+--------------------+

Last updated on 2026-04-17