dropDuplicates

Returns a new DataFrame with duplicate rows removed, optionally considering only certain columns.

Syntax

dropDuplicates(subset: Optional[List[str]] = None)

Parameters

Parameter	Type	Description
`subset`	list of column names, optional	Columns to compare when identifying duplicates (default: all columns).

Returns

DataFrame: a DataFrame with duplicate rows removed.

Notes

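For a static batch DataFrame, it simply drops duplicate rows. For a streaming DataFrame, it keeps all data across triggers as intermediate state so that duplicate rows can be dropped. You can use withWatermark to limit how late duplicate data can arrive, and the system will limit the state accordingly. In addition, data older than the watermark is dropped to avoid any possibility of duplicates.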
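The interaction between the watermark and the deduplication state can be easier to follow with a toy model. The sketch below is plain Python, not Spark code and not Spark's implementation: the class `WatermarkedDeduplicator`, its fields, and the `(key, event_time)` event shape are all hypothetical names chosen for illustration, and it assumes non-negative integer event times and a watermark that advances only at the end of each batch.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Tuple

# Hypothetical event shape: (deduplication key, event time in seconds).
Event = Tuple[Hashable, int]


@dataclass
class WatermarkedDeduplicator:
    """Toy model of watermark-bounded streaming deduplication (assumption, not Spark internals)."""

    delay: int                  # allowed lateness in seconds
    watermark: int = 0          # events older than this are dropped
    max_event_time: int = 0     # largest event time observed so far
    seen: Dict[Hashable, int] = field(default_factory=dict)  # key -> event time

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        emitted: List[Event] = []
        for key, event_time in batch:
            self.max_event_time = max(self.max_event_time, event_time)
            if event_time < self.watermark:
                continue  # older than the watermark: dropped as too late
            if key in self.seen:
                continue  # key already in state: dropped as a duplicate
            self.seen[key] = event_time
            emitted.append((key, event_time))

        # Advance the watermark once per batch; it never moves backwards.
        self.watermark = max(self.watermark, self.max_event_time - self.delay)
        # Evict state older than the watermark, which keeps the state bounded.
        self.seen = {k: t for k, t in self.seen.items() if t >= self.watermark}
        return emitted


dedup = WatermarkedDeduplicator(delay=10)
first = dedup.process_batch([("a", 100), ("a", 100), ("b", 105)])
second = dedup.process_batch([("a", 100), ("c", 90), ("d", 130)])
third = dedup.process_batch([("a", 100)])
print(first, second, third, dedup.seen)
```

In this model the key `"a"` is evicted from state after the second batch, yet its late copy in the third batch is still not emitted because it falls behind the watermark. That is the trade-off the note describes: bounded state in exchange for dropping late data. In PySpark itself the corresponding call chain would look something like `events.withWatermark("event_time", "10 minutes").dropDuplicates(["id", "event_time"])`, where `events` and both column names are hypothetical.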

Examples

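from pyspark.sql import Row, SparkSession

# Reuse the active SparkSession, or create one if none exists.
spark = SparkSession.builder.getOrCreate()
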
df = spark.createDataFrame([
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=5, height=80),
    Row(name='Alice', age=10, height=80)
])

df.dropDuplicates().show()
# +-----+---+------+
# | name|age|height|
# +-----+---+------+
# |Alice|  5|    80|
# |Alice| 10|    80|
# +-----+---+------+

df.dropDuplicates(['name', 'height']).show()
# +-----+---+------+
# | name|age|height|
# +-----+---+------+
# |Alice|  5|    80|
# +-----+---+------+

Last updated on 2026-04-19
