How to maintain uniqueness in a DataFrame in Databricks

Abhishek Gaikwad 191 Reputation points
2020-10-13T13:01:40.6+00:00

When you load data into DataFrames in Databricks, how can you make sure the rows are not duplicated?
In SQL you can handle this with a unique constraint on the table. How can this be handled in DataFrames to ensure rows are not duplicated?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

2 answers

  1. Nandan CB 1 Reputation point
    2020-10-13T13:04:36.937+00:00

    If df is the name of your DataFrame, there are two ways to get unique rows:

    df2 = df.distinct()

    or

    df2 = df.drop_duplicates()
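
    drop_duplicates (also spelled dropDuplicates) additionally accepts a subset of columns, which is useful when rows should be unique on a key rather than across every column. A minimal sketch (the column names are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample data with one fully duplicated row (illustrative columns)
    df = spark.createDataFrame(
        [(1, "a"), (1, "a"), (2, "b")],
        ["id", "value"],
    )

    # Keep only rows that are unique across all columns
    df2 = df.distinct()

    # Keep one row per value of the "id" column
    df3 = df.dropDuplicates(["id"])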


  2. Florent Pousserot 6 Reputation points
    2020-10-13T21:39:13.643+00:00

    Hi,

    There is no primary-key enforcement mechanism in Delta, but you can create surrogate keys in different ways:

    1) This article walks through one approach, for example:

    https://www.linkedin.com/pulse/creating-surrogate-keys-databricks-delta-using-spark-sql-patel/
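
    For instance, here is a minimal sketch of one way to build such a key in PySpark, by hashing the business columns (the column names are illustrative, and the linked article shows its own Spark SQL variant):

    from pyspark.sql import functions as F

    # Derive a deterministic surrogate key from the business columns
    df_with_key = df.withColumn(
        "uniqueId",
        F.sha2(F.concat_ws("||", "id", "value"), 256),
    )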

    2) Once this key is created, you can use the MERGE operation:

    https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge

    If your use case is "data deduplication when writing into Delta tables", then with MERGE you can avoid inserting the duplicate records:

    MERGE INTO logs
    USING newDedupedLogs
    ON logs.uniqueId = newDedupedLogs.uniqueId
    WHEN NOT MATCHED
      THEN INSERT *
    

    https://docs.delta.io/latest/delta-update.html#data-deduplication-when-writing-into-delta-tables
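
    The same upsert can also be written with the Delta Lake Python API instead of SQL. A sketch, assuming logs is a Delta table registered in the metastore and newDedupedLogs is a DataFrame:

    from delta.tables import DeltaTable

    logs = DeltaTable.forName(spark, "logs")

    # Insert only the rows whose uniqueId is not already in the table
    (logs.alias("logs")
        .merge(
            newDedupedLogs.alias("newDedupedLogs"),
            "logs.uniqueId = newDedupedLogs.uniqueId")
        .whenNotMatchedInsertAll()
        .execute())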
