How to maintain uniqueness in dataframe in databricks

Abhishek Gaikwad 191 Reputation points
2020-10-13T13:01:40.6+00:00

When you load data into DataFrames in Databricks, how can you make sure the rows in the DataFrames are not duplicated?
In SQL you can handle this with a unique constraint on the table. How can this be handled in DataFrames to ensure rows are not duplicated?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

2 answers

Sort by: Most helpful
  1. Nandan CB 1 Reputation point
    2020-10-13T13:04:36.937+00:00

    If df is the name of your DataFrame, there are two ways to get unique rows:

    df2 = df.distinct()

    or

    df2 = df.dropDuplicates()

    dropDuplicates (also spelled drop_duplicates) additionally accepts a subset of columns to deduplicate on, whereas distinct() always compares entire rows.


  2. Florent Pousserot 6 Reputation points
    2020-10-13T21:39:13.643+00:00

    Hi,

    There is no primary key constraint enforcement in Delta, but you can create surrogate keys in different ways.

    1°) This article is a good example:

    https://www.linkedin.com/pulse/creating-surrogate-keys-databricks-delta-using-spark-sql-patel/

    2°) Once this key is created, you can use the merge function:

    https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge

    If your use case is "Data deduplication when writing into Delta tables", then with merge you can avoid inserting duplicate records.

    MERGE INTO logs
    USING newDedupedLogs
    ON logs.uniqueId = newDedupedLogs.uniqueId
    WHEN NOT MATCHED
      THEN INSERT *
    

    https://docs.delta.io/latest/delta-update.html#data-deduplication-when-writing-into-delta-tables
