Azure Databricks PySpark: combine two dataframes

Pankaj Joshi 391 Reputation points
2024-08-04T06:21:28.9833333+00:00

I have two dataframes in a Databricks notebook, df_1 and df_2.

I want the output from these two dataframes as df_final, as shown in the image below.

It should return the matching rows (by ID) from df_1 and only the non-matching rows from df_2.

Please let me know how I can achieve this in Databricks using PySpark. Should I use a join, a filter, or subtract?

[User's image: df_1, df_2, and the expected df_final]


Accepted answer
  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2024-08-05T04:45:32.9566667+00:00

    @Pankaj Joshi - Thanks for the question and for using the MS Q&A platform.
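
    For reference, since the screenshots are not reproduced here, a minimal pair of dataframes consistent with your description might look like the following (the schemas and all values other than the IDs and names discussed below are assumptions):

    df_1 = spark.createDataFrame(
        [(100, "mike", 1), (102, "john", 2), (104, "henry", 3)],
        ["ID", "Name", "Number"],
    )
    df_2 = spark.createDataFrame(
        [(100, "mike2", 10), (103, "sara", 20), (104, "henry2", 30)],
        ["ID", "Name", "Number"],
    )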

    Here is the PySpark code to return the matching rows (by ID) from df_1 and only the non-matching rows from df_2:

    # Inner join: rows whose ID appears in both dataframes (keeps columns from both sides)
    df_matching = df_1.join(df_2, on='ID', how='inner')
    # Left anti join: rows in df_2 whose ID does not appear in df_1
    df_nonmatching = df_2.join(df_1, on='ID', how='left_anti')
    
    df_result = df_matching.union(df_nonmatching)
    
    display(df_result)
    

    Note: When you run the above code, you will see an error message. The inner join keeps the Name and Number columns from both dataframes (for ID 100 it keeps both mike and mike2, and for ID 104 both henry and henry2), so df_matching has five columns while df_nonmatching has only three:

    Reason: UNION can only be performed on inputs with the same number of columns, but the first input has 5 columns and the second input has 3 columns.
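
    You can confirm the mismatch by comparing the column lists (a quick check, not part of the original answer):

    # df_matching keeps ID plus Name/Number from both sides -> 5 columns
    print(len(df_matching.columns), df_matching.columns)
    # df_nonmatching keeps only df_2's 3 columns
    print(len(df_nonmatching.columns), df_nonmatching.columns)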

    A slight modification to the code makes it work as expected:

    Explanation:

    • An inner join is used to find rows that have a matching ID in both dataframes; selecting df_1["*"] keeps only df_1's columns.
    • A left_anti join is used to find rows in df_2 that do not have a matching ID in df_1.
    • union is used to combine the matching rows from df_1 with the non-matching rows from df_2.

    This should give you the desired output: df_final contains the matching rows from df_1 (based on ID) and the non-matching rows from df_2.

    # Perform an inner join to get matching rows based on the ID
    df_matching = df_1.join(df_2, on="ID", how="inner").select(df_1["*"])
    # Get the non-matching rows from df_2 using a left anti join
    df_non_matching = df_2.join(df_1, on="ID", how="left_anti")
    # Union the matching rows from df_1 with the non-matching rows from df_2
    df_final = df_matching.union(df_non_matching)
    # Show the final dataframe
    df_final.show()
    
    

    [User's image: the resulting df_final]
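
    As a side note (not part of the original answer), unionByName can be used instead of union so that columns are matched by name rather than by position, which guards against column-order differences between the two inputs:

    # unionByName aligns columns by name instead of position
    df_final = df_matching.unionByName(df_non_matching)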

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

    1 person found this answer helpful.

1 additional answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2024-08-04T14:17:27.88+00:00

    You can use an outer join() followed by a null filter in PySpark. Note that joining on "ID" does not add _1/_2 suffixes automatically in PySpark, so the non-key columns need to be aliased first. Try the code below and let me know if it works:

    # Assumes both dataframes have ID, Name, and Number columns
    # Alias the non-key columns so the two sides stay distinguishable after the join
    df_1_r = df_1.select("ID", df_1["Name"].alias("Name_1"), df_1["Number"].alias("Number_1"))
    df_2_r = df_2.select("ID", df_2["Name"].alias("Name_2"), df_2["Number"].alias("Number_2"))
    # Join df_1 and df_2 on the ID column
    df_joined = df_1_r.join(df_2_r, on="ID", how="outer")
    # Get non-matching rows from df_2 (no df_1 side after the outer join)
    df_nonmatching = df_joined.where(df_joined["Name_1"].isNull())
    # Select the desired columns
    df_final = df_nonmatching.select("ID", "Name_2", "Number_2")
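
    Since the question also asks about subtract, here is a minimal sketch (assuming both dataframes have an ID key column and identical schemas) that builds the same df_final with subtract() and semi joins:

    # IDs that appear in df_2 but not in df_1
    missing_ids = df_2.select("ID").subtract(df_1.select("ID"))
    # Full df_2 rows for those IDs
    df_nonmatching = df_2.join(missing_ids, on="ID", how="left_semi")
    # df_1 rows whose ID also appears in df_2
    df_matching = df_1.join(df_2.select("ID"), on="ID", how="left_semi")
    # Combine; unionByName requires both sides to share the same column names
    df_final = df_matching.unionByName(df_nonmatching)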
    
