Azure Databricks PySpark: combine two dataframes

Pankaj Joshi 391 Reputation points
2024-08-04T06:21:28.9833333+00:00

I have two dataframes in a Databricks notebook, df_1 and df_2.

I want the output from these two dataframes as df_final, as shown in the image below.

It should return the matching rows (by ID) from df_1 and only the non-matching rows from df_2.

Please let me know how I can achieve this in Databricks using PySpark. Should I use a join, a filter, or subtract?

[User's image: df_1, df_2, and the expected df_final]


Accepted answer
  1. PRADEEPCHEEKATLA 90,641 Reputation points Moderator
    2024-08-05T04:45:32.9566667+00:00

    @Pankaj Joshi - Thanks for the question and for using the MS Q&A platform.
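
    For reference, since the screenshots are not reproduced here, a minimal pair of dataframes consistent with your description might look like the following (the schemas and all values other than the IDs and names discussed below are assumptions):

    df_1 = spark.createDataFrame(
        [(100, "mike", 1), (102, "john", 2), (104, "henry", 3)],
        ["ID", "Name", "Number"],
    )
    df_2 = spark.createDataFrame(
        [(100, "mike2", 10), (103, "sara", 20), (104, "henry2", 30)],
        ["ID", "Name", "Number"],
    )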

    Here is the PySpark code to return the matching rows (by ID) from df_1 and only the non-matching rows from df_2:

    # Inner join: rows whose ID appears in both dataframes (keeps columns from both sides)
    df_matching = df_1.join(df_2, on='ID', how='inner')
    # Left anti join: rows in df_2 whose ID does not appear in df_1
    df_nonmatching = df_2.join(df_1, on='ID', how='left_anti')
    
    df_result = df_matching.union(df_nonmatching)
    
    display(df_result)
    

    Note: When you run the above code, you will see an error message. The inner join keeps the Name and Number columns from both dataframes (for ID 100 it keeps both mike and mike2, and for ID 104 both henry and henry2), so df_matching has five columns while df_nonmatching has only three:

    Reason: UNION can only be performed on inputs with the same number of columns, but the first input has 5 columns and the second input has 3 columns.
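
    You can confirm the mismatch by comparing the column lists (a quick check, not part of the original answer):

    # df_matching keeps ID plus Name/Number from both sides -> 5 columns
    print(len(df_matching.columns), df_matching.columns)
    # df_nonmatching keeps only df_2's 3 columns
    print(len(df_nonmatching.columns), df_nonmatching.columns)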

    A slight modification to the code makes it work as expected:

    Explanation:

    • An inner join is used to find rows that have a matching ID in both dataframes; selecting df_1["*"] keeps only df_1's columns.
    • A left_anti join is used to find rows in df_2 that do not have a matching ID in df_1.
    • union is used to combine the matching rows from df_1 with the non-matching rows from df_2.

    This should give you the desired output: df_final contains the matching rows from df_1 (based on ID) and the non-matching rows from df_2.

    # Perform an inner join to get matching rows based on the ID
    df_matching = df_1.join(df_2, on="ID", how="inner").select(df_1["*"])
    # Get the non-matching rows from df_2 using a left anti join
    df_non_matching = df_2.join(df_1, on="ID", how="left_anti")
    # Union the matching rows from df_1 with the non-matching rows from df_2
    df_final = df_matching.union(df_non_matching)
    # Show the final dataframe
    df_final.show()
    
    

    [User's image: the resulting df_final]
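
    As a side note (not part of the original answer), unionByName can be used instead of union so that columns are matched by name rather than by position, which guards against column-order differences between the two inputs:

    # unionByName aligns columns by name instead of position
    df_final = df_matching.unionByName(df_non_matching)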

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

    1 person found this answer helpful.

1 additional answer

  1. Vinodh247 34,661 Reputation points MVP Volunteer Moderator
    2024-08-04T14:17:27.88+00:00

    You can use an outer join() followed by a null filter in PySpark. Note that joining on "ID" does not add _1/_2 suffixes automatically in PySpark, so the non-key columns need to be aliased first. Try the code below and let me know if it works:

    # Assumes both dataframes have ID, Name, and Number columns
    # Alias the non-key columns so the two sides stay distinguishable after the join
    df_1_r = df_1.select("ID", df_1["Name"].alias("Name_1"), df_1["Number"].alias("Number_1"))
    df_2_r = df_2.select("ID", df_2["Name"].alias("Name_2"), df_2["Number"].alias("Number_2"))
    # Join df_1 and df_2 on the ID column
    df_joined = df_1_r.join(df_2_r, on="ID", how="outer")
    # Get non-matching rows from df_2 (no df_1 side after the outer join)
    df_nonmatching = df_joined.where(df_joined["Name_1"].isNull())
    # Select the desired columns
    df_final = df_nonmatching.select("ID", "Name_2", "Number_2")
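
    Since the question also asks about subtract, here is a minimal sketch (assuming both dataframes have an ID key column and identical schemas) that builds the same df_final with subtract() and semi joins:

    # IDs that appear in df_2 but not in df_1
    missing_ids = df_2.select("ID").subtract(df_1.select("ID"))
    # Full df_2 rows for those IDs
    df_nonmatching = df_2.join(missing_ids, on="ID", how="left_semi")
    # df_1 rows whose ID also appears in df_2
    df_matching = df_1.join(df_2.select("ID"), on="ID", how="left_semi")
    # Combine; unionByName requires both sides to share the same column names
    df_final = df_matching.unionByName(df_nonmatching)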
    
