@Pankaj Joshi - Thanks for the question and using MS Q&A platform.
Here is the PySpark code to return the matching rows (by ID) from df_1 and only the non-matching rows from df_2:
df_matching = df_1.join(df_2, on='ID', how='inner')
df_nonmatching = df_2.join(df_1, on='ID', how='left_anti')
df_result = df_matching.union(df_nonmatching)
display(df_result)
Note: If you run the above code as-is, you will see this error:
UNION can only be performed on inputs with the same number of columns, but the first input has 5 columns and the second input has 3 columns.
Reason: the inner join keeps the columns from both dataframes (your data has ID 100 as mike in df_1 and mike2 in df_2, and ID 104 as henry and henry2), so df_matching ends up with more columns than df_nonmatching.
A slight modification to the code makes it work as expected:
Explanation:
- `inner` join is used to find rows that have a matching `ID` in both dataframes.
- `left_anti` join is used to find rows in `df_2` that do not have a matching `ID` in `df_1`.
- `union` is used to combine the matching rows from `df_1` and the non-matching rows from `df_2`.
This should give you the desired output as `df_final`, containing the matching rows from `df_1` (based on `ID`) and the non-matching rows from `df_2`.
# Perform an inner join to get matching rows based on the ID
df_matching = df_1.join(df_2, on="ID", how="inner").select(df_1["*"])
# Get the non-matching rows from df_2 using a left anti join
df_non_matching = df_2.join(df_1, on="ID", how="left_anti")
# Union the matching rows from df_1 with the non-matching rows from df_2
df_final = df_matching.union(df_non_matching)
# Show the final dataframe
df_final.show()
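To illustrate the same inner + left_anti + union logic without needing a Spark session, here is a minimal plain-Python sketch on toy data; the sample rows are hypothetical, loosely following the ID 100/mike and ID 104/henry values mentioned above:

```python
# Hypothetical toy data standing in for df_1 and df_2
df_1 = [{"ID": 100, "name": "mike"},
        {"ID": 104, "name": "henry"},
        {"ID": 101, "name": "alice"}]
df_2 = [{"ID": 100, "name": "mike2"},
        {"ID": 104, "name": "henry2"},
        {"ID": 200, "name": "zoe"}]

ids_1 = {r["ID"] for r in df_1}
ids_2 = {r["ID"] for r in df_2}

# "inner" join on ID, keeping only df_1's columns (like .select(df_1["*"]))
matching = [r for r in df_1 if r["ID"] in ids_2]

# "left_anti": rows of df_2 whose ID does not appear in df_1
non_matching = [r for r in df_2 if r["ID"] not in ids_1]

# "union": concatenate the two result sets
final = matching + non_matching
for row in final:
    print(row)
```

Because `matching` keeps only df_1's columns, both halves of the union have the same shape, which is exactly what the `.select(df_1["*"])` in the corrected PySpark code ensures.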
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click Accept Answer and Yes for "Was this answer helpful?".