Comparison of two Parquet files and storing the difference in a new DataFrame

Question

Raj0125 511

Hi,

We receive Parquet files on a daily basis, and they do not have a date column. I need to compare the data in today's file with yesterday's file and load the differing records into a new DataFrame to handle in Databricks.

Please suggest.

Vaibhav Chaudhari 38,921 Reputation points Volunteer Moderator

2022-01-25T11:02:21.72+00:00

Can you provide some sample / test data for previous and current day file along with result expected in new dataframe?
Raj0125 511 Reputation points

2022-01-25T11:43:40.137+00:00

Hi,

Data as below in parquet format.

[Image: File1 with columns EMPNO, EMPNAME, SAL]
HimanshuSinha-msft 19,491 Reputation points Microsoft Employee Moderator

2022-01-27T20:25:26.377+00:00

Hello @Raj0125 ,
We haven't heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Thanks
Himanshu
HimanshuSinha-msft 19,491 Reputation points Microsoft Employee Moderator

2022-02-01T05:01:59.917+00:00

Hello @Raj0125 ,
We haven't heard back from you on the last response and are just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Thanks
Himanshu

Accepted answer

Answer 1

Hello @Raj0125 ,
Thanks for the question and for using the Microsoft Q&A platform.
As I understand it, the ask here is to find the records that are not in the latest file. Please let me know if that's not correct.

We can use the subtract API to achieve this. The idea is to load the records from the two days into two separate dataframes and then compare them.

In my example I have used some dummy data (in your case you will have to load the data from the Parquet files into the dataframes df1 and df2).

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Previous day's records (dummy data)
data1 = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

# Current day's records: the same rows plus one new record
data2 = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
    ("Himanshu", "XXXX", "YYYYY", "", "M", -1),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# spark and display are predefined in a Databricks notebook
df1 = spark.createDataFrame(data=data1, schema=schema)
df2 = spark.createDataFrame(data=data2, schema=schema)

# Rows that are present in df2 but not in df1
finaldf = df2.subtract(df1)
display(finaldf)
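Output (screenshot not preserved): finaldf contains the single row that exists only in df2, the record for Himanshu.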

Please do let me know how it goes.
Thanks
Himanshu
