Compare two parquet files and store the difference in a new dataframe

Raj0125 511 Reputation points
2022-01-25T10:59:22.967+00:00

Hi,

We receive parquet files on a daily basis, and they do not have date columns. I need to compare the data in today's file with yesterday's file and load the differing records into a new dataframe in Databricks.

Please suggest.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. HimanshuSinha-msft 19,491 Reputation points Microsoft Employee Moderator
    2022-01-25T23:49:59.433+00:00

    Hello @Raj0125,
    Thanks for the question and for using the Microsoft Q&A platform.
    As I understand it, the ask here is to find the records that are not there in the latest file. Please let me know if that's not correct.

    We can use the subtract API to achieve this. The idea is to load the records from the two days into two separate dataframes and then compare them.

    In my example I have used some dummy data (in your case you will have to load the data from the parquet files into the dataframes df1 and df2).

    # `spark` and `display` are provided by the Databricks notebook environment.
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Dummy data for the first dataframe (df1).
    data1 = [
        ("James", "", "Smith", "36636", "M", 3000),
        ("Michael", "Rose", "", "40288", "M", 4000),
        ("Robert", "", "Williams", "42114", "M", 4000),
        ("Maria", "Anne", "Jones", "39192", "F", 4000),
        ("Jen", "Mary", "Brown", "", "F", -1),
    ]

    # Dummy data for the second dataframe (df2): the same rows plus one extra record.
    data2 = [
        ("James", "", "Smith", "36636", "M", 3000),
        ("Michael", "Rose", "", "40288", "M", 4000),
        ("Robert", "", "Williams", "42114", "M", 4000),
        ("Maria", "Anne", "Jones", "39192", "F", 4000),
        ("Jen", "Mary", "Brown", "", "F", -1),
        ("Himanshu", "XXXX", "YYYYY", "", "M", -1),
    ]

    schema = StructType([
        StructField("firstname", StringType(), True),
        StructField("middlename", StringType(), True),
        StructField("lastname", StringType(), True),
        StructField("id", StringType(), True),
        StructField("gender", StringType(), True),
        StructField("salary", IntegerType(), True),
    ])

    df1 = spark.createDataFrame(data=data1, schema=schema)
    df2 = spark.createDataFrame(data=data2, schema=schema)

    # Rows that are present in df2 but not in df1.
    finaldf = df2.subtract(df1)
    display(finaldf)
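    Output

    finaldf contains only the row that is in df2 but not in df1: ("Himanshu", "XXXX", "YYYYY", "", "M", -1).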

    Please let me know how it goes.
    Thanks,
    Himanshu

    1 person found this answer helpful.

0 additional answers
