How to unpivot columns in a PySpark DataFrame into multiple columns using a Synapse notebook

Heta Desai 357 Reputation points
2022-10-26T21:01:46.937+00:00

Hi,

I want to unpivot columns in a PySpark DataFrame. I have 3 groups of columns, and on that basis I need to unpivot those columns and generate 6 new columns.

Here is the example:

Id  Key  Code  Label1  Label2  Label3  Rate1  Rate2  Rate3  CancelRate1  CancelRate2  CancelRate3
1   K1   c1    1       0       3       0.00   1.00   0.00   1.00         0.00         0.00

expected output:

Id  Key  Code  LabelName  LabelId  RateName  Rate  CancelRateName  CancelRate
1   K1   c1    Label1     1        Rate1     0.00  CancelRate1     1.00
1   K1   c1    Label2     0        Rate2     1.00  CancelRate2     0.00
1   K1   c1    Label3     3        Rate3     0.00  CancelRate3     0.00

Please suggest a solution for this. Once the columns are unpivoted, I need to perform an aggregation and then pivot the columns again. As the DataFrame is too large, I cannot use the pandas library.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. HimanshuSinha-msft 19,486 Reputation points Microsoft Employee Moderator
    2022-10-28T01:18:31.55+00:00

    Hello @Anonymous ,
    Thanks for the question and for using the MS Q&A platform.
    As we understand it, the ask here is to unpivot the DataFrame; please let us know if that is not accurate.
    I understand that you are unpivoting some data and then pivoting it back. I am sure you understand the data better, but I would request that you review the logic again, as unpivoting and then re-pivoting the data does not usually make much sense.
    Unpivot is not natively supported in Spark, so you will have to use the stack function. In a stack expression you can only pass one set of value columns, so you will have to create three DataFrames, one each for Label, Rate, and CancelRate, and then join the three DataFrames to get the output you want. I am sharing the code for implementing the stack function.

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr

    # Create the Spark session (in a Synapse notebook one is already available as `spark`)
    spark = SparkSession.builder.appName("unpivot").getOrCreate()

    data = [(1,"K1","c1",1,0,3,0.00,1.00,0.00,1.00,0.00,0.00)]

    columns = ["Id","Key","Code","Label1","Label2","Label3","Rate1","Rate2","Rate3","CancelRate1","CancelRate2","CancelRate3"]
    df = spark.createDataFrame(data=data, schema=columns)
    df.printSchema()
    df.show(truncate=False)

    # stack(n, name1, value1, ...) separates the name/value pairs into n rows
    unpivotExpr1 = "stack(3, 'Label1',Label1, 'Label2',Label2, 'Label3',Label3) as (LabelName,Total)"
    unpivotExpr2 = "stack(3, 'Rate1',Rate1, 'Rate2',Rate2, 'Rate3',Rate3) as (RateName,Total)"

    unPivotDF = df.select("Id","Key","Code", expr(unpivotExpr1))
    dflabel = unPivotDF.withColumnRenamed('Total','LabelId')
    dflabel.show()

    unPivotDF = df.select("Id","Key","Code", expr(unpivotExpr2))
    dfrate = unPivotDF.withColumnRenamed('Total','Rate')
    dfrate.show()


    Please do let me know if you have any queries.
    Thanks
    Himanshu


