How to write pyspark dataframe into Synapse Table using column name mapping

Nico Wijaya 45 Reputation points
2023-08-30T09:31:40.7933333+00:00

Hi experts,

I'm trying to do ETL from a source parquet file with the following column names and order:

  1. [AT_Number]
  2. [AT_Indicators]
  3. [AT_Date]
  4. [AT_JobId]
  5. etc.

User's image

The destination table is on Synapse Analytics database and have the following column names and order:

  1. [AT_Indicators]
  2. [AT_Number]
  3. [AT_Date]
  4. [AT_JobId]
  5. etc.

User's image

Please note that the order of column 1 & 2 in the source is flipped compared to the destination. The pyspark dataframe I am using during transformation has the same column ordering as the source. But when I write (with mode = append) the transformed data into the destination table, the column values are flipped as if the writing process is NOT done using column name mapping, but instead, column order.

User's image

User's image

Can anyone help shedding a light into how we can write using column name mapping please?

Thank you in advance for your help.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,848 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,154 questions
{count} votes

Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 88,561 Reputation points Microsoft Employee
    2023-09-07T03:53:11.7066667+00:00

    @Nico Wijaya - Dataframes are immutable,  so you will need to create a new object to apply/get the transformations. 

    Also ensure you are projecting the columns in the correct order to match your table schema. 

    You can do something like the following to resolve the issue: 

    Modify line 66 df_create_new = df_create.select(['AT_Indicators' ......

    ** Notice how I put the data frame into a new object and in the select statement put "At_Indicators" first to match your table. 

    Modify line 71 to use  the new data frame df_create_new.

     Note ** you can do df_create = df_create.select  ... though sometimes this may not be clear to the person reading your code. 

    Hope this helps. Do let us know if you any further queries.

    1 person found this answer helpful.

0 additional answers

Sort by: Most helpful

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.