How to write pyspark dataframe into Synapse Table using column name mapping

Question

Nico Wijaya 45

Hi experts,

I'm trying to do ETL from a source parquet file with the following column names and order:

[AT_Number]
[AT_Indicators]
[AT_Date]
[AT_JobId]
etc.

The destination table is in a Synapse Analytics database and has the following column names and order:

[AT_Indicators]
[AT_Number]
[AT_Date]
[AT_JobId]
etc.

Please note that columns 1 and 2 in the source are in the opposite order compared to the destination. The PySpark DataFrame I am using during transformation has the same column order as the source. But when I write the transformed data into the destination table (with mode = append), the column values are swapped, as if the write were done by column position rather than by column name.

Can anyone help shed some light on how we can write using column name mapping, please?

Thank you in advance for your help.

PRADEEPCHEEKATLA 90,641 Moderator

@Nico Wijaya - Thanks for the question and using MS Q&A platform.

To write a PySpark DataFrame into a Synapse table using column name mapping, you can use the DataFrame's write property (which returns a DataFrameWriter) and specify the column mapping with the option method. Here's an example code snippet:

from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("Write to Synapse").getOrCreate()

# read the source parquet file into a DataFrame
df = spark.read.parquet("path/to/source/parquet")

# transform the DataFrame as needed

# write the transformed DataFrame to the Synapse table using column name mapping
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<your_server>.database.windows.net:1433;database=<your_database>;user=<your_username>;password=<your_password>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;") \
  .option("dbtable", "<your_schema>.<your_table>") \
  .option("forward_spark_azure_storage_credentials", "true") \
  .option("tempdir", "wasbs://<your_container>@<your_account>.blob.core.windows.net/tempdir") \
  .option("tableOptions", "DISTRIBUTION = ROUND_ROBIN") \
  .option("columnMapping", "AT_Number=AT_Number,AT_Indicators=AT_Indicators,AT_Date=AT_Date,AT_JobId=AT_JobId") \
  .mode("append") \
  .save()

In the option method, you can specify the column mapping using the columnMapping parameter. The parameter takes a comma-separated list of column mappings in the format of source_column_name=destination_column_name. In this example, the column mapping is specified as AT_Number=AT_Number,AT_Indicators=AT_Indicators,AT_Date=AT_Date,AT_JobId=AT_JobId.

By specifying the column mapping, the write operation is meant to match on the destination column names instead of the order of the columns in the DataFrame.

Hope this helps. Do let us know if you have any further queries.

Nico Wijaya 45 Reputation points

2023-08-31T09:03:20.6333333+00:00

Hi @PRADEEPCHEEKATLA ,

Thank you for answering my question. What you're saying makes perfect sense, I think that's what I've been looking for. I just couldn't find the full range of "option" method's parameters which are available. I've been referring to this documentation below and thought they're the only available parameters.
https://learn.microsoft.com/en-us/azure/databricks/external-data/synapse-analytics

Would you be able to suggest another document containing the full range of parameters available please?
Thank you,
Nico
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-09-04T07:39:11.5466667+00:00

@Nico Wijaya - Did you try the method I shared above, and are you seeing any error message?
Nico Wijaya 45 Reputation points

2023-09-04T07:40:05.0533333+00:00

Hi @PRADEEPCHEEKATLA ,

I tried the columnMapping option, but it's somehow still writing based on the column order of the dataframe.

Any idea how I can check deeper into it? Thank you.
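
One way to check this is to pair the two column lists by position and see where the names disagree. The sketch below is plain Python and only illustrative: the function name positional_mismatches is made up for this example, and the two lists are typed in by hand from the question. In a notebook, the first list would come from df.columns and the second from wherever you read the destination table's column order.

```python
def positional_mismatches(df_columns, table_columns):
    """Pair columns by position, the way a position-based write would,
    and return (position, source_name, target_name) where the names differ."""
    mismatches = []
    for position, (source, target) in enumerate(zip(df_columns, table_columns)):
        # Compare case-insensitively so that only real mismatches are reported.
        if source.lower() != target.lower():
            mismatches.append((position, source, target))
    return mismatches


# Column order taken from the question: the first two columns are swapped.
df_columns = ["AT_Number", "AT_Indicators", "AT_Date", "AT_JobId"]
table_columns = ["AT_Indicators", "AT_Number", "AT_Date", "AT_JobId"]

for position, source, target in positional_mismatches(df_columns, table_columns):
    print(f"position {position}: DataFrame column {source!r} "
          f"would land in table column {target!r}")
```

An empty result means the two orders already agree, so a positional write and a by-name write would put the same values in the same columns.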
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-09-11T08:27:42.98+00:00

@Nico Wijaya - We haven't heard back from you since the last response and are just checking whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.

Accepted answer


Answer 1

PRADEEPCHEEKATLA 90,641 Moderator

@Nico Wijaya - DataFrames are immutable, so a transformation such as select returns a new DataFrame rather than changing the existing one. You need to assign the result to an object and write that object.

Also ensure you are projecting the columns in the correct order to match your table schema.

You can do something like the following to resolve the issue:

Modify line 66: df_create_new = df_create.select(['AT_Indicators' ......

** Notice how I put the DataFrame into a new object and, in the select statement, put "AT_Indicators" first to match your table.

Modify line 71 to use the new DataFrame, df_create_new.

Note ** You can also write df_create = df_create.select ..., though this may be less clear to the person reading your code.
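
As a rough illustration of the reordering step, the sketch below builds the list that would be passed to select. It is plain Python and rests on assumptions: the helper name select_order is invented for this example, the column lists are copied from the question, and matching names case-insensitively is a choice made here, not something the connector is known to do.

```python
def select_order(df_columns, table_columns):
    """Return the DataFrame's column names arranged in the table's order.

    Names are matched case-insensitively; the DataFrame's own spelling is kept.
    """
    by_lower = {name.lower(): name for name in df_columns}
    missing = [name for name in table_columns if name.lower() not in by_lower]
    if missing:
        # Fail early instead of silently writing shifted columns.
        raise ValueError(f"DataFrame is missing table columns: {missing}")
    return [by_lower[name.lower()] for name in table_columns]


ordered = select_order(
    ["AT_Number", "AT_Indicators", "AT_Date", "AT_JobId"],   # DataFrame order
    ["AT_Indicators", "AT_Number", "AT_Date", "AT_JobId"],   # table order
)
print(ordered)
```

In the notebook, this list would feed the select, for example df_create_new = df_create.select(*ordered), and df_create_new is then the object that gets written.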

Hope this helps. Do let us know if you have any further queries.

Nico Wijaya 45 Reputation points

2023-09-13T10:38:40.4+00:00

Hi @PRADEEPCHEEKATLA ,

Apologies for the late response, and thank you for your guidance. Yes, what you proposed has worked. I parameterised the column list so I can flexibly reuse the same notebook for other objects of a similar nature.
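
The thread does not show how the list was parameterised, so the following is only one possible shape for it: a lookup from destination table name to that table's column order. The registry TARGET_COLUMN_ORDER, the table name "dbo.AT_Table", and the function columns_for are all hypothetical names made up for this sketch.

```python
# Hypothetical registry: destination table name -> destination column order.
# "dbo.AT_Table" is a placeholder; the real table name is not given in the thread.
TARGET_COLUMN_ORDER = {
    "dbo.AT_Table": ["AT_Indicators", "AT_Number", "AT_Date", "AT_JobId"],
}


def columns_for(table_name, registry=TARGET_COLUMN_ORDER):
    """Look up the column order for a table and return it as a new list."""
    try:
        # Copy the list so callers cannot modify the registry by accident.
        return list(registry[table_name])
    except KeyError:
        raise KeyError(f"No column order registered for {table_name!r}") from None


print(columns_for("dbo.AT_Table"))
```

A notebook parameter could then carry only the table name, with the select list looked up from it before the write.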

Thank you again for your help, very much appreciated.
PRADEEPCHEEKATLA 90,641 Reputation points Moderator

2023-09-18T08:28:14.39+00:00

@Nico Wijaya - Glad to know it is resolved. If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.
