how to improve the write time to a synapse table with azure databricks

Luis Cespedes 41 Reputation points
2022-02-11T19:35:37.63+00:00

I am trying to save a pyspark dataframe with millions of rows and 110 columns to a synapse table and it takes a long time, I would like to know some ways in which I can shorten the time, searching a bit I found that enable this configuration
spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")

But I would like to know if there are other options to improve the writing time

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,467 questions
Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
1,968 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. HimanshuSinha-msft 19,381 Reputation points Microsoft Employee
    2022-02-14T22:27:40.133+00:00

    Hello @Luis Cespedes ,
    Thanks for the ask and using Microsoft Q&A platform .
    As we understand the ask here is how ti improve the writing to Synapse . Please do let me know if that not accurate
    When you say Synapse I am assming that you are refering to dedicated SQL .I am nots uere what is distibution whcih is set on the sink table . It may ve worth exploring . You can read more on the same here .
    Adding some text from the above mentione link for reference .

    What is a distributed table?

    A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm.

    Hash-distribution improves query performance on large fact tables, and is the focus of this article. Round-robin distribution is useful for improving loading speed. These design choices have a significant impact on improving query and loading performance.

    Another table storage option is to replicate a small table across all the Compute nodes. For more information, see Design guidance for replicated tables. To quickly choose among the three options, see Distributed tables in the tables overview.

    As part of table design, understand as much as possible about your data and how the data is queried. For example, consider these questions:

    How large is the table?
    How often is the table refreshed?
    Do I have fact and dimension tables in a dedicated SQL pool?

    Please do let me if you have any queries .
    Thanks
    Himanshu

    -------------------------------------------------------------------------------------------------------------------------

    • Please don't forget to click on 130616-image.png or upvote 130671-image.png button whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and help shape the future of Q&A: Here is how you can be part of Q&A Volunteer Moderators