Hi @Viet Tran
Thank you for reaching out with your questions about loading data into a dedicated SQL pool in Azure Synapse Analytics.
To answer your first question:
Staging Table as Heap and Round-Robin in Dedicated SQL Pool:
- The staging table is typically a heap and uses round-robin distribution to spread rows evenly across all distributions in the dedicated SQL pool.
- This design facilitates fast data loading, but it may result in slower query performance due to the absence of indexes.
- This is usually acceptable as the staging table is a temporary holding area.
Example: In Azure Synapse Analytics, you might create a staging table named "staging_table" without any indexes (heap) and specify the round-robin distribution method to ensure even data distribution across all distributions.
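A minimal sketch of such a table (the table and column names are illustrative placeholders, not from your workload):

```sql
-- Heap + round-robin staging table in a dedicated SQL pool.
CREATE TABLE dbo.staging_table
(
    OrderId     INT            NOT NULL,
    CustomerId  INT            NOT NULL,
    OrderDate   DATE           NOT NULL,
    Amount      DECIMAL(18, 2) NULL
)
WITH
(
    HEAP,                        -- no clustered index, so loads are as fast as possible
    DISTRIBUTION = ROUND_ROBIN   -- rows are assigned to distributions in turn, evenly
);
```

Because there is no index to maintain, inserts into this table are cheap; you would typically move the data into an indexed, hash-distributed table afterwards.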
Regarding your second question:
Creating Staging Table in Data Lake and Using ADF Copy Activity:
- It is recommended to prepare data in Azure Storage and use Azure Data Factory's Copy Activity to move data from the Data Lake staging table to the Dedicated SQL pool staging table.
- This approach offers efficient handling of large volumes of data and allows for preprocessing capabilities in the Data Lake before loading.
- It adds an additional step and complexity to the data loading process.
- There may be additional storage costs for the Data Lake.
Example: You can create a staging table in Azure Data Lake Storage, say "datalake_staging_table", and then use Azure Data Factory's Copy Activity to move data to the "staging_table" in the Dedicated SQL pool.
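Under the hood, the Copy Activity can load the dedicated SQL pool with the COPY statement. A hedged sketch of the equivalent T-SQL, assuming a hypothetical storage account and path and a Parquet staging layout:

```sql
-- Load staged Data Lake files into the dedicated SQL pool staging table.
-- The storage URL, container, and folder below are placeholders.
COPY INTO dbo.staging_table
FROM 'https://<storageaccount>.dfs.core.windows.net/staging/orders/*.parquet'
WITH
(
    FILE_TYPE  = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- assumes the workspace MSI has Storage Blob Data Reader
);
```

In Azure Data Factory you would get the same effect by selecting the COPY command as the copy method on the sink side of the Copy Activity.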
Finally, regarding your third question:
PolyBase vs. Bulk Insert:
- PolyBase is preferred for loading large volumes of data because it supports high-throughput parallel loads and can directly read external sources such as Azure Blob Storage or Azure Data Lake Storage.
- On the other hand, Bulk Insert is a slower load method that loads data through the control node and is not as highly parallel or performant as PolyBase.
- PolyBase has high performance for large files, scalability, and direct support for external data sources, but it has restrictions on row size (cannot load rows with more than 1,000,000 bytes of data).
- Bulk Insert is slower and less efficient than PolyBase, especially for large files, but it's more flexible in terms of data types and file formatting, making it suitable for smaller data loads.
Example: For loading a large dataset into Azure Synapse Analytics, you would use PolyBase to import data from Azure Blob Storage, while Bulk Insert might be used for smaller datasets.
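A PolyBase load is typically expressed as an external table plus a CTAS. The sketch below assumes a CSV layout with a header row; every object name, the storage account, and the column list are illustrative:

```sql
-- 1. Point at the external storage (placeholder account and container).
CREATE EXTERNAL DATA SOURCE BlobStaging
WITH (LOCATION = 'wasbs://staging@<storageaccount>.blob.core.windows.net');

-- 2. Describe the file layout.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

-- 3. Expose the files as a table.
CREATE EXTERNAL TABLE ext.Orders
(
    OrderId INT, CustomerId INT, OrderDate DATE, Amount DECIMAL(18, 2)
)
WITH (LOCATION = '/orders/', DATA_SOURCE = BlobStaging, FILE_FORMAT = CsvFormat);

-- 4. PolyBase load: CTAS reads the files in parallel on the compute nodes,
--    bypassing the control-node bottleneck that Bulk Insert goes through.
CREATE TABLE dbo.staging_table
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.Orders;
```

The parallel, compute-node-side read in step 4 is what gives PolyBase its throughput advantage over Bulk Insert for large files.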
Reference:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-dedicated-sql-pool
https://stackoverflow.com/questions/67744007/what-is-the-difference-between-polybase-and-bulk-insert-copy-methods-in-azure-d
Hope this helps. Do let us know if you have any further queries.
If this answers your query, do click "Accept Answer" and "Yes" for "Was this answer helpful".