Load data from Delta Lake to Synapse

Question

Muruga MuthuKrishnan 26

Hi Team,

We are using Azure Synapse Analytics and store our files in Delta format in ADLS. We now need to load this Delta-format data from ADLS into Synapse, and we would like to know the most performant and cost-effective approach. Please share any related artifacts or tutorials.

If there are multiple approaches, please provide the pros and cons of each.

Muruga MuthuKrishnan 26 Reputation points

2023-02-27T07:22:23.9233333+00:00

Hi Bhargav,

Thanks for your input. We need an approach that supports incremental loads from ADLS to Synapse. As we understand it, the PolyBase and Spark-based approaches support this, but the ADF Copy and Synapse approaches do not. Please clarify.

For the Spark-based approach, do we need to set up the Spark cluster as a dedicated cluster to perform the load from ADLS to Synapse?

The external table approach is not supported for Delta format; Delta format is supported only in the serverless SQL pool.

For your other proposals, can you suggest some documentation?

Bhargava-MSFT 31,261 Reputation points Microsoft Employee Moderator

2023-02-28T20:11:32.0133333+00:00

Hello Muruga MuthuKrishnan.

Yes, you are correct. The PolyBase and Spark-based approaches can support the incremental load from ADLS to Synapse.
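
The thread does not spell out how an incremental load decides what to copy. One common pattern, sketched here as an assumption rather than something PolyBase or the Spark connector provides on its own, is a high-watermark filter: remember the latest modification timestamp already loaded and pick up only newer rows on the next run. The `modified_at` column and the function name are hypothetical.

```python
from datetime import datetime


def select_increment(rows, last_watermark, ts_key="modified_at"):
    """Return rows changed after last_watermark plus the watermark to store next.

    rows: iterable of dicts; ts_key names a hypothetical timestamp column.
    """
    new_rows = [row for row in rows if row[ts_key] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts there.
    new_watermark = max((row[ts_key] for row in new_rows), default=last_watermark)
    return new_rows, new_watermark


if __name__ == "__main__":
    source = [
        {"id": 1, "modified_at": datetime(2023, 2, 25)},
        {"id": 2, "modified_at": datetime(2023, 2, 27)},
        {"id": 3, "modified_at": datetime(2023, 2, 28)},
    ]
    batch, watermark = select_increment(source, datetime(2023, 2, 26))
    print([row["id"] for row in batch], watermark.isoformat())
```

In a Spark job, the same comparison would typically become a filter on the DataFrame read from the Delta folder before it is written to the dedicated SQL pool.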

For the Spark-based approach, you can use the Azure Synapse Dedicated SQL Pool Connector for Apache Spark to move data between the Synapse Serverless Spark Pool and the Synapse Dedicated SQL Pool.

You can find more information on this approach at the following link:

https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/spark/synapse-spark-sql-pool-import-export.md

Regarding the Spark cluster, you can use either a dedicated Spark cluster or a serverless Spark pool.

The choice between the two depends on your specific requirements and workloads. A dedicated Spark cluster provides more control over the resources and performance, while a serverless Spark pool provides more flexibility and cost-effectiveness.

Regarding the external table approach, you are correct: Delta format is supported only in the serverless SQL pool, so external tables over Delta files are not supported in the dedicated SQL pool.
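
Because Delta is readable from the serverless SQL pool, a query there can use OPENROWSET with FORMAT = 'DELTA'. The sketch below only builds the T-SQL text and does not connect to anything; the storage URL is a placeholder and the helper name is hypothetical.

```python
def delta_openrowset_query(folder_url, top=None):
    """Build a serverless SQL pool query that reads a Delta Lake folder.

    folder_url: URL of the folder that contains the _delta_log directory.
    top: optional row limit, useful for a quick preview.
    """
    escaped = folder_url.replace("'", "''")  # escape quotes inside the T-SQL literal
    select = "SELECT *" if top is None else f"SELECT TOP {int(top)} *"
    return (
        f"{select}\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{escaped}',\n"
        "    FORMAT = 'DELTA'\n"
        ") AS delta_rows;"
    )


if __name__ == "__main__":
    # Placeholder account, container, and folder names.
    print(delta_openrowset_query(
        "https://myaccount.dfs.core.windows.net/mycontainer/sales/", top=10
    ))
```

The generated statement would be run against the workspace's serverless SQL endpoint, for example from Synapse Studio.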

Other reference docs:
https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/sql-data-warehouse/design-elt-data-loading.md
https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/guidance/implementation-success-evaluate-serverless-sql-pool-design.md

I hope this helps. Please let me know if you have any other questions.

If this answers your question, please consider accepting it by selecting Accept answer and up-voting, as this helps the community find answers to similar questions.

Answer 1

Hello @Anonymous,

There are multiple ways to load Delta-format data from ADLS into Synapse. Which one is the most performant and cost-effective depends on your specific requirements and use case.

Here are some of them.

PolyBase External Tables: You can create an external table in Synapse that uses PolyBase to connect to the Delta files in ADLS, and then query that external table with T-SQL statements.

This approach is cost-effective since it doesn't require any data movement. However, performance may be slower compared to other options since the data is read directly from the ADLS files.

Pros:

No data movement, which can reduce costs
Simple to set up and use
Suitable for simple use cases where performance is not critical

Cons:

It may not be as performant as other options since the data is read directly from ADLS files
Limited functionality compared to other options

Reference document: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/load-data-overview
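
As a rough illustration of the external table option, the sketch below builds the DDL text for an external table over a Delta folder. Per the comments in this thread, Delta is supported only in the serverless SQL pool, so this assumes the statement runs there and that an external data source pointing at the storage account already exists. All object names are placeholders and the helper is hypothetical.

```python
def delta_external_table_ddl(table, columns, location, data_source, file_format):
    """Build CREATE EXTERNAL TABLE text for a Delta folder (serverless SQL pool).

    columns: list of (name, sql_type) pairs. All identifiers are assumed to be
    trusted, already-validated names; this helper does no escaping.
    """
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        ") WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        ");"
    )


# One-time setup statement for the file format (the name is a placeholder).
DELTA_FILE_FORMAT_DDL = (
    "CREATE EXTERNAL FILE FORMAT DeltaLakeFormat WITH (FORMAT_TYPE = DELTA);"
)

if __name__ == "__main__":
    print(DELTA_FILE_FORMAT_DDL)
    print(delta_external_table_ddl(
        "dbo.Sales",
        [("id", "int"), ("amount", "decimal(18,2)")],
        "sales/",
        "DeltaLakeStorage",
        "DeltaLakeFormat",
    ))
```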

Azure Data Factory: Azure Data Factory can be used to copy data from ADLS to Synapse. ADF provides a Delta format connector that you can use to copy data in Delta format from ADLS to Synapse.

This approach allows for more flexibility since you can transform and clean the data during the copy process. However, it involves data movement, which can increase costs.

Pros:

Flexible and customizable
Suitable for complex use cases with large amounts of data
Can transform and clean data during the copy process

Cons:

Involves data movement, which can increase costs
May have higher latency compared to other options

Reference document: https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow-delta-lake

Azure Databricks: Azure Databricks can be used to load data from Delta files in ADLS to Synapse. This involves using the Azure Synapse Spark connector to write data to Synapse.

This highly performant approach allows for transformations and data processing using Spark. However, it may be more expensive compared to other options since it involves running a Databricks cluster.

Pros:

Highly performant
Allows for complex transformations and processing of data using Spark
Suitable for large-scale data processing

Cons:

More expensive compared to other options since it involves running a Databricks cluster
It may have a steeper learning curve compared to other options

Reference document: https://learn.microsoft.com/en-us/azure/databricks/delta/
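
A minimal sketch of what the Databricks route could look like, assuming the Azure Databricks Synapse connector (format `com.databricks.spark.sqldw`) and its documented `url`, `tempDir`, `forwardSparkAzureStorageCredentials`, and `dbTable` options. The function names, paths, and table names are hypothetical, and the write step only works on a cluster where that connector and storage credentials are configured.

```python
def synapse_connector_options(jdbc_url, temp_dir, table):
    """Collect the options passed to the Databricks Synapse connector."""
    return {
        "url": jdbc_url,          # JDBC URL of the dedicated SQL pool
        "tempDir": temp_dir,      # staging folder the connector writes through
        "forwardSparkAzureStorageCredentials": "true",
        "dbTable": table,         # target table in Synapse
    }


def copy_delta_to_synapse(spark, delta_path, options, mode="append"):
    """Read a Delta folder and write it to Synapse; needs a live Spark session."""
    df = spark.read.format("delta").load(delta_path)
    (
        df.write.format("com.databricks.spark.sqldw")
        .options(**options)
        .mode(mode)
        .save()
    )


if __name__ == "__main__":
    # Placeholder values; printing the options does not require Spark.
    print(synapse_connector_options(
        "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>",
        "abfss://<container>@<account>.dfs.core.windows.net/staging",
        "dbo.Sales",
    ))
```

Using `mode="append"` matches an incremental load, where each run adds only the newly selected rows; `overwrite` would suit a full reload.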

Azure Synapse Studio: Azure Synapse Studio is an integrated workspace that provides a unified experience for developing and managing big data and data warehousing solutions. Azure Synapse Studio can load data in Delta format from ADLS to Synapse.

You can either use the built-in data ingestion tools or write custom code to load the data. This approach provides a user-friendly interface and allows for customization, but it may not be as performant as using Azure Databricks.

Pros:

User-friendly interface
Allows for customization
Suitable for simple use cases

Cons:

It may not be as performant as other options
Limited functionality compared to other options

Reference document: https://learn.microsoft.com/en-us/azure/synapse-analytics/quickstart-load-studio-sql-pool

Each method has its own pros and cons, and the optimal method depends on your specific use case and requirements.

I hope this helps. Please let me know if you have any further questions.
