How to load data from on-prem SQL Server to Snowflake using ADF in a better way

jason2500 20 Reputation points
2024-01-23T19:56:28.0866667+00:00

Hi,

My use case is as follows: our data source is an on-prem SQL Server, which serves as our production database. We currently build reports in Power BI and use Snowflake as our data warehouse. I want to extract 10 to 15 tables into Snowflake for my Power BI reporting, specifically as an SCD Type 1 pipeline, with no need to retain historical data.

To move the data from on-prem to Snowflake we are using Azure Data Factory, with blob storage set up in Azure, and we already have a self-hosted integration runtime (SHIR) in place that connects to Data Factory.

Currently, I use a ForEach activity to copy the 15 tables from on-prem to Snowflake. The pipeline runs daily, and each run truncates all 15 tables before reloading the data. This process is time-consuming due to the data volume, especially since our on-prem SQL Server is a legacy database.

My questions are as follows:

  1. Is there a more efficient approach within Data Factory for this task? What best practices are recommended for this kind of scenario?
  2. Dataflows are not working with the self-hosted runtime; how can I enable dataflows for an on-prem database?

I would greatly appreciate any advice on the correct solution for this or pointers to relevant documentation or blogs.
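
For context, the per-table end state I am after in Snowflake is a simple SCD Type 1 upsert, along these lines (table and column names are purely illustrative):

```sql
-- SCD Type 1: overwrite matching rows in place, keep no history.
-- TARGET_ORDERS is the reporting table; STG_ORDERS holds the latest
-- extract copied over from the on-prem SQL Server.
MERGE INTO TARGET_ORDERS AS t
USING STG_ORDERS AS s
    ON t.ORDER_ID = s.ORDER_ID
WHEN MATCHED THEN UPDATE SET
    t.CUSTOMER_ID   = s.CUSTOMER_ID,
    t.ORDER_STATUS  = s.ORDER_STATUS,
    t.LAST_MODIFIED = s.LAST_MODIFIED
WHEN NOT MATCHED THEN INSERT (ORDER_ID, CUSTOMER_ID, ORDER_STATUS, LAST_MODIFIED)
    VALUES (s.ORDER_ID, s.CUSTOMER_ID, s.ORDER_STATUS, s.LAST_MODIFIED);
```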

Azure Data Factory

Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2024-01-24T08:26:50.25+00:00

    Hi jason2500,

    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    As per my understanding, you are trying to copy data from an on-prem SQL Server to a Snowflake data warehouse using an ADF pipeline, and you want to know the best practice for achieving this. Please let me know if that is not the ask.

    I understand that the full load technique takes a lot of time: truncating and reloading the entire set of records is only recommended for small data volumes. In your case, you may need to consider using an incremental load technique instead.

    There are two possible ways to achieve this.

    • Create an end-to-end incremental load pipeline
    • Use the CDC (change data capture) approach

    However, CDC in ADF also uses dataflows in the background, so for your case you may opt for the first approach.

    To create an incremental load pipeline, you need to rely on a 'benchmark' or 'watermark' column that stores the last-updated date for each record. You then write a source query that loads only the records whose watermark value in the source (on-prem SQL) is greater than the last watermark recorded for the sink (Snowflake), as sketched below.
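
    For illustration, here is a minimal sketch of that pattern, assuming a hypothetical Orders table with a LastModifiedDate column and a small control table that stores the watermark per table. The @{...} placeholders are ADF dynamic-content expressions fed by two Lookup activities, as in the tutorial linked below:

    ```sql
    -- 1. Look up the old watermark stored after the previous run (control table).
    SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Orders';

    -- 2. Look up the new watermark from the source table.
    SELECT MAX(LastModifiedDate) AS NewWatermarkValue FROM dbo.Orders;

    -- 3. Source query of the Copy activity: only the rows changed in between.
    SELECT *
    FROM dbo.Orders
    WHERE LastModifiedDate > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'
      AND LastModifiedDate <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}';

    -- 4. After a successful copy, advance the stored watermark.
    UPDATE dbo.WatermarkTable
    SET WatermarkValue = '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'
    WHERE TableName = 'Orders';
    ```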

    To know more about the implementation, kindly check out the resources below:

    https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview

    https://www.youtube.com/watch?v=FXw1gPaa2-M&list=PLsJW07-_K61JkuvlNfCK9gTEft_N8MVsP&index=9

    https://www.youtube.com/watch?v=IPpYDQ0QE_E&list=PLsJW07-_K61JkuvlNfCK9gTEft_N8MVsP&index=11

    To implement CDC, you can check the resources below:

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-change-data-capture

    https://www.youtube.com/watch?v=-NcyGDgiTEo
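
    If you do evaluate CDC later, note that it must first be enabled on the on-prem SQL Server itself. A minimal sketch, assuming a hypothetical database and table (enabling at the database level requires sysadmin; at the table level, db_owner):

    ```sql
    USE ProductionDb;  -- hypothetical database name

    -- Enable change data capture at the database level.
    EXEC sys.sp_cdc_enable_db;

    -- Enable change data capture for one source table.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',  -- hypothetical table
        @role_name     = NULL;       -- NULL: no gating role required to read change data
    ```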

    Hope it helps. Kindly accept the answer by clicking on the 'Accept answer' button. Thank you.


1 additional answer

  1. jason2500 20 Reputation points
    2024-01-24T19:30:34.3966667+00:00

    Thanks @AnnuKumari-MSFT! This is exactly what I was looking for. One more question: we created a self-hosted integration runtime (SHIR) on a virtual machine so that our on-prem SQL Server can connect to Data Factory. However, dataflows do not work with the self-hosted runtime; they run only on an Azure integration runtime. From my research, Microsoft suggests creating a managed VNet, but we are unable to do so as we don't have network engineers. Could you point us to some videos to look into, so we can implement this solution?

