How to load data from on-prem SQL Server to Snowflake using ADF in a better way

jason2500 20 Reputation points
2024-01-23T19:56:28.0866667+00:00

Hi,

My use case is as follows: our data source is an on-prem SQL Server, which serves as our production database. We currently build reports in Power BI and use Snowflake as our data warehouse. I want to extract 10 to 15 tables into Snowflake for my Power BI reporting, specifically as an SCD Type 1 pipeline, with no need to retain historical data.

To move the data from on-prem to Snowflake we are using Azure Data Factory, with blob storage set up in Azure, and we already have a self-hosted integration runtime (SHIR) in place that connects to Data Factory.

Currently, I use a ForEach activity to copy the 15 tables from on-prem to Snowflake. The pipeline runs daily, and each run truncates all 15 tables before reloading the data. This process is time-consuming due to the data volume, especially since our on-prem SQL Server is a legacy database.

My questions are as follows:

  1. Is there a more efficient approach within Data Factory for this task? What best practices are recommended for this kind of scenario?
  2. Dataflows are not working with the self-hosted runtime; how can I enable dataflows for an on-prem database?

I would greatly appreciate any advice on the correct solution for this or pointers to relevant documentation or blogs.
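
For context, the per-table end state I am after in Snowflake is a simple SCD Type 1 upsert, along these lines (table and column names are purely illustrative):

```sql
-- SCD Type 1: overwrite matching rows in place, keep no history.
-- TARGET_ORDERS is the reporting table; STG_ORDERS holds the latest
-- extract copied over from the on-prem SQL Server.
MERGE INTO TARGET_ORDERS AS t
USING STG_ORDERS AS s
    ON t.ORDER_ID = s.ORDER_ID
WHEN MATCHED THEN UPDATE SET
    t.CUSTOMER_ID   = s.CUSTOMER_ID,
    t.ORDER_STATUS  = s.ORDER_STATUS,
    t.LAST_MODIFIED = s.LAST_MODIFIED
WHEN NOT MATCHED THEN INSERT (ORDER_ID, CUSTOMER_ID, ORDER_STATUS, LAST_MODIFIED)
    VALUES (s.ORDER_ID, s.CUSTOMER_ID, s.ORDER_STATUS, s.LAST_MODIFIED);
```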

Azure Data Factory

Accepted answer
  1. AnnuKumari-MSFT 34,556 Reputation points Microsoft Employee Moderator
    2024-01-24T08:26:50.25+00:00

    Hi jason2500,

    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    As per my understanding, you are trying to copy data from an on-prem SQL Server to a Snowflake data warehouse using an ADF pipeline, and you want to know the best practice for achieving this. Please let me know if that is not the ask.

    I understand that the full load technique takes a lot of time: truncating and reloading the entire set of records is only recommended for small data volumes. In your case, you may need to consider using an incremental load technique instead.

    There are two possible ways to achieve this.

    • Create an end-to-end incremental load pipeline
    • Use the CDC (change data capture) approach

    However, CDC in ADF also uses dataflows in the background, so for your case you may opt for the first approach.

    To create an incremental load pipeline, you need to rely on a 'benchmark' or 'watermark' column that stores the last-updated date for each record. You then write a source query that loads only the records whose watermark value in the source (on-prem SQL) is greater than the last watermark recorded for the sink (Snowflake), as sketched below.
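
    For illustration, here is a minimal sketch of that pattern, assuming a hypothetical Orders table with a LastModifiedDate column and a small control table that stores the watermark per table. The @{...} placeholders are ADF dynamic-content expressions fed by two Lookup activities, as in the tutorial linked below:

    ```sql
    -- 1. Look up the old watermark stored after the previous run (control table).
    SELECT WatermarkValue FROM dbo.WatermarkTable WHERE TableName = 'Orders';

    -- 2. Look up the new watermark from the source table.
    SELECT MAX(LastModifiedDate) AS NewWatermarkValue FROM dbo.Orders;

    -- 3. Source query of the Copy activity: only the rows changed in between.
    SELECT *
    FROM dbo.Orders
    WHERE LastModifiedDate > '@{activity('LookupOldWatermark').output.firstRow.WatermarkValue}'
      AND LastModifiedDate <= '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}';

    -- 4. After a successful copy, advance the stored watermark.
    UPDATE dbo.WatermarkTable
    SET WatermarkValue = '@{activity('LookupNewWatermark').output.firstRow.NewWatermarkValue}'
    WHERE TableName = 'Orders';
    ```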

    To know more about the implementation, kindly check out the resources below:

    https://learn.microsoft.com/en-us/azure/data-factory/tutorial-incremental-copy-overview

    https://www.youtube.com/watch?v=FXw1gPaa2-M&list=PLsJW07-_K61JkuvlNfCK9gTEft_N8MVsP&index=9

    https://www.youtube.com/watch?v=IPpYDQ0QE_E&list=PLsJW07-_K61JkuvlNfCK9gTEft_N8MVsP&index=11

    To implement CDC, you can check the resources below:

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-change-data-capture

    https://www.youtube.com/watch?v=-NcyGDgiTEo
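
    If you do evaluate CDC later, note that it must first be enabled on the on-prem SQL Server itself. A minimal sketch, assuming a hypothetical database and table (enabling at the database level requires sysadmin; at the table level, db_owner):

    ```sql
    USE ProductionDb;  -- hypothetical database name

    -- Enable change data capture at the database level.
    EXEC sys.sp_cdc_enable_db;

    -- Enable change data capture for one source table.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',  -- hypothetical table
        @role_name     = NULL;       -- NULL: no gating role required to read change data
    ```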

    Hope it helps. Kindly accept the answer by clicking on the 'Accept answer' button. Thank you.


1 additional answer

  1. jason2500 20 Reputation points
    2024-01-24T19:30:34.3966667+00:00

    Thanks @AnnuKumari-MSFT! This is exactly what I was looking for. One more question: we created a self-hosted integration runtime (SHIR) on a virtual machine so that our on-prem SQL Server can connect to Data Factory. However, dataflows do not work with the self-hosted runtime; they run only on an Azure integration runtime. From my research, Microsoft suggests creating a managed VNet, but we are unable to do so as we don't have network engineers. Could you point us to some videos to look into, so we can implement this solution?

