Can I use Change Data Capture between blob storages?

kosmos 246 Reputation points
2023-06-30T10:08:01.35+00:00

I have one blob storage with many folders and files.

I want to iterate through the folders and files to load them into Snowflake, first through a staging table (I am flexible about not using the staging table if necessary).

Does it make sense to use the Change Data Capture tool? Or is it made only for Azure SQL databases?

Thank you!

Azure Data Factory

2 answers

  1. Subashri Vasudevan 11,226 Reputation points
    2023-07-02T04:12:06.9133333+00:00

    Hi @kosmos

    Yes, right now Azure SQL Database and SQL Managed Instance are the only databases supported, along with the other file-based targets shown in the screenshot below.

    (Screenshot: supported CDC targets in Azure Data Factory)

    Please check this link for more details and limitations. If your staging store is an Azure SQL database or one of the supported file types, using CDC will be a better option.

    Hope that helps.

    Thanks


  2. Sedat SALMAN 14,180 Reputation points MVP
    2023-07-02T04:22:24.3+00:00

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-change-data-capture

    https://learn.microsoft.com/en-us/azure/data-factory/connector-snowflake

    Yes, you can use Change Data Capture (CDC) between blob storages.

    Azure Data Factory (ADF) provides several methods for capturing changed data:

    Change Data Capture factory resource: This method provides an easy way to get started with CDC in ADF. You can select your sources and destinations, apply optional transformations, and then start your data capture. The top-level CDC resource can run continuously, which is ideal for your scenario where data in the blob storage may change over time.

    Native change data capture in mapping data flow: ADF mapping data flow can automatically detect and extract changed data, including inserted, updated, and deleted rows, from the source databases. This method uses native CDC technology in the databases.

    Auto incremental extraction in mapping data flow: This method can automatically detect and extract newly updated rows or updated files from the source stores. When loading only new or updated files from a file store, ADF mapping data flow works off each file's last modified time; this watermark logic is sketched right after the list of methods below.

    Change files capture from file-based storages: ADF provides a simple one-click solution to load only new or updated files from file-based storage such as Azure Blob Storage, Azure Data Lake Storage Gen2, or Azure Data Lake Storage Gen1. This is the simplest and recommended way to achieve delta loads from these file-based stores in mapping data flow.
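
    To make the delta logic concrete, here is a minimal Python sketch of the last-modified-time check that these file-based options rely on. It uses the azure-storage-blob package; the container name, prefix, and watermark value are placeholders for illustration, not part of the ADF feature itself.

    ```python
    # Sketch of the "new or updated files only" logic: list blobs under a prefix
    # and keep those modified after the previous run's watermark.
    # Assumes AZURE_STORAGE_CONNECTION_STRING is set; container/prefix are placeholders.
    import os
    from datetime import datetime, timezone

    from azure.storage.blob import BlobServiceClient


    def list_changed_blobs(container: str, prefix: str, watermark: datetime) -> list[str]:
        """Return names of blobs under `prefix` modified after `watermark`."""
        service = BlobServiceClient.from_connection_string(
            os.environ["AZURE_STORAGE_CONNECTION_STRING"]
        )
        container_client = service.get_container_client(container)
        return [
            blob.name
            for blob in container_client.list_blobs(name_starts_with=prefix)
            if blob.last_modified > watermark
        ]


    if __name__ == "__main__":
        # First run: use an old watermark so every file is picked up.
        # Later runs: persist and reuse the timestamp of the previous run.
        previous_run = datetime(2023, 6, 1, tzinfo=timezone.utc)
        for name in list_changed_blobs("landing", "sales/", previous_run):
            print("changed file:", name)
    ```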

    For your use case, you can use these methods to extract data from your blob storage and then load it into Snowflake. Azure Data Factory supports Snowflake as both a source and a sink. You can use the Copy activity to copy data from blob storage to Snowflake; when copying data into Snowflake, it takes advantage of Snowflake's COPY INTO [table] command to achieve the best performance.
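
    Behind the scenes, that Copy activity load boils down to a Snowflake COPY INTO statement. Here is a hedged sketch of the same load done directly with the snowflake-connector-python package; the connection parameters, the external stage @blob_stage, and the STAGING.SALES_RAW target table are placeholder assumptions, not values from this thread.

    ```python
    # Illustrative COPY INTO from an external stage that points at the blob container.
    # All identifiers and credentials below are placeholders.
    import os

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",                     # placeholder account identifier
        user="loader",                            # placeholder user
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )
    try:
        cur = conn.cursor()
        cur.execute(
            """
            COPY INTO STAGING.SALES_RAW
            FROM @blob_stage/sales/
            FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT'
            """
        )
        for row in cur:                           # one status row per file loaded
            print(row)
    finally:
        conn.close()
    ```

    If you do keep a staging table, the follow-up merge from staging into the final table can be issued the same way, or from ADF via the Script activity mentioned in the next paragraph.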

    Also, the Snowflake connector in ADF supports several capabilities like Copy activity (source/sink), Mapping data flow (source/sink), Lookup activity, and Script activity. The connector is compatible with both Azure integration runtime and self-hosted integration runtime.

    When defining the Snowflake dataset in ADF, the type property of the dataset must be set to SnowflakeTable. If Snowflake is the sink, the name of the schema and the name of the table/view must be specified, and these names are case-sensitive.
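
    For reference, that dataset definition has roughly the following shape, shown here as a Python dict mirroring the ADF JSON; the dataset name, linked service name, schema, and table are placeholders.

    ```python
    # Rough shape of a Snowflake dataset in ADF (placeholder names throughout).
    snowflake_dataset = {
        "name": "SnowflakeSinkDS",
        "properties": {
            "type": "SnowflakeTable",             # required type property
            "linkedServiceName": {
                "referenceName": "SnowflakeLS",
                "type": "LinkedServiceReference",
            },
            "typeProperties": {
                "schema": "STAGING",              # case-sensitive
                "table": "SALES_RAW",             # case-sensitive
            },
        },
    }
    ```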

    In conclusion, it definitely makes sense to use CDC for your use case. It's not limited to Azure SQL databases, and it can be used effectively with blob storages and Snowflake.

