Synapse Link vs Data Factory vs Polybase

kosmos 246 Reputation points
2022-09-19T13:45:14.56+00:00

My objective is to build a good automated ETL process to load data into Synapse DWH - Data Vault model.

My data is stored in Cosmos DB and Azure SQL DB.

There are many options to load data into Synapse DWH:

  • PolyBase: used for data virtualization. It offers no job scheduling, orchestration, logging, or handling of multiple loads, so it is not interesting for my use case
  • Data Factory: pipeline orchestration
  • Synapse Link: direct connection between the operational stores and Synapse DWH

At first glance, I would say Synapse Link (for Cosmos DB and Azure SQL DB) is the way to go. But I wonder in which cases I would then use Data Factory.

Thanks in advance


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 89,571 Reputation points Microsoft Employee
    2022-09-20T06:33:51.08+00:00

    Hello @kosmos ,

    Thanks for the question and using MS Q&A platform.

Since your data is stored in Cosmos DB and Azure SQL DB:

    Azure Data Factory and Synapse pipelines support three ways to load data into Azure Synapse Analytics.

    • Use COPY statement
    • Use PolyBase
    • Use bulk insert

    Note: The fastest and most scalable ways to load data are the COPY statement and PolyBase.

    The two options labeled "PolyBase" and "COPY command" are only applicable to Azure Synapse Analytics. Both are fast loading methods that stage data in Azure Storage (if it is not already there) and use a fast, highly parallel load from storage to each compute node. Especially on large tables these options are preferred due to their scalability, but they do come with some restrictions documented by Microsoft.
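    To make the COPY path concrete, here is a minimal sketch of a COPY statement run against a dedicated SQL pool. The storage account URL, container path, and table name are placeholders for illustration, not values from this thread:

    ```sql
    -- Sketch: load staged Parquet files from ADLS Gen2 into a staging table
    -- in a Synapse dedicated SQL pool. All names below are placeholders.
    COPY INTO dbo.stg_customer
    FROM 'https://mystorageaccount.dfs.core.windows.net/staging/customer/*.parquet'
    WITH (
        FILE_TYPE = 'PARQUET',
        CREDENTIAL = (IDENTITY = 'Managed Identity')
    );
    ```

    Compared to PolyBase, COPY does not require pre-creating external tables or file format objects, which is why it is usually the simpler of the two fast options.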

    In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.

    On other Azure SQL databases, bulk insert is the preferred and fast method.

    For more details, refer to Azure Synapse Analytics as sink.
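    In a Data Factory (or Synapse) pipeline, the choice between these load methods is made on the Copy activity sink. A rough sketch of the relevant fragment, with linked service and activity names as placeholders:

    ```json
    {
        "name": "CopyToSynapse",
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "AzureSqlSource" },
            "sink": {
                "type": "SqlDWSink",
                "allowCopyCommand": true
            },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "StagingBlobStorage",
                    "type": "LinkedServiceReference"
                }
            }
        }
    }
    ```

    Setting "allowPolyBase" instead of "allowCopyCommand" selects PolyBase; leaving both off falls back to bulk insert.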

    Synapse Link brings together Azure Cosmos DB/SQL analytical store with Azure Synapse Analytics runtime support. This integration enables you to build cloud native HTAP (Hybrid transactional/analytical processing) solutions that generate insights based on real-time updates to your operational data over large datasets. It unlocks new business scenarios to raise alerts based on live trends, build near real-time dashboards, and business experiences based on user behavior.
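    Once Synapse Link is enabled, the analytical store can be queried directly, for example from Synapse serverless SQL. A sketch, where the account, database, container, and credential names are placeholders:

    ```sql
    -- Sketch: query the Cosmos DB analytical store from Synapse serverless SQL.
    -- Account, database, container, and credential names are placeholders.
    SELECT TOP 100 *
    FROM OPENROWSET(
        PROVIDER = 'CosmosDB',
        CONNECTION = 'Account=myCosmosAccount;Database=mydb',
        OBJECT = 'orders',
        SERVER_CREDENTIAL = 'myCosmosCredential'
    ) AS orders;
    ```

    This is the "no-ETL" path: the query reads the column store replicated from the transactional store, without a pipeline moving the data.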

    Hope this will help. Please let us know if any further queries.

    ------------------------------

    • Please don't forget to click "Accept Answer" or upvote whenever the information provided helps you. Original posters help the community find answers faster by identifying the correct answer. Here is how
    • Want a reminder to come back and check responses? Here is how to subscribe to a notification
    • If you are interested in joining the VM program and helping shape the future of Q&A: here is how you can be part of Q&A Volunteer Moderators
    2 people found this answer helpful.
