Microsoft Fabric decision guide: copy activity, dataflow, or Spark

Use this reference guide and the example scenarios to help you decide whether you need a copy activity, a dataflow, or Spark for your Microsoft Fabric workloads.

Copy activity, dataflow, and Spark properties

| | Pipeline copy activity | Dataflow Gen 2 | Spark |
| --- | --- | --- | --- |
| Use case | Data lake and data warehouse migration, data ingestion, lightweight transformation | Data ingestion, data transformation, data wrangling, data profiling | Data ingestion, data transformation, data processing, data profiling |
| Primary developer persona | Data engineer, data integrator | Data engineer, data integrator, business analyst | Data engineer, data scientist, data developer |
| Primary developer skill set | ETL, SQL, JSON | ETL, M, SQL | Spark (Scala, Python, Spark SQL, R) |
| Code written | No code, low code | No code, low code | Code |
| Data volume | Low to high | Low to high | Low to high |
| Development interface | Wizard, canvas | Power Query | Notebook, Spark job definition |
| Sources | 30+ connectors | 150+ connectors | Hundreds of Spark libraries |
| Destinations | 18+ connectors | Lakehouse, Azure SQL Database, Azure Data Explorer, Azure Synapse Analytics | Hundreds of Spark libraries |
| Transformation complexity | Low: lightweight (type conversion, column mapping, merge/split files, flatten hierarchy) | Low to high: 300+ transformation functions | Low to high: support for native Spark and open-source libraries |

Review the following three scenarios for help with choosing how to work with your data in Fabric.

Scenario 1

Leo, a data engineer, needs to ingest a large volume of data from external systems, both on-premises and cloud. These external systems include databases, file systems, and APIs. Leo doesn't want to write and maintain code for each connector or data movement operation. He wants to follow medallion architecture best practices, with bronze, silver, and gold layers. Leo doesn't have any experience with Spark, so he prefers a drag-and-drop UI with minimal coding. He also wants to process the data on a schedule.

The first step is to get the raw data into the bronze layer lakehouse from Azure data resources and various third-party sources (such as Snowflake, Web, REST, AWS S3, and GCS). He wants a consolidated lakehouse, so that all the data from the various LOB, on-premises, and cloud sources resides in a single place. Leo reviews the options and selects pipeline copy activity as the appropriate choice for his raw binary copy. This pattern applies to both historical and incremental data refresh. With copy activity, Leo can load gold data into a data warehouse with no code if the need arises, and pipelines provide high-scale data ingestion that can move petabyte-scale data. Copy activity is the best low-code and no-code choice to move petabytes of data into lakehouses and warehouses from a variety of sources, either ad hoc or on a schedule.

Scenario 2

Mary is a data engineer with deep knowledge of the analytic reporting requirements of multiple LOBs. An upstream team has successfully implemented a solution to migrate multiple LOBs' historical and incremental data into a common lakehouse. Mary has been tasked with cleaning the data, applying business logic, and loading it into multiple destinations (such as Azure SQL DB, ADX, and a lakehouse) in preparation for their respective reporting teams.

Mary is an experienced Power Query user, and the data volume is in the low to medium range needed to achieve the desired performance. Dataflows provide no-code and low-code interfaces for ingesting data from hundreds of data sources. With dataflows, you can transform data using 300+ data transformation options and write the results into multiple destinations with an easy-to-use, highly visual user interface. Mary reviews the options and decides that it makes sense to use Dataflow Gen 2 as her preferred transformation option.

Scenario 3

Adam is a data engineer working for a large retail company that uses a lakehouse to store and analyze its customer data. As part of his job, Adam is responsible for building and maintaining the data pipelines that extract, transform, and load data into the lakehouse. One of the company's business requirements is to perform customer review analytics to gain insights into its customers' experiences and improve its services.

Adam decides the best option is to use Spark to build the extract and transformation logic. Spark provides a distributed computing platform that can process large amounts of data in parallel. He writes a Spark application in Python or Scala that reads structured, semi-structured, and unstructured customer review and feedback data from OneLake. The application cleanses and transforms the data and writes it to Delta tables in the lakehouse, where it's ready for downstream analytics.
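
A minimal PySpark sketch of this pattern is shown below. The file path, table name, and column names (such as review_text and review_timestamp) are illustrative assumptions rather than part of the scenario, and in a Fabric notebook attached to a lakehouse the spark session is already available.

```python
# Minimal sketch, assuming reviews land as JSON files in the lakehouse Files area
# and that the columns below exist; adjust paths and names to your data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CustomerReviewAnalytics").getOrCreate()

# Read raw, semi-structured review data from the lakehouse (OneLake).
raw_reviews = spark.read.json("Files/raw/customer_reviews/*.json")

# Cleanse and transform: drop empty reviews, normalize text, parse the date,
# and keep only the columns needed for downstream analytics.
clean_reviews = (
    raw_reviews
    .filter(F.col("review_text").isNotNull())
    .withColumn("review_text", F.trim(F.lower(F.col("review_text"))))
    .withColumn("review_date", F.to_date("review_timestamp"))
    .select("customer_id", "product_id", "rating", "review_text", "review_date")
)

# Write the curated result to a Delta table in the lakehouse.
(
    clean_reviews.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("customer_reviews_clean")
)
```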