Fast copy in Dataflows Gen2

This article describes the fast copy feature in Dataflows Gen2 for Data Factory in Microsoft Fabric. Dataflows help with ingesting and transforming data. With the introduction of dataflow scale out with SQL DW compute, you can transform your data at scale. However, your data needs to be ingested first. With the introduction of fast copy, you can ingest terabytes of data with the easy experience of dataflows, but with the scalable back-end of the pipeline Copy Activity.

After enabling this capability, Dataflows automatically switch the back-end when data size exceeds a particular threshold, without needing to change anything during authoring of the dataflows. After the refresh of a dataflow, you can check in the refresh history to see if fast copy was used during the run by looking at the Engine type that appears there.

With the Require fast copy option enabled, the dataflow refresh is cancelled if fast copy isn't used. This helps you avoid waiting for a refresh timeout to continue. This behavior can also be helpful in a debugging session to test the dataflow behavior with your data while reducing wait time. Using the fast copy indicators in the query steps pane, you can easily check if your query can run with fast copy.

Screenshot showing where the fast copy indicator appears in the query steps pane.

Prerequisites

  • You must have a Fabric capacity.
  • For file data, the files must be in .csv or Parquet format, be at least 100 MB in size, and be stored in an Azure Data Lake Storage (ADLS) Gen2 or a Blob storage account.
  • For databases, including Azure SQL DB and PostgreSQL, the data source must contain 5 million rows or more.

Note

You can bypass the threshold and force fast copy by selecting the Require fast copy setting.
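
The prerequisites and the Require fast copy bypass can be summarized as a small decision rule. The following Python sketch is only an illustrative model of the thresholds stated in this article, not part of any Fabric API. The function name, its parameters, the use of binary megabytes for the 100 MB limit, and the assumption that the file format requirement still applies when Require fast copy is set are all hypothetical.

```python
from typing import Optional

# Thresholds quoted from the Prerequisites list in this article.
FILE_MIN_BYTES = 100 * 1024 * 1024  # "at least 100 MB"; binary megabytes are an assumption
DB_MIN_ROWS = 5_000_000             # "5 million rows or more"
FAST_COPY_FILE_FORMATS = {"csv", "parquet"}


def meets_fast_copy_threshold(
    source_kind: str,
    *,
    file_format: Optional[str] = None,
    size_bytes: int = 0,
    row_count: int = 0,
    require_fast_copy: bool = False,
) -> bool:
    """Hypothetical helper: model the documented prerequisites for one query."""
    if source_kind == "file":
        # Assumption: the format requirement is not bypassed by Require fast copy.
        if file_format is None:
            return False
        if file_format.lower().lstrip(".") not in FAST_COPY_FILE_FORMATS:
            return False
        # Require fast copy bypasses only the size threshold.
        return require_fast_copy or size_bytes >= FILE_MIN_BYTES
    if source_kind == "database":
        # Require fast copy bypasses only the row-count threshold.
        return require_fast_copy or row_count >= DB_MIN_ROWS
    raise ValueError(f"unknown source kind: {source_kind!r}")
```

For example, `meets_fast_copy_threshold("database", row_count=4_999_999)` is false in this model, while the same call with `require_fast_copy=True` is true.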

Connector support

Fast copy is currently supported for the following Dataflow Gen2 connectors:

  • ADLS Gen2
  • Blob storage
  • Azure SQL DB
  • Lakehouse
  • PostgreSQL
  • On-premises SQL Server

The copy activity only supports a few transformations when connecting to a file source:

  • Combine files
  • Select columns
  • Change data types
  • Rename a column
  • Remove a column

You can still apply other transformations by splitting the ingestion and transformation steps into separate queries. The first query retrieves the data, and the second query references the results of the first so that DW compute can be used. For SQL sources, any transformation that's part of the native query is supported.
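
One way to picture the split for a file source: keep the leading run of supported transformations in the ingestion query, and move everything from the first unsupported step onward into a second query that references it. The Python sketch below models that rule only. The step names are the labels from the list above, the function name `split_steps` is hypothetical, and "Group by" is just an example of a transformation that isn't in the supported list.

```python
from typing import List, Sequence, Tuple

# Transformations this article lists as supported for file sources.
FILE_SOURCE_SUPPORTED = {
    "Combine files",
    "Select columns",
    "Change data types",
    "Rename a column",
    "Remove a column",
}


def split_steps(steps: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split steps into (ingestion query, referencing query).

    The ingestion query keeps the longest leading run of supported steps.
    Everything from the first unsupported step onward goes to the second query.
    """
    for index, step in enumerate(steps):
        if step not in FILE_SOURCE_SUPPORTED:
            return list(steps[:index]), list(steps[index:])
    # Every step is supported, so no second query is needed.
    return list(steps), []


ingest, transform = split_steps(
    ["Combine files", "Rename a column", "Group by", "Select columns"]
)
print(ingest)     # ['Combine files', 'Rename a column']
print(transform)  # ['Group by', 'Select columns']
```

In this model, a supported step that comes after an unsupported one also moves to the second query, because it operates on the result of the earlier step.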

When you directly load the query to an output destination, only Lakehouse destinations are supported currently. If you want to use another output destination, you can stage the query first and reference it later.
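
The destination rule follows the same two-query pattern. As a rough sketch, assuming destinations are identified by a plain name (the function `plan_output` and the strings it returns are hypothetical and only describe the layout of the queries):

```python
from typing import List


def plan_output(destination: str) -> List[str]:
    """Hypothetical helper: describe the queries needed to reach a destination."""
    if destination.strip().lower() == "lakehouse":
        # A Lakehouse can be loaded directly from the fast copy query.
        return ["ingestion query (fast copy) -> Lakehouse"]
    # Any other destination: stage the fast copy query, then reference it.
    return [
        "ingestion query (fast copy, staged)",
        f"referencing query -> {destination}",
    ]
```

Calling `plan_output("Lakehouse")` yields a single query, while any other destination name yields a staged ingestion query plus a referencing query that writes the output.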

How to use fast copy

  1. Navigate to the appropriate Fabric endpoint.

  2. Navigate to a premium workspace and create a Dataflow Gen2.

  3. On the Home tab of the new dataflow, select Options:

    Screenshot showing where to select the Options for Dataflows Gen2 on the Home tab.

  4. Then choose the Scale tab on the Options dialog and select the Allow use of fast copy connectors checkbox to turn on fast copy. Then close the Options dialog.

    Screenshot showing where to enable fast copy on the Scale tab of the Options dialog.

  5. Select Get data and then choose the ADLS Gen2 source, and fill in the details for your container.

  6. Use the Combine files functionality.

    Screenshot showing the Preview folder data window with the Combine option highlighted.

  7. To ensure fast copy, only apply transformations listed in the Connector support section of this article. If you need to apply more transformations, stage the data first, and reference the query later. Make other transformations on the referenced query.

  8. (Optional) You can set the Require fast copy option for the query by right-clicking on it to select and enable that option.

    Screenshot showing where to select the Require fast copy option on the right-click menu for a query.

  9. (Optional) Currently, you can only configure a Lakehouse as the output destination when loading the query directly. For any other destination, stage the query and reference it later in another query, where you can output to any destination.

  10. Check the fast copy indicators to see if your query can run with fast copy. If so, the Engine type shows CopyActivity.

    Screenshot showing the refresh details indicating the pipeline CopyActivity engine was used.

  11. Publish the dataflow.

  12. After the refresh completes, check the refresh history to confirm that fast copy was used.