Incrementally copy new and changed files based on LastModifiedDate by using the Copy Data tool

APPLIES TO: Azure Data Factory and Azure Synapse Analytics

Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!

In this tutorial, you'll use the Azure portal to create a data factory. You'll then use the Copy Data tool to create a pipeline that incrementally copies only new and changed files from Azure Blob storage to Azure Blob storage, using LastModifiedDate to determine which files to copy.

After you complete the steps here, Azure Data Factory will scan all the files in the source store, apply the file filter by LastModifiedDate, and copy to the destination store only files that are new or have been updated since last time. Note that if Data Factory scans large numbers of files, you should still expect long durations. File scanning is time consuming, even when the amount of data copied is reduced.
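Conceptually, the LastModifiedDate filter behaves like the following Python sketch, which lists every blob in a container and keeps only the ones modified inside a given time window. It's only an illustration of the filtering logic, not Data Factory's implementation; it uses the azure-storage-blob package, and the connection string, container name, and window boundaries are placeholders.

    from datetime import datetime, timezone
    from azure.storage.blob import ContainerClient

    # Placeholder value; substitute your own storage account connection string.
    conn_str = "<storage-account-connection-string>"

    # The window that one 15-minute tumbling window run would cover.
    window_start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    window_end = datetime(2024, 1, 1, 0, 15, tzinfo=timezone.utc)

    container = ContainerClient.from_connection_string(conn_str, "source")

    # Enumerate every blob, then keep only those whose last-modified time falls
    # inside the window. The enumeration is the time-consuming part.
    changed = [
        blob.name
        for blob in container.list_blobs()
        if window_start <= blob.last_modified < window_end
    ]
    print(changed)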

Note

If you're new to Data Factory, see Introduction to Azure Data Factory.

In this tutorial, you'll complete these tasks:

  • Create a data factory.
  • Use the Copy Data tool to create a pipeline.
  • Monitor the pipeline and activity runs.

Prerequisites

  • Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
  • Azure Storage account: Use Blob storage for the source and sink data stores. If you don't have an Azure Storage account, follow the instructions in Create a storage account.

Create two containers in Blob storage

Prepare your Blob storage for the tutorial by completing these steps:

  1. Create a container named source. You can use various tools to perform this task, like Azure Storage Explorer, or run the script shown after these steps.

  2. Create a container named destination.
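
If you prefer to script this setup, the following sketch creates both containers with the azure-storage-blob Python package. The connection string is a placeholder; copy the real one from the storage account's Access keys page in the Azure portal.

    from azure.core.exceptions import ResourceExistsError
    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string for the storage account you created.
    conn_str = "<storage-account-connection-string>"

    service = BlobServiceClient.from_connection_string(conn_str)

    # Create the two containers the tutorial expects.
    for name in ("source", "destination"):
        try:
            service.create_container(name)
            print(f"Created container '{name}'")
        except ResourceExistsError:
            print(f"Container '{name}' already exists")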

Create a data factory

  1. In the left pane, select Create a resource. Select Integration > Data Factory:

    Select Data Factory

  2. On the New data factory page, under Name, enter ADFTutorialDataFactory.

    The name for your data factory must be globally unique. You might receive this error message:

    New data factory error message for duplicate name.

    If you receive an error message about the name value, enter a different name for the data factory. For example, use the name yournameADFTutorialDataFactory. For the naming rules for Data Factory artifacts, see Data Factory naming rules.

  3. Under Subscription, select the Azure subscription in which you'll create the new data factory.

  4. Under Resource Group, take one of these steps:

    • Select Use existing and then select an existing resource group in the list.

    • Select Create new and then enter a name for the resource group.

    To learn about resource groups, see Use resource groups to manage your Azure resources.

  5. Under Version, select V2.

  6. Under Location, select the location for the data factory. Only supported locations appear in the list. The data stores (for example, Azure Storage and Azure SQL Database) and computes (for example, Azure HDInsight) that your data factory uses can be in other locations and regions.

  7. Select Create.

  8. After the data factory is created, the data factory home page appears.

  9. To open the Azure Data Factory user interface (UI) on a separate tab, select Open on the Open Azure Data Factory Studio tile:

    Home page for the Azure Data Factory, with the Open Azure Data Factory Studio tile.
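
If you'd rather create the factory programmatically, here's a minimal sketch that uses the azure-identity and azure-mgmt-datafactory Python packages. The subscription ID, resource group, and region are placeholders, and the factory name still has to be globally unique.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory

    # Placeholder values; replace them with your own.
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group-name>"
    factory_name = "yournameADFTutorialDataFactory"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Equivalent to selecting version V2 and a supported location in the portal.
    factory = adf_client.factories.create_or_update(
        resource_group,
        factory_name,
        Factory(location="eastus"),
    )
    print(factory.provisioning_state)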

Use the Copy Data tool to create a pipeline

  1. On the Azure Data Factory home page, select the Ingest tile to open the Copy Data tool:

    Screenshot that shows the ADF home page.

  2. On the Properties page, take the following steps:

    1. Under Task type, select Built-in copy task.

    2. Under Task cadence or task schedule, select Tumbling window.

    3. Under Recurrence, enter 15 Minute(s). (A scripted equivalent of the resulting tumbling window trigger appears after this procedure.)

    4. Select Next.

    Copy data properties page

  3. On the Source data store page, complete these steps:

    1. Select + New connection to add a connection.

    2. Select Azure Blob Storage from the gallery, and then select Continue:

      Select Azure Blob Storage

    3. On the New connection (Azure Blob Storage) page, select your Azure subscription from the Azure subscription list and your storage account from the Storage account name list. Test the connection and then select Create.

    4. Select the newly created connection in the Connection block.

    5. In the File or folder section, select Browse and choose the source folder, and then select OK.

    6. Under File loading behavior, select Incremental load: LastModifiedDate, and choose Binary copy.

    7. Select Next.

    Screenshot that shows the 'Source data store' page.

  4. On the Destination data store page, complete these steps:

    1. Select the AzureBlobStorage connection that you created. This is the same storage account as the source data store.

    2. In the Folder path section, browse for and select the destination folder, and then select OK.

    3. Select Next.

    Screenshot that shows the 'Destination data store' page.

  5. On the Settings page, under Task name, enter DeltaCopyFromBlobPipeline, then select Next. Data Factory creates a pipeline with the specified task name.

    Screenshot that shows the Settings page.

  6. On the Summary page, review the settings and then select Next.

    Summary page

  7. On the Deployment page, select Monitor to monitor the pipeline (task).

    Deployment page

  8. The application automatically switches to the Monitor tab on the left, where you can see the status of the pipeline. Select Refresh to refresh the list. Select the link under Pipeline name to view activity run details or to run the pipeline again.

    Refresh the list and view activity run details

  9. There's only one activity (the copy activity) in the pipeline, so you see only one entry. For details about the copy operation, on the Activity runs page, select the Details link (the eyeglasses icon) in the Activity name column. For details about the properties, see Copy activity overview.

    Copy activity in the pipeline

    Because there are no files in the source container in your Blob storage account, you won't see any files copied to the destination container in the account:

    No files in source container or destination container

  10. Create an empty text file and name it file1.txt. Upload this text file to the source container in your storage account. You can use various tools to perform these tasks, like Azure Storage Explorer.

    Create file1.txt and upload it to the source container

  11. To go back to the Pipeline runs view, select the All pipeline runs link in the breadcrumb menu on the Activity runs page, and then wait for the same pipeline to be automatically triggered again.

  12. When the second pipeline run completes, follow the same steps mentioned previously to review the activity run details.

    You'll see that one file (file1.txt) has been copied from the source container to the destination container of your Blob storage account:

    file1.txt has been copied from the source container to the destination container

  13. Create another empty text file and name it file2.txt. Upload this text file to the source container in your Blob storage account.

  14. Repeat steps 11 and 12 for the second text file. You'll see that only the new file (file2.txt) was copied from the source container to the destination container of your storage account during this pipeline run.

    You can also verify that only one file has been copied by using Azure Storage Explorer to scan the files, or by running the script that appears after this procedure:

    Scan files by using Azure Storage Explorer
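
The Copy Data tool generates the tumbling window trigger from step 2 for you, so you don't need to create it yourself. For reference, the following sketch shows roughly what an equivalent 15-minute trigger looks like when created with the azure-mgmt-datafactory Python SDK. It assumes the adf_client, resource_group, and factory_name variables from the earlier factory example, assumes the DeltaCopyFromBlobPipeline pipeline already exists, and the exact model fields can vary between SDK versions.

    from datetime import datetime, timezone
    from azure.mgmt.datafactory.models import (
        PipelineReference,
        TriggerPipelineReference,
        TriggerResource,
        TumblingWindowTrigger,
    )

    # A 15-minute tumbling window that runs DeltaCopyFromBlobPipeline.
    # The trigger that the Copy Data tool generates also passes the window start
    # and end times to pipeline parameters; those parameter names depend on what
    # the tool generated, so they're omitted here.
    trigger = TumblingWindowTrigger(
        pipeline=TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="DeltaCopyFromBlobPipeline")
        ),
        frequency="Minute",
        interval=15,
        start_time=datetime(2024, 1, 1, tzinfo=timezone.utc),
        max_concurrency=1,
    )

    adf_client.triggers.create_or_update(
        resource_group,
        factory_name,
        "DeltaCopyTrigger",  # Hypothetical trigger name.
        TriggerResource(properties=trigger),
    )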

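To script the upload-and-verify loop from steps 10 through 14, the following sketch uses the azure-storage-blob package to drop a new empty file into the source container and, after the next trigger window has passed, lists what reached the destination container. The connection string and the wait time are placeholder assumptions.

    import time
    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string for the same storage account used above.
    conn_str = "<storage-account-connection-string>"

    service = BlobServiceClient.from_connection_string(conn_str)
    source = service.get_container_client("source")
    destination = service.get_container_client("destination")

    # Upload an empty test file to the source container (as in steps 10 and 13).
    source.upload_blob("file2.txt", data=b"", overwrite=True)

    # Wait for the next 15-minute tumbling window to run the pipeline, then
    # confirm that only the newly uploaded file shows up in the destination.
    time.sleep(16 * 60)
    print([blob.name for blob in destination.list_blobs()])
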
Go to the following tutorial to learn how to transform data by using an Apache Spark cluster on Azure: