Azure Data Factory Pipeline for copying data to Azure Delta Lake from Blob Storage

Rajeev Singh 21 Reputation points
2023-04-07T03:37:36.05+00:00

Hi, I have a requirement to copy data from Blob Storage to Databricks Delta Lake tables/databases, but I am unable to perform the task using the sink-and-staging method. Regards, Rajeev

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. VasimTamboli 4,420 Reputation points
    2023-04-08T12:05:41.9833333+00:00

    Hi Rajeev,

    To copy data from Blob Storage to Databricks Delta Lake, you can use an Azure Data Factory (ADF) pipeline with a Delta Lake sink.

    Here are the high-level steps to perform this task:

    1. Create a new Azure Data Factory in the Azure portal.
    2. Create a new pipeline in the Data Factory.
    3. Add a Blob Storage source to the pipeline and configure it to read the data you want to copy.
    4. Add a Delta Lake sink to the pipeline and configure it to write the data to your Delta Lake table/database. You'll need to provide the JDBC URL, username, password, and table name.
    5. Optionally, use a staging directory to improve performance. In that case, add a copy activity that first copies the data from Blob Storage to the staging directory, and then copies it from the staging directory to Delta Lake.

    Here's an example pipeline JSON that copies data from Blob Storage to Delta Lake:

    ```json
    {
        "name": "CopyData",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromBlob",
                    "type": "Copy",
                    "inputs": [ { "name": "BlobInput" } ],
                    "outputs": [ { "name": "StagingOutput" } ],
                    "typeProperties": {
                        "source": { "type": "BlobSource", "recursive": true },
                        "sink": {
                            "type": "FileSystemSink",
                            "rootFolderPath": "wasbs://<your-storage-account-name>@<your-container-name>.blob.core.windows.net/staging/",
                            "copyBehavior": "FlattenHierarchy",
                            "writeBatchSize": 0,
                            "writeBatchTimeout": "00:00:00"
                        }
                    },
                    "policy": {
                        "timeout": "7.00:00:00",
                        "concurrency": 1,
                        "executionPriorityOrder": "NewestFirst"
                    }
                },
                {
                    "name": "CopyToDeltaLake",
                    "type": "Copy",
                    "dependsOn": [
                        { "activity": "CopyFromBlob", "dependencyConditions": [ "Succeeded" ] }
                    ],
                    "inputs": [ { "name": "StagingOutput" } ],
                    "outputs": [ { "name": "DeltaLakeOutput" } ],
                    "typeProperties": {
                        "source": { "type": "FileSystemSource", "recursive": true },
                        "sink": {
                            "type": "DeltaLakeSink",
                            "writeBatchSize": 0,
                            "writeBatchTimeout": "00:00:00",
                            "enableAutoCompaction": true,
                            "maxBytesCompacted": "5368709120"
                        },
                        "enableStaging": true,
                        "stagingLinkedService": {
                            "referenceName": "<your-storage-account-name>",
                            "type": "AzureBlobFS",
                            "typeProperties": {
                                "url": "wasbs://<your-storage-account-name>@<your-container-name>.blob.core.windows.net/staging/",
                                "recursive": true,
                                "enableServerSideEncryption": true,
                                "encryptionType": "ServiceManaged",
                                "fileName": ""
                            }
                        }
                    },
                    "policy": {
                        "timeout": "7.00:00:00",
                        "concurrency": 1,
                        "executionPriorityOrder": "NewestFirst"
                    }
                }
            ],
            "parameters": {
                "BlobInput": { "type": "String" }
            }
        }
    }
    ```
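    The pipeline above assumes a Databricks Delta Lake linked service already exists in the Data Factory. As a minimal sketch of what such a linked service might look like (the workspace URL, cluster ID, and secret names below are placeholders, not values from this thread; storing the access token in Key Vault is one common option):

    ```json
    {
        "name": "AzureDatabricksDeltaLakeLinkedService",
        "properties": {
            "type": "AzureDatabricksDeltaLake",
            "typeProperties": {
                "domain": "https://<your-workspace>.azuredatabricks.net",
                "clusterId": "<your-cluster-id>",
                "accessToken": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": "<your-key-vault-linked-service>",
                        "type": "LinkedServiceReference"
                    },
                    "secretName": "<databricks-token-secret>"
                }
            }
        }
    }
    ```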

    4 people found this answer helpful.

1 additional answer

  1. AnnuKumari-MSFT 31,716 Reputation points Microsoft Employee
    2023-04-10T07:52:10.3033333+00:00

    Hi Rajeev Singh, welcome to the Microsoft Q&A platform and thanks for posting your query here.

    As I understand your question, you want to copy data from a PostgreSQL database to a Databricks Delta table using an ADF pipeline while staging the data. Please let me know if that is not the ask here.

    In order to copy data from PostgreSQL to a Delta table, you can use the PostgreSQL database connector to create your source dataset, and Databricks Delta Lake as the sink dataset.
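    As a rough sketch of what the sink dataset could look like (the linked service reference, database, and table names are placeholders, not values from this thread):

    ```json
    {
        "name": "DeltaLakeSinkDataset",
        "properties": {
            "type": "AzureDatabricksDeltaLakeDataset",
            "linkedServiceName": {
                "referenceName": "<your-delta-lake-linked-service>",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "database": "default",
                "table": "<your-delta-table>"
            }
        }
    }
    ```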

    When your source data store or format does not match the direct copy criteria, you can enable the built-in staged copy using an interim Azure storage instance. The staged copy feature also gives you better throughput.

    The service automatically converts the data in the staging storage to meet the data format requirements, then loads the data into the Delta Lake table from there. Finally, it cleans up your temporary data from the storage. See Staged copy for details about copying data using staging.
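    As a hedged sketch, enabling staged copy on a copy activity typically means setting `enableStaging` and `stagingSettings` on the activity's type properties, along the lines of the following (the source type, linked service name, and path are placeholder assumptions, not values from this thread):

    ```json
    {
        "name": "CopyToDeltaWithStaging",
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "DelimitedTextSource" },
            "sink": { "type": "AzureDatabricksDeltaLakeSink" },
            "enableStaging": true,
            "stagingSettings": {
                "linkedServiceName": {
                    "referenceName": "<your-staging-blob-storage>",
                    "type": "LinkedServiceReference"
                },
                "path": "<container>/<staging-folder>"
            }
        }
    }
    ```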

    For more details, please go through the following official documentation on Staged copy to delta lake while Copying data to and from Azure Databricks Delta Lake using Azure Data Factory or Azure Synapse Analytics.

    Hope it helps. Kindly accept the answer if it helps. Thanks.