How to create data flow using Azure SDK for python?

Question

Subashri Vasudevan 11,226

Hi Team,

I am trying to create a simple data flow using the Azure SDK, with a source and a sink, but I am unable to create the link between the two. Can you please provide a code snippet to do this? There are hardly any resources for it.

Thanks in advance.

Suba

VasimTamboli 5,215

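from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation,
    DataFlowResource,
    DataFlowSink,
    DataFlowSource,
    DatasetReference,
    DatasetResource,
    DelimitedTextDataset,
    LinkedServiceReference,
    MappingDataFlow,
)

# Set up authentication
credential = DefaultAzureCredential()

# Provide your Azure subscription ID
subscription_id = 'your_subscription_id'

# Provide your resource group and data factory name
resource_group_name = 'your_resource_group_name'
data_factory_name = 'your_data_factory_name'

# Set up the Data Factory management client
data_factory_client = DataFactoryManagementClient(credential, subscription_id)

# Reference to the existing Azure Blob Storage linked service
linked_service = LinkedServiceReference(
    type='LinkedServiceReference',
    reference_name='your_azure_blob_linked_service'
)

# Define the source and sink datasets as delimited text files in Blob Storage.
# 'your_container_name' and the two dataset names below are extra placeholders.
source_dataset = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=linked_service,
    location=AzureBlobStorageLocation(
        container='your_container_name',
        folder_path='your_source_folder_path',
        file_name='your_source_file_path'
    )
))

sink_dataset = DatasetResource(properties=DelimitedTextDataset(
    linked_service_name=linked_service,
    location=AzureBlobStorageLocation(
        container='your_container_name',
        folder_path='your_sink_folder_path',
        file_name='your_sink_file_name'
    )
))

# Create the datasets so the data flow can reference them by name
data_factory_client.datasets.create_or_update(
    resource_group_name, data_factory_name, 'your_source_dataset_name', source_dataset
)
data_factory_client.datasets.create_or_update(
    resource_group_name, data_factory_name, 'your_sink_dataset_name', sink_dataset
)

# Define the data flow. The link between source and sink lives in the script:
# "source1 sink(...) ~> sink1" feeds the stream source1 into sink1.
data_flow_name = 'your_data_flow_name'
data_flow_resource = DataFlowResource(
    properties=MappingDataFlow(
        sources=[DataFlowSource(
            name='source1',
            dataset=DatasetReference(type='DatasetReference', reference_name='your_source_dataset_name')
        )],
        sinks=[DataFlowSink(
            name='sink1',
            dataset=DatasetReference(type='DatasetReference', reference_name='your_sink_dataset_name')
        )],
        script_lines=[
            'source(allowSchemaDrift: true,',
            '     validateSchema: false) ~> source1',
            'source1 sink(allowSchemaDrift: true,',
            '     validateSchema: false) ~> sink1'
        ]
    )
)

# Create the data flow
data_factory_client.data_flows.create_or_update(
    resource_group_name,
    data_factory_name,
    data_flow_name,
    data_flow_resource
)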

In the above code:

You need to install the required packages: azure-mgmt-datafactory, azure-identity.

Replace the placeholders with your own values:

your_subscription_id: The Azure subscription ID.
your_resource_group_name: The name of your resource group.
your_data_factory_name: The name of your data factory.
your_azure_blob_linked_service: The name of the Azure Blob Storage linked service you have already created in your data factory.
your_source_folder_path: The folder path of the source dataset in Azure Blob Storage.
your_source_file_path: The file path of the source dataset in Azure Blob Storage.
your_sink_folder_path: The folder path of the sink dataset in Azure Blob Storage.
your_sink_file_name: The file name of the sink dataset in Azure Blob Storage.
your_data_flow_name: The name of your data flow.

This code snippet creates a data flow with a source dataset and a sink dataset, both using Azure Blob Storage. You can modify it based on your specific requirements and the type of data source/sink you want to use.
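
One way to check the result is to read the data flow back and list its stream names. The following is only a minimal sketch: `describe_data_flow` is a hypothetical helper, and the stand-in client exists purely so the example runs without an Azure subscription. With a real `DataFactoryManagementClient`, the same helper would call `data_flows.get`.

```python
from types import SimpleNamespace


def describe_data_flow(client, resource_group_name, factory_name, data_flow_name):
    """Read a data flow back and list its source and sink stream names."""
    resource = client.data_flows.get(resource_group_name, factory_name, data_flow_name)
    props = resource.properties
    return {
        "name": resource.name,
        "sources": [s.name for s in (getattr(props, "sources", None) or [])],
        "sinks": [s.name for s in (getattr(props, "sinks", None) or [])],
    }


# Offline stand-in for the management client (assumption, for illustration only).
# With a real client, pass data_factory_client instead of fake_client.
def _fake_get(resource_group_name, factory_name, data_flow_name):
    props = SimpleNamespace(
        sources=[SimpleNamespace(name="source1")],
        sinks=[SimpleNamespace(name="sink1")],
    )
    return SimpleNamespace(name=data_flow_name, properties=props)


fake_client = SimpleNamespace(data_flows=SimpleNamespace(get=_fake_get))
summary = describe_data_flow(fake_client, "rg", "factory", "your_data_flow_name")
print(summary)
```

An empty `sources` or `sinks` list in the summary would suggest the data flow was stored without its streams.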

Subashri Vasudevan 11,226 Reputation points

2023-06-06T13:28:38.9666667+00:00

This doesn't work either. I see an error like the one below.

Subashri Vasudevan 11,226 Reputation points

2023-06-08T01:38:38.0366667+00:00

Yes Sathish. It doesn't talk about data flow.

QuantumCache 20,366 Reputation points Moderator

2023-06-08T15:37:53.4433333+00:00

Thank you, let me look for that!

Accepted answer

Answer 1

Hello @Subashri Vasudevan,
Thanks for reaching out on this forum. I am checking on this query!
Did you try the following documentation?

Quickstart: Create a data factory and pipeline using Python

from azure.identity import ClientSecretCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import *
from datetime import datetime, timedelta
import time

def print_item(group):
    """Print an Azure object instance."""
    print("\tName: {}".format(group.name))
    print("\tId: {}".format(group.id))
    if hasattr(group, 'location'):
        print("\tLocation: {}".format(group.location))
    if hasattr(group, 'tags'):
        print("\tTags: {}".format(group.tags))
    if hasattr(group, 'properties'):
        print_properties(group.properties)

def print_properties(props):
    """Print a ResourceGroup properties instance."""
    if props and hasattr(props, 'provisioning_state') and props.provisioning_state:
        print("\tProperties:")
        print("\t\tProvisioning State: {}".format(props.provisioning_state))
    print("\n\n")

def print_activity_run_details(activity_run):
    """Print activity run details."""
    print("\n\tActivity run details\n")
    print("\tActivity run status: {}".format(activity_run.status))
    if activity_run.status == 'Succeeded':
        print("\tNumber of bytes read: {}".format(activity_run.output['dataRead']))
        print("\tNumber of bytes written: {}".format(activity_run.output['dataWritten']))
        print("\tCopy duration: {}".format(activity_run.output['copyDuration']))
    else:
        print("\tErrors: {}".format(activity_run.error['message']))

def main():
    # Azure subscription ID
    subscription_id = '5917e'

    # This program creates this resource group. If it's an existing resource group, comment out the code that creates the resource group
    rg_name = 'MyRG'

    # The data factory name. It must be globally unique.
    df_name = 'msftCommunityspace'

    # Specify your Active Directory client ID, client secret, and tenant ID
    credentials = ClientSecretCredential(client_id='1fba2e52', client_secret='j1R8WcrD', tenant_id='cd5218') 
    resource_client = ResourceManagementClient(credentials, subscription_id)
    adf_client = DataFactoryManagementClient(credentials, subscription_id)

    rg_params = {'location': 'eastus'}
    df_params = {'location': 'eastus'}

    # create the resource group
    # comment out if the resource group already exists
    # resource_client.resource_groups.create_or_update(rg_name, rg_params)

    # Create an Azure Storage linked service
    ls_name = 'ls_storageLinkedService823'

    # IMPORTANT: Specify the name and key of your Azure Storage account.
    storage_string = SecureString(value='DefaultEndpointsProtocol=https;AccountName=communiestorage;AccountKey=O1fk==;EndpointSuffix=core.windows.net')

    ls_azure_storage = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_string)) 
    ls = adf_client.linked_services.create_or_update(rg_name, df_name, ls_name, ls_azure_storage)
    print_item(ls)


    
    # Sink dataset: delimited text written to output/SalesheaderOut.csv
    response = adf_client.datasets.create_or_update(
        resource_group_name=rg_name,
        factory_name=df_name,
        dataset_name="ds_out_Blob",
        dataset={
            "properties": {
                "linkedServiceName": {"referenceName": "ls_storageLinkedService823", "type": "LinkedServiceReference"},
                "type": "DelimitedText",
                "typeProperties": {
                    "location": {
                        "type": "AzureBlobStorageLocation",
                        "fileName": "SalesheaderOut.csv",
                        "folderPath": "output",
                        "container": "adfcont",
                    },
                    "columnDelimiter": ",",
                    "escapeChar": "\\",
                    "firstRowAsHeader": True,
                    "quoteChar": "\"",
                },
                "schema": [],
            }
        },
    )

    # Source dataset: delimited text read from inbound/Salesheader.csv (six string columns)
    response = adf_client.datasets.create_or_update(
        resource_group_name=rg_name,
        factory_name=df_name,
        dataset_name="ds_in_Blob",
        dataset={
            "properties": {
                "linkedServiceName": {"referenceName": "ls_storageLinkedService823", "type": "LinkedServiceReference"},
                "type": "DelimitedText",
                "typeProperties": {
                    "location": {
                        "type": "AzureBlobStorageLocation",
                        "fileName": "Salesheader.csv",
                        "container": "inbound",
                    },
                    "columnDelimiter": ",",
                    "escapeChar": "\\",
                    "quoteChar": "\"",
                },
                "schema": [
                    {"type": "String"},
                    {"type": "String"},
                    {"type": "String"},
                    {"type": "String"},
                    {"type": "String"},
                    {"type": "String"},
                ],
            }
        },
    )

    # Data flow: the script wires source1 (ds_in_Blob) straight into sink1 (ds_out_Blob)
    response = adf_client.data_flows.create_or_update(
        resource_group_name=rg_name,
        factory_name=df_name,
        data_flow_name="exampleDataFlow1",
        data_flow={
            "properties": {
                "description": "Sample data flow that passes rows from source1 to sink1 without transformations.",
                "type": "MappingDataFlow",
                "typeProperties": {
                    "scriptLines": [
                        "source(allowSchemaDrift: true,",
                        "     validateSchema: false,",
                        "     ignoreNoFilesFound: false) ~> source1",
                        "source1 sink(allowSchemaDrift: true,",
                        "     validateSchema: false,",
                        "     skipDuplicateMapInputs: true,",
                        "     skipDuplicateMapOutputs: true) ~> sink1",
                    ],
                    "sinks": [
                        {
                            "dataset": {"referenceName": "ds_out_Blob", "type": "DatasetReference"},
                            "name": "sink1",
                        }
                    ],
                    "sources": [
                        {
                            "dataset": {"referenceName": "ds_in_Blob", "type": "DatasetReference"},
                            "name": "source1",
                        }
                    ],
                },
            }
        },
    )
    print(response)


# Start the main method
main()
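
In the script above, the link between the source and the sink is not a separate API object. It comes from the data flow script: the line `source1 sink(...) ~> sink1` names `source1` as the input stream of `sink1`, and the `name` values under `sources` and `sinks` have to match those stream names. A minimal sketch of generating the same payload for arbitrary names, assuming the same script options as above (the helper names `build_script_lines` and `build_data_flow_payload` are hypothetical):

```python
import json


def build_script_lines(source_name, sink_name):
    """Return data flow script lines that feed one source stream into one sink."""
    return [
        "source(allowSchemaDrift: true,",
        "     validateSchema: false,",
        "     ignoreNoFilesFound: false) ~> " + source_name,
        # The stream name in front of sink(...) is what links source to sink.
        source_name + " sink(allowSchemaDrift: true,",
        "     validateSchema: false,",
        "     skipDuplicateMapInputs: true,",
        "     skipDuplicateMapOutputs: true) ~> " + sink_name,
    ]


def build_data_flow_payload(source_name, source_dataset, sink_name, sink_dataset):
    """Build the dict passed as data_flow= to data_flows.create_or_update."""
    return {
        "properties": {
            "type": "MappingDataFlow",
            "typeProperties": {
                "scriptLines": build_script_lines(source_name, sink_name),
                "sources": [
                    {
                        "name": source_name,
                        "dataset": {"referenceName": source_dataset, "type": "DatasetReference"},
                    }
                ],
                "sinks": [
                    {
                        "name": sink_name,
                        "dataset": {"referenceName": sink_dataset, "type": "DatasetReference"},
                    }
                ],
            },
        }
    }


payload = build_data_flow_payload("source1", "ds_in_Blob", "sink1", "ds_out_Blob")
print(json.dumps(payload, indent=2))
```

The resulting dict can be passed as the `data_flow` argument in place of the literal used in the script.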

QuantumCache 20,366 Reputation points Moderator

2023-06-09T04:42:07.3133333+00:00

Hello @Subashri Vasudevan,
Please refer to the above Python script to create the data flow. Please take care of creating the datasets and linking them in the data flow; I am leaving that to you!

Please let me know how it goes.

QuantumCache 20,366 Reputation points Moderator

2023-06-09T15:03:43.8533333+00:00

Hello @Subashri Vasudevan, did you get a chance to look at the Python script I provided above? Please let us know so that we can close this thread!

QuantumCache 20,366 Reputation points Moderator

2023-06-12T18:15:21.84+00:00

Hello @Subashri Vasudevan, if the response is helpful, please click "Accept Answer" and upvote it so that we can close this thread.

Subashri Vasudevan 11,226 Reputation points

2023-06-13T02:29:56.9333333+00:00

Hi Satish, I'm yet to verify this. Please give me another day. Thanks for your help.

Subashri Vasudevan 11,226 Reputation points

2023-06-17T12:30:22.2133333+00:00

SatishBoddu-MSFT, I am having some environment issues and couldn't validate it right now. Apologies. Once the issue is resolved, I will confirm.

QuantumCache 20,366 Reputation points Moderator

2023-06-17T19:13:24.4366667+00:00

OK, no worries, take your time. Thanks for the response!

CON-Thirusenthilkumar Pandiyan 45 Reputation points

2023-09-19T11:10:17.26+00:00

@SatishBoddu-MSFT could you please help? I have generated the data flow and the copy activity programmatically using Python. I want to create a pipeline in which the first step is the data flow (which takes the config/metadata info from the DB), followed by the copy activity. However, I can't work out how to chain the copy activity and the data flow together in Python.
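
For chaining the two steps, one possible approach is a pipeline with an `ExecuteDataFlow` activity followed by a `Copy` activity whose `dependsOn` entry points at the data flow activity. This is only a sketch under assumptions: the activity names, the helper names, and the `DelimitedTextSource`/`DelimitedTextSink` copy settings are illustrative, the data flow and datasets are assumed to exist already, and the payload is passed as a dict in the same way the script earlier in this thread passes dicts to `create_or_update`.

```python
import json


def build_pipeline_payload(data_flow_name, copy_source_dataset, copy_sink_dataset):
    """Build a pipeline dict: run a data flow, then a copy activity on success."""
    data_flow_activity = {
        "name": "RunDataFlow",  # hypothetical activity name
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataFlow": {"referenceName": data_flow_name, "type": "DataFlowReference"},
        },
    }
    copy_activity = {
        "name": "CopyAfterDataFlow",  # hypothetical activity name
        "type": "Copy",
        # dependsOn is what orders the copy after the data flow.
        "dependsOn": [
            {"activity": data_flow_activity["name"], "dependencyConditions": ["Succeeded"]}
        ],
        "inputs": [{"referenceName": copy_source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": copy_sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Assumes delimited text datasets; adjust to the actual dataset types.
            "source": {"type": "DelimitedTextSource"},
            "sink": {"type": "DelimitedTextSink"},
        },
    }
    return {"properties": {"activities": [data_flow_activity, copy_activity]}}


def create_pipeline(adf_client, rg_name, df_name, pipeline_name, payload):
    """Submit the payload with an existing DataFactoryManagementClient."""
    return adf_client.pipelines.create_or_update(
        resource_group_name=rg_name,
        factory_name=df_name,
        pipeline_name=pipeline_name,
        pipeline=payload,
    )


pipeline_payload = build_pipeline_payload("exampleDataFlow1", "ds_in_Blob", "ds_out_Blob")
print(json.dumps(pipeline_payload, indent=2))
```

With a real client this would be submitted as `create_pipeline(adf_client, rg_name, df_name, "examplePipeline", pipeline_payload)`, where `"examplePipeline"` is a placeholder name.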
