Exercise - Build an Azure Data Factory pipeline to copy data


A data factory is a service for processing structured and unstructured data from any source. A pipeline is a logical grouping of activities that together perform a task. The activities in a pipeline define actions to perform on your data.

In this exercise, you’ll create an Azure Data Factory pipeline that connects to your Dataverse environment and copies three columns of the Emission table into a JSON file in blob storage.

Create a Microsoft Entra ID app registration

The Data Factory pipeline will use the Microsoft Entra ID app registration to gain access to your Dataverse environment.

  1. Go to the Azure portal.

  2. Go to Microsoft Entra ID.

  3. Go to App registrations.

  4. Select New registration.

    • Enter any name, such as adf-mc4s.

    • Select Single tenant for the Supported account types section.

    • Leave the Redirect URI section blank.

    • Select Register.

      Screenshot of registration page to register an application.

    The app registration will be created, and the Overview tab will open.

  5. Note the Application (client) ID because you’ll need it later in this exercise.

  6. Select Certificates & secrets > Client secrets > New client secret.

    Screenshot highlighting Client secrets and option to add new client secret under Certificates & secrets.

  7. Enter any description, retain the default expiration, and then select Add.

  8. Copy the new client secret because you’ll need it later in this exercise.

    Warning

    You won’t be able to retrieve the secret later, so be sure to copy it now.
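
If you prefer to script this step, the following Python sketch creates an equivalent single-tenant app registration and client secret through the Microsoft Graph API. It's a sketch only: it assumes the azure-identity and requests packages, an account that's permitted to create app registrations, and the placeholder names used in this exercise.

```python
import requests
from azure.identity import InteractiveBrowserCredential

# Sign in interactively with an account that can create app registrations.
credential = InteractiveBrowserCredential()
token = credential.get_token("https://graph.microsoft.com/.default").token
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# Single-tenant app registration (Supported account types = Single tenant).
app = requests.post(
    "https://graph.microsoft.com/v1.0/applications",
    headers=headers,
    json={"displayName": "adf-mc4s", "signInAudience": "AzureADMyOrg"},
).json()
print("Application (client) ID:", app["appId"])

# Add a client secret; the secret value is returned only once, so store it now.
secret = requests.post(
    f"https://graph.microsoft.com/v1.0/applications/{app['id']}/addPassword",
    headers=headers,
    json={"passwordCredential": {"displayName": "adf-mc4s-secret"}},
).json()
print("Client secret:", secret["secretText"])
```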

Grant access to Dataverse

In this step, you’ll create an application user that’s linked to the app registration and then you’ll grant access to Dataverse.

  1. Go to the Microsoft Power Platform admin center.

  2. Go to Environments and select the environment where your Microsoft Cloud for Sustainability solution is installed.

  3. Note your Environment URL because you’ll need this information later in this exercise. The URL will resemble org12345.crm2.dynamics.com.

  4. Select Settings in the toolbar at the top.

  5. Expand Users + permissions and then select Application users.

    Screenshot of the Power Platform admin center highlighting Application users.

  6. Select New app user.

  7. Select Add an app.

  8. From the list, select the app registration that you created in the previous step (for example, adf-mc4s), and then select Add.

  9. In the Business units section, select the organization that matches the environment URL that you took note of previously.

  10. Select the edit icon to the right of Security roles.

  11. Select System administrator from the list, select Save, and then select Create.

    Note

    For simplicity, Data Factory has the System Administrator role in this exercise. However, in a production environment, you would create a specific role with only the needed permissions.
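
Optionally, you can confirm that the application user can reach Dataverse before you build the pipeline. This Python sketch (assuming the azure-identity and requests packages, and placeholder values for the tenant ID, client ID, secret, and environment URL) requests a token with the app registration and calls the Dataverse WhoAmI function; a 200 response indicates that the app user and security role were set up correctly.

```python
import requests
from azure.identity import ClientSecretCredential

# Placeholder values; use the IDs, secret, and environment URL you noted earlier.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<application-client-id>"
CLIENT_SECRET = "<client-secret>"
ENV_URL = "https://org12345.crm2.dynamics.com"

# Acquire a token for the Dataverse environment as the service principal.
credential = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
token = credential.get_token(f"{ENV_URL}/.default").token

# WhoAmI returns the caller's user and organization IDs if access was granted.
response = requests.get(
    f"{ENV_URL}/api/data/v9.2/WhoAmI",
    headers={"Authorization": f"Bearer {token}"},
)
print(response.status_code, response.json())
```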

Create a storage account

In this step, you’ll create the storage account to which the data factory pipeline will write the output files.

  1. Go to the Azure portal.

  2. Create a new storage account resource:

    • Give any name, such as samc4s.

    • Select a region (preferably the same region where your Microsoft Cloud for Sustainability environment is deployed).

    • In the Redundancy dropdown menu, select Locally-redundant storage (LRS).

    • Retain everything else as default and then create the resource.

      Screenshot highlighting creation of storage account under adf-mc4s resource group.

  3. Select Go to resource. Go to Containers and select + Container in the toolbar to create a new container.

  4. Enter the details as follows:

    • Name - adf-output

    • Public access level - Private

      Screenshot showing container creation under the storage account.

  5. Select Create.
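
As an alternative to the portal, the following sketch creates a comparable storage account and private container with the Azure SDK for Python. It assumes the azure-identity and azure-mgmt-storage packages; the subscription ID, resource group, region, and account name are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"   # hypothetical resource group name
ACCOUNT_NAME = "samc4s"

storage_client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# StorageV2 account with locally-redundant storage (LRS), as in the portal steps.
poller = storage_client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    {
        "location": "westeurope",  # use the region of your Microsoft Cloud for Sustainability environment
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
poller.result()  # wait for provisioning to finish

# Container for the pipeline output; the default public access level is private.
storage_client.blob_containers.create(RESOURCE_GROUP, ACCOUNT_NAME, "adf-output", {})
```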

Create a Data Factory resource

To create a Data Factory resource, follow these steps:

  1. From the Azure portal, create a new Data Factory resource.

    Screenshot showing data factory resource creation in Azure portal.

  2. On the Basics tab, enter a name, such as adf-mc4s, and then select the same region that you previously selected for the storage account.

    Screenshot highlighting creation of data factory under subscription and resource group.

  3. On the Git configuration tab, select Configure Git later.

    Screenshot highlighting Git configuration option in creation of data factory.

  4. Retain all other options as default and then create the Data Factory resource.

  5. When the resource is created, from the Overview tab, select Launch Studio.
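
The same resource can be created with the azure-mgmt-datafactory package, as in this sketch; the subscription ID, resource group, and region are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Create the data factory in the same region as the storage account.
factory = adf_client.factories.create_or_update(
    RESOURCE_GROUP, "adf-mc4s", Factory(location="westeurope")
)
print(factory.name, factory.provisioning_state)
```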

Create the Data Factory linked services

To create the linked services to Dataverse (for the pipeline input) and to the storage account (for the pipeline output), follow these steps:

  1. From the Azure Data Factory Studio, select the Manage icon (the last icon in the toolbar on the left) and then select Linked services > New.

    Screenshot highlighting linked services under data factory connections.

  2. Search for Dataverse, select Dataverse (Common Data Service for Apps), and then select Continue.

  3. Fill in the New linked service page as follows:

    • Name - MC4S Dataverse Link

    • Service Uri - Enter the environment URL that you previously took note of in the exercise

    • Service principal ID - Enter the Application (client) ID that you previously took note of in the exercise

    • Service principal key - Enter the secret key that you previously created in the exercise

      Screenshot highlighting the new linked service name, URL, service principal ID, and service principal key.

  4. Select Test connection to validate the connection.

  5. Select Create.

  6. On the Linked services page, select New. You’ll now create the output connection to the storage account.

  7. Search for storage and then select Azure Blob Storage.

    Screenshot highlighting Azure blob storage under data store in new linked service.

  8. Select Continue.

  9. Fill in the New linked service page as follows:

    • Name - Blob storage link

    • Storage account name - From the list, select the storage account that you previously created in this exercise
  10. Select Test connection to validate the connection.

  11. Select Create.

    Screenshot highlighting the storage account name and the Create button for the new linked service, with a successful test connection.

    Two linked services will be displayed on the Linked services page.

    Screenshot showing the two linked services displayed on the Linked services page.
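
If you script the factory instead of using the Studio, both linked services can be created with azure-mgmt-datafactory, roughly as follows. This is a sketch only: the model names mirror the Dataverse (Common Data Service for Apps) and Azure Blob Storage connectors, and every ID, secret, URL, and account key shown is a placeholder.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLinkedService,
    CommonDataServiceForAppsLinkedService,
    LinkedServiceResource,
    SecureString,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"
FACTORY_NAME = "adf-mc4s"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Input: Dataverse, authenticated with the app registration created earlier.
dataverse_ls = LinkedServiceResource(
    properties=CommonDataServiceForAppsLinkedService(
        deployment_type="Online",
        service_uri="https://org12345.crm2.dynamics.com",  # your environment URL
        authentication_type="AADServicePrincipal",
        service_principal_id="<application-client-id>",
        service_principal_credential_type="ServicePrincipalKey",
        service_principal_credential=SecureString(value="<client-secret>"),
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "MC4S Dataverse Link", dataverse_ls
)

# Output: Azure Blob Storage, using the storage account key for simplicity.
blob_ls = LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string=(
            "DefaultEndpointsProtocol=https;AccountName=samc4s;"
            "AccountKey=<account-key>;EndpointSuffix=core.windows.net"
        )
    )
)
adf_client.linked_services.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "Blob storage link", blob_ls
)
```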

Create the input dataset

A dataset is a named view of data that points to or references the data that you want to use in your activities as inputs and outputs.

To create the input dataset, complete the following steps:

  1. On the left toolbar of the Azure Data Factory portal, select the Author icon (the second icon from the top). Select the plus (+) symbol, and then select Dataset to add a dataset.

    Screenshot highlighting add dataset under factory resources.

  2. On the New dataset page, search for Dataverse, select Dataverse (Common Data Service for Apps), and then select Continue.

  3. In the Set properties page, fill in the form as follows:

    1. Enter any name, such as Emissions.

    2. Select MC4S Dataverse Link from the Linked service list.

    3. Select Emission (msdyn_emission) from the Entity name list.

    4. Select OK.

      Screenshot showing Set properties page where you can provide details such as name and linked service.

  4. Review and test the connection and then select Publish all > Publish.

    Screenshot highlighting a successful test connection and the Publish all button.
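
The input dataset can also be defined in code. This sketch reuses the MC4S Dataverse Link linked service created earlier and assumes the azure-mgmt-datafactory package; the subscription ID, resource group, and factory name remain placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    CommonDataServiceForAppsEntityDataset,
    DatasetResource,
    LinkedServiceReference,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"
FACTORY_NAME = "adf-mc4s"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Dataset over the Emission (msdyn_emission) table in Dataverse.
emissions_ds = DatasetResource(
    properties=CommonDataServiceForAppsEntityDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="MC4S Dataverse Link"
        ),
        entity_name="msdyn_emission",
    )
)
adf_client.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "Emissions", emissions_ds
)
```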

Create the output dataset

To create the output dataset, follow these steps:

  1. On the left toolbar of the Azure Data Factory portal, select the Author icon, select the plus (+) symbol, and then select Dataset to add a dataset.

    Screenshot highlighting add dataset option under factory resources.

  2. On the New dataset page, search for storage, select Azure Blob Storage, and then select Continue.

  3. On the Select format page, select JSON, and then select Continue.

  4. On the Set properties page, fill in the form as follows:

    1. Enter any name, such as OutputEmissions.

    2. Select Blob storage link from the Linked service dropdown list.

    3. Enter adf-output / mc4s for the file path, and then leave the last box blank.

    4. Select From sample file for Import schema.

    5. Download this file locally and then select it with the Browse button.

    6. Select OK.

      Screenshot showing the Set properties page where you can fill in details such as the name, linked service, and file path.

  5. Review and test the connection and then select Publish all > Publish.

    Screenshot highlighting a successful test connection and the Publish all button.
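
The output dataset has a code equivalent too. This sketch (same assumptions and placeholders as before) defines a JSON dataset on the Blob storage link linked service that writes to the mc4s folder of the adf-output container.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobStorageLocation,
    DatasetResource,
    JsonDataset,
    LinkedServiceReference,
)

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"
FACTORY_NAME = "adf-mc4s"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# JSON output dataset that writes into the adf-output container, mc4s folder.
output_ds = DatasetResource(
    properties=JsonDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="Blob storage link"
        ),
        location=AzureBlobStorageLocation(container="adf-output", folder_path="mc4s"),
    )
)
adf_client.datasets.create_or_update(
    RESOURCE_GROUP, FACTORY_NAME, "OutputEmissions", output_ds
)
```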

Create a Data Factory pipeline

To create a Data Factory pipeline, follow these steps:

  1. On the left toolbar of the Azure Data Factory portal, select the Author icon, select the plus (+) symbol, and then select Pipeline > Pipeline.

    Screenshot highlighting the option to add a pipeline under Factory Resources in the data factory.

  2. Expand Move & transform and then drag the Copy data activity onto the design surface.

    Screenshot showing the Copy data activity dragged onto the design surface.

  3. On the Source tab, select Emissions from the Source dataset dropdown menu.

    Screenshot highlighting source dataset as Emissions under source tab.

  4. On the Sink tab, select OutputEmissions from the Sink dataset dropdown menu, and then select Array of objects from the File pattern dropdown menu.

    Screenshot highlighting the sink dataset as OutputEmissions under the Sink tab.

  5. On the Mapping tab, select Import schemas.

    Screenshot highlighting Import schemas under the Mapping tab.

  6. Define the mapping as follows:

    • msdyn_transactiondate > TransactionDate

    • msdyn_activityname > Activity

    • msdyn_co2e > CO2E

      Screenshot showing the source mapped as follows: msdyn_transactiondate to TransactionDate, msdyn_activityname to Activity, and msdyn_co2e to CO2E.

  7. Select Debug and then wait for the pipeline to run.

    Screenshot highlighting the Debug button and the succeeded pipeline run.

  8. Select Publish all > Publish to save the pipeline. If you want to run the pipeline again without debugging, select Add trigger > Trigger now.

  9. Return to the Azure portal and go to the storage account that you previously created.

  10. Select Containers and then select the adf-output container.

    Screenshot highlighting the adf-output container under Containers in the storage account.

  11. Open the mc4s folder, select msdyn_emission.json, and then select Download.

    Screenshot showing the mc4s folder opened, the msdyn_emission.json file selected, and the Download option.

  12. Open the JSON file and confirm that the three mapped columns from the Emission table display in JSON format.

    Screenshot showing the three mapped columns from the Emission table displayed in JSON format.
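
To automate these final steps, you can trigger the published pipeline and check its output from a script instead of the Studio and the portal. The sketch below assumes the azure-identity, azure-mgmt-datafactory, and azure-storage-blob packages, a pipeline named pipeline1 (use the name shown in your Studio), and that your signed-in identity has read access to the blobs (for example, the Storage Blob Data Reader role).

```python
import json
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.storage.blob import BlobServiceClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "adf-mc4s"
FACTORY_NAME = "adf-mc4s"
PIPELINE_NAME = "pipeline1"  # replace with the name of your pipeline
STORAGE_URL = "https://samc4s.blob.core.windows.net"

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

# Trigger the published pipeline (equivalent to Add trigger > Trigger now).
run = adf_client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)

# Poll the run until it reaches a terminal state.
while True:
    status = adf_client.pipeline_runs.get(
        RESOURCE_GROUP, FACTORY_NAME, run.run_id
    ).status
    if status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
print("Pipeline run status:", status)

# Download the output file and confirm the three mapped columns are present.
blob_client = BlobServiceClient(STORAGE_URL, credential=credential).get_blob_client(
    "adf-output", "mc4s/msdyn_emission.json"
)
rows = json.loads(blob_client.download_blob().readall())
print(len(rows), "rows; first row:", rows[0])
```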