Transform data in Delta Lake using mapping data flows
APPLIES TO: Azure Data Factory Azure Synapse Analytics
If you're new to Azure Data Factory, see Introduction to Azure Data Factory.
In this tutorial, you'll use the data flow canvas to create data flows that allow you to analyze and transform data in Azure Data Lake Storage (ADLS) Gen2 and store it in Delta Lake.
Prerequisites
- Azure subscription. If you don't have an Azure subscription, create a free Azure account before you begin.
- Azure storage account. You use ADLS Gen2 storage as the source and sink data stores. If you don't have a storage account, see Create an Azure storage account for steps to create one.
The file we transform in this tutorial is MoviesDB.csv, which can be found here. To retrieve the file from GitHub, copy the contents to a text editor of your choice and save it locally as a .csv file. To upload the file to your storage account, see Upload blobs with the Azure portal. The examples reference a container named 'sample-data'.
Create a data factory
In this step, you create a data factory and open the Data Factory UX to create a pipeline in the data factory.
Open Microsoft Edge or Google Chrome. Currently, Data Factory UI is supported only in the Microsoft Edge and Google Chrome web browsers.
On the left menu, select Create a resource > Integration > Data Factory.
On the New data factory page, under Name, enter ADFTutorialDataFactory.
Select the Azure subscription in which you want to create the data factory.
For Resource Group, take one of the following steps:
a. Select Use existing, and select an existing resource group from the drop-down list.
b. Select Create new, and enter the name of a resource group.
To learn about resource groups, see Use resource groups to manage your Azure resources.
Under Version, select V2.
Under Location, select a location for the data factory. Only locations that are supported are displayed in the drop-down list. Data stores (for example, Azure Storage and SQL Database) and computes (for example, Azure HDInsight) used by the data factory can be in other regions.
After the creation is finished, you see the creation notice in the Notifications center. Select Go to resource to navigate to the Data factory page.
Select Author & Monitor to launch the Data Factory UI in a separate tab.
Create a pipeline with a data flow activity
In this step, you'll create a pipeline that contains a data flow activity.
On the home page, select Orchestrate.
In the General tab for the pipeline, enter DeltaLake for Name of the pipeline.
In the Activities pane, expand the Move and Transform accordion. Drag and drop the Data Flow activity from the pane to the pipeline canvas.
In the Adding Data Flow pop-up, select Create new Data Flow and then name your data flow DeltaLake. Click Finish when done.
In the top bar of the pipeline canvas, slide the Data Flow debug slider on. Debug mode allows for interactive testing of transformation logic against a live Spark cluster. Data Flow clusters take 5-7 minutes to warm up, so we recommend turning on debug first if you plan to do data flow development. For more information, see Debug Mode.
Build transformation logic in the data flow canvas
You will create two data flows in this tutorial. The first data flow is a simple source-to-sink flow that generates a new Delta Lake from the movies CSV file above. The second data flow, whose design is shown below, updates data in that Delta Lake:
- Update the ratings for 1988 movies to '1'.
- Delete all movies from 1950.
- Insert new movies for 2021 by duplicating the movies from 1960.
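The three operations above can be sketched end to end in plain Python. This is illustrative only; the data flow executes on Spark, and the column names movie, year, and rating are assumptions based on the MoviesDB.csv sample:

```python
# Illustrative sketch of the Delta Lake update logic (not the actual data flow).
# Assumes the MoviesDB.csv columns include movie, year, and rating.

def apply_updates(rows):
    """Split the source rows into the update/delete/insert sets described above."""
    updates = [dict(r, rating="1") for r in rows if r["year"] == 1988]  # 1988 ratings -> '1'
    deletes = [r for r in rows if r["year"] == 1950]                    # drop 1950 movies
    inserts = [dict(r, year=2021) for r in rows if r["year"] == 1960]   # duplicate 1960 as 2021
    return updates, deletes, inserts

movies = [
    {"movie": 1, "year": 1950, "rating": "6"},
    {"movie": 2, "year": 1988, "rating": "7"},
    {"movie": 3, "year": 1960, "rating": "8"},
]
updates, deletes, inserts = apply_updates(movies)
print(updates, deletes, inserts)
```

In the tutorial itself, these three sets are produced by a filter, a derived column, and an alter row transformation, as the following steps show.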
Start from a blank data flow canvas
Click on the source transformation
Click New next to Dataset in the bottom panel, and create a new linked service for ADLS Gen2
Choose Delimited Text for the dataset type
Name the dataset "MoviesCSV"
Point to the MoviesCSV file that you uploaded to storage above
Set it to be comma-delimited, with the header on the first row
Go to the source projection tab and click "Detect data types"
Once you have your projection set, you can continue
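"Detect data types" samples the file and infers a type for each column. A naive sketch of the idea, in illustrative Python (the service's actual inference rules are richer than this):

```python
def detect_type(values):
    """Naively infer a column type from sampled string values: int, then float, else string."""
    for caster, name in ((int, "integer"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "string"

print(detect_type(["1950", "1988", "1960"]))  # integer
print(detect_type(["7.5", "8.0"]))            # double
print(detect_type(["The Shining"]))           # string
```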
Add a sink transformation
Delta is an inline dataset type. You will need to point to your ADLS Gen2 storage account.
Choose a folder name in your storage container where you would like ADF to create the Delta Lake
Go back to the pipeline designer and click Debug to execute the pipeline in debug mode with just this data flow activity on the canvas. This will generate your new Delta Lake in ADLS Gen2.
From Factory Resources, click new > Data flow
Use the MoviesCSV dataset again as a source, and click "Detect data types" again
Add a filter transformation to your source transformation in the graph
Only allow movie rows that match the three years you're going to work with: 1950, 1988, and 1960
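The filter amounts to keeping rows whose year is one of those three values. A sketch in illustrative Python (the column name year is assumed from the sample file):

```python
# Years the tutorial works with.
KEEP_YEARS = {1950, 1988, 1960}

def year_filter(rows):
    """Pass through only the rows for the three tutorial years."""
    return [r for r in rows if r["year"] in KEEP_YEARS]

print(year_filter([{"movie": 1, "year": 1950}, {"movie": 2, "year": 1975}]))
# [{'movie': 1, 'year': 1950}]
```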
Update the rating for each 1988 movie to '1' by adding a derived column transformation after your filter transformation
In that same derived column transformation, create movies for 2021 by taking an existing year and changing it to 2021. Let's pick 1960.
This is what your three derived columns will look like
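The derived-column logic can be sketched as follows. This is illustrative Python; in the data flow itself you would express the same remapping with conditional (iif) expressions, and the column names are assumptions from the sample file:

```python
def derive(row):
    """Apply the two derived-column rules: year 1960 -> 2021, and 1988 ratings -> '1'."""
    return {
        **row,
        "year": 2021 if row["year"] == 1960 else row["year"],
        "rating": "1" if row["year"] == 1988 else row["rating"],
    }

print(derive({"movie": 3, "year": 1960, "rating": "8"}))
# {'movie': 3, 'year': 2021, 'rating': '8'}
```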
Update, insert, delete, and upsert policies are created in the alter row transformation. Add an alter row transformation after your derived column.
Your alter row policies should look like this.
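The policies amount to: update when the year is 1988, delete when it is 1950, and insert when it is 2021 (the remapped 1960 rows). As an illustrative sketch, not data flow script:

```python
def alter_row_policy(row):
    """Pick the alter-row policy for a row that already passed the filter and derived column."""
    if row["year"] == 1988:
        return "update"
    if row["year"] == 1950:
        return "delete"
    if row["year"] == 2021:
        return "insert"
    return "none"  # shouldn't occur after the year filter above

print([alter_row_policy({"year": y}) for y in (1988, 1950, 2021)])
# ['update', 'delete', 'insert']
```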
Now that you've set the proper policy for each alter row type, check that the proper update rules have been set on the sink transformation.
Here we are using the Delta Lake sink to your ADLS Gen2 data lake, allowing inserts, updates, and deletes.
Note that the Key Columns are a composite key made up of the Movie primary key column and year column. This is because we created fake 2021 movies by duplicating the 1960 rows. This avoids collisions when looking up the existing rows by providing uniqueness.
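A quick sketch of why the composite key matters: keying rows by (movie, year) keeps the inserted 2021 copies distinct from the original 1960 rows, which share the same movie ID. Illustrative Python, not the actual Delta merge:

```python
def upsert(table, row):
    """Upsert a row into a table keyed by the composite (movie, year) key."""
    table[(row["movie"], row["year"])] = row

delta = {}
upsert(delta, {"movie": 3, "year": 1960, "rating": "8"})  # existing row in the Delta Lake
upsert(delta, {"movie": 3, "year": 2021, "rating": "8"})  # inserted 2021 copy
print(len(delta))  # 2 -- keying by movie alone would overwrite the 1960 row
```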
Download completed sample
Learn more about the data flow expression language.