Module 2: Transform data with a dataflow in Data Factory

This module takes about 25 minutes to create a dataflow, apply transformations, and move the raw data from the Bronze table into a Gold Lakehouse table.

With the raw data loaded into your Bronze Lakehouse table from the last module, you can now prepare that data and enrich it by combining it with another table that contains discounts for each vendor and their trips during a particular day. This final Gold Lakehouse table is loaded and ready for consumption.

The high-level steps in the dataflow are as follows:

  • Get raw data from the Lakehouse table created by the Copy activity in Module 1: Create a pipeline with Data Factory.
  • Transform the data imported from the Lakehouse table.
  • Connect to a CSV file containing discounts data.
  • Transform the discounts data.
  • Combine trips and discounts data.
  • Load the output query into the Gold Lakehouse table.

Get data from a Lakehouse table

  1. From the sidebar, select Create, and then select Dataflow Gen2 to create a new Dataflow Gen2.

    Screenshot showing the Fabric Create page with the Dataflow Gen2 button highlighted.

  2. From the new dataflow menu, select Get data, and then More....

    Screenshot showing the Dataflow menu with Get data button highlighted and the More... option highlighted from its menu.

  3. Search for and select the Lakehouse connector.

    Screenshot showing the selection of the Lakehouse data source from the Choose data source menu.

  4. The Connect to data source dialog appears, and a new connection is automatically created for you based on the currently signed-in user. Select Next.

    Screenshot showing the configuration of the data source settings for your new Lakehouse with your current signed in user, and the Next button selected.

  5. The Choose data dialog is displayed. Use the navigation pane to find the Lakehouse you created for the destination in the prior module, and select the Tutorial_Lakehouse data table.

    Screenshot showing the Lakehouse browser with the workspace, lakehouse, and table created with the Copy activity in module 1.

  6. (Optional) Once your canvas is populated with the data, you can enable column profile information, which is useful for data profiling. Based on the profiles, you can apply the right transformations and target the right data values.

    To do this, select Options from the ribbon pane, then select the first three options under Column profile, and then select OK.

    Screenshot showing the column options selection for your data.
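
If you open the query in the Advanced editor, the source is expressed in Power Query M. The following is a rough, illustrative sketch only: the connector generates the real code for you, and the navigation record fields and identifiers shown here are placeholders rather than values you need to type.

    let
        // Illustrative sketch only: the Lakehouse connector generates the actual navigation steps.
        Source = Lakehouse.Contents(null),
        // Placeholder navigation to the workspace, lakehouse, and table chosen in the dialogs above
        Workspace = Source{[workspaceId = "<workspace-id>"]}[Data],
        LakehouseNav = Workspace{[lakehouseId = "<lakehouse-id>"]}[Data],
        BronzeTable = LakehouseNav{[Id = "<table-name>", ItemKind = "Table"]}[Data]
    in
        BronzeTable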

Transform the data imported from the Lakehouse

  1. Select the data type icon in the column header of the second column, lpepPickupDatetime, to display a dropdown menu, and then select Date to convert the column from the Date/Time type to the Date type.

    Screenshot showing the selection of the Date data type for the lpepPickupDatetime column.

  2. (Optional) On the Home tab of the ribbon, select the Choose columns option from the Manage columns group.

    Screenshot showing the Choose columns button on the Home tab of the dataflow editor.

  3. (Optional) On the Choose columns dialog, deselect the following columns, and then select OK:

    • lpepDropoffDatetime
    • puLocationId
    • doLocationId
    • pickupLatitude
    • dropoffLongitude
    • rateCodeID

    Screenshot showing the Choose columns dialog with the identified columns deselected.

  4. Select the storeAndFwdFlag column's filter and sort dropdown menu. (If you see a warning List may be incomplete, select Load more to see all the data.)

    Screenshot showing the filter and sort dialog for the column.

  5. Select 'Y' to show only rows where a discount was applied, and then select OK.

    Screenshot showing the values filter with only 'Y' selected.

  6. Select the lpepPickupDatetime column sort and filter dropdown menu, then select Date filters, and choose the Between... filter provided for Date and Date/Time types.

    Screenshot showing the selection of the Date filters option in the column sort and format dropdown.

  7. In the Filter rows dialog, select dates between January 1, 2015, and January 31, 2015, then select OK.

    Screenshot showing the selection of the dates in January 2015.
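
For reference, the transformations in this section correspond roughly to the following Power Query M sketch. The dataflow editor generates equivalent steps for you; the step names here are illustrative, and Source stands for the table loaded from the Lakehouse.

    let
        // Source: the nyc_taxi table loaded from the Lakehouse in the previous section
        // Convert lpepPickupDatetime from Date/Time to Date
        ChangedType = Table.TransformColumnTypes(Source, {{"lpepPickupDatetime", type date}}),
        // (Optional) Drop the columns deselected in the Choose columns dialog
        RemovedColumns = Table.RemoveColumns(ChangedType,
            {"lpepDropoffDatetime", "puLocationId", "doLocationId",
             "pickupLatitude", "dropoffLongitude", "rateCodeID"}),
        // Keep only rows where a discount was applied
        FilteredFlag = Table.SelectRows(RemovedColumns, each [storeAndFwdFlag] = "Y"),
        // Keep only trips between January 1 and January 31, 2015
        FilteredDates = Table.SelectRows(FilteredFlag, each
            [lpepPickupDatetime] >= #date(2015, 1, 1) and
            [lpepPickupDatetime] <= #date(2015, 1, 31))
    in
        FilteredDates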

Connect to a CSV file containing discount data

Now, with the data from the trips in place, we want to load the data that contains the respective discounts for each day and VendorID, and prepare the data before combining it with the trips data.

  1. From the Home tab in the dataflow editor menu, select the Get data option, and then choose Text/CSV.

    Screenshot showing the selection of the Get data menu from the Home tab, with Text/CSV highlighted.

  2. On the Connect to data source dialog, provide the following details:

    • File path or URL - https://raw.githubusercontent.com/ekote/azure-architect/master/Generated-NYC-Taxi-Green-Discounts.csv
    • Authentication kind - Anonymous

    Then select Next.

    Screenshot showing the Text/CSV settings for the connection.

  3. On the Preview file data dialog, select Create.

    Screenshot showing the Preview file data dialog with the Create button highlighted.
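
Behind the scenes, this connection is equivalent to an M query along the lines of the following sketch. The exact options the connector generates, such as the detected column count or encoding, can differ.

    let
        // Download the discounts CSV over HTTPS with anonymous authentication
        Source = Csv.Document(
            Web.Contents("https://raw.githubusercontent.com/ekote/azure-architect/master/Generated-NYC-Taxi-Green-Discounts.csv"),
            [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
        )
    in
        Source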

Transform the discount data

  1. Reviewing the data, we see that the headers appear to be in the first row. Promote them by opening the table's context menu at the top left of the preview grid area and selecting Use first row as headers.

    Screenshot showing the selection of the Use first row as headers option from the table context menu.

    Note

    After promoting the headers, you can see a new step added to the Applied steps pane at the top of the dataflow editor that sets the data types of your columns.

  2. Right-click the VendorID column, and from the context menu displayed, select the option Unpivot other columns. This allows you to transform columns into attribute-value pairs, where columns become rows.

    Screenshot showing the context menu for the VendorID column with the Unpivot other columns selection highlighted.

  3. With the table unpivoted, rename the Attribute and Value columns by double-clicking them and changing Attribute to Date and Value to Discount.

    Screenshot showing the table columns after renaming Attribute to Date and Value to Discount.

  4. Change the data type of the Date column by selecting the data type menu to the left of the column name and choosing Date.

    Screenshot showing the selection of the Date data type for the Date column.

  5. Select the Discount column, and then select the Transform tab on the ribbon. Select Number column, then Standard numeric transformations from the submenu, and choose Divide.

    Screenshot showing the selection of the Divide option to transform data in the Discount column.

  6. On the Divide dialog, enter the value 100, and then select OK.

    Screenshot showing the Divide dialog with the value 100 entered and the OK button highlighted.
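
Taken together, the steps in this section correspond roughly to the following M sketch. The editor generates equivalent steps for you; step names are illustrative, and Source stands for the CSV query created above.

    let
        // Promote the first row to column headers
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        // Unpivot every column except VendorID so each date becomes a row
        Unpivoted = Table.UnpivotOtherColumns(PromotedHeaders, {"VendorID"}, "Attribute", "Value"),
        // Rename the generated attribute-value columns
        Renamed = Table.RenameColumns(Unpivoted, {{"Attribute", "Date"}, {"Value", "Discount"}}),
        // Set the Date column to the Date type
        ChangedType = Table.TransformColumnTypes(Renamed, {{"Date", type date}}),
        // Convert the discount from a percentage to a fraction (divide by 100)
        Divided = Table.TransformColumns(ChangedType, {{"Discount", each _ / 100, type number}})
    in
        Divided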

Combine trips and discounts data

The next step is to combine both tables into a single table that has the discount that should be applied to the trip, and the adjusted total.

  1. First, toggle the Diagram view button so you can see both of your queries.

    Screenshot showing the Diagram view toggle button with both queries created in this tutorial displayed.

  2. Select the nyc_taxi query. On the Home tab, select the Combine menu and choose Merge queries, then Merge queries as new.

    Screenshot showing the Merge queries as new selection for the nyc_taxi query.

  3. On the Merge dialog, select Generated-NYC-Taxi-Green-Discounts from the Right table for merge dropdown, and then select the "light bulb" icon on the top right of the dialog to see the suggested mapping of columns between the two tables.

    Screenshot showing the configuration of the Merge dialog with suggested column mappings displayed.

    Choose each of the two suggested column mappings, one at a time, mapping the VendorID and date columns from both tables. When both mappings are added, the matched column headers are highlighted in each table.

  4. A message is shown asking you to allow combining data from multiple data sources to view the results. Select OK on the Merge dialog.

    Screenshot showing the request to approve combining data from multiple data sources, with the OK button highlighted.

  5. In the table area, you'll initially see a warning that "The evaluation was canceled because combining data from multiple sources may reveal data from one source to another. Select continue if the possibility of revealing data is okay." Select Continue to display the combined data.

    Screenshot showing the warning about combining data from multiple data sources with the Continue button highlighted.

  6. Notice how a new query was created in Diagram view, showing the relationship of the new Merge query to the two queries you previously created. In the table pane of the editor, scroll to the right of the Merge query's column list to see that a new column with table values is present. This is the Generated-NYC-Taxi-Green-Discounts column, and its type is [Table]. In the column header there's an icon with two arrows going in opposite directions, which allows you to select columns from the table. Deselect all of the columns except Discount, and then select OK.

    Screenshot showing the merged query with the column selection menu displayed for the newly generated column Generated-NYC-Taxi-Green-Discounts.

  7. With the discount value now at the row level, we can create a new column to calculate the total amount after discount. To do so, select the Add column tab at the top of the editor, and choose Custom column from the General group.

    Screenshot showing the Add custom column button highlighted on the General section of the Add column tab.

  8. On the Custom column dialog, you can use the Power Query formula language (also known as M) to define how your new column should be calculated. Enter TotalAfterDiscount for the New column name, select Currency for the Data type, and provide the following M expression for the Custom column formula:

    if [totalAmount] > 0 then [totalAmount] * (1 - [Discount]) else [totalAmount]

    Then select OK.

    Screenshot showing the Custom column configuration screen with the New column name, Data type and Custom column formula highlighted.

  9. Select the newly created TotalAfterDiscount column, and then select the Transform tab at the top of the editor window. In the Number column group, select the Rounding dropdown, and then choose Round....

    Screenshot showing the Round... option on the Transform tab of the editor window.

  10. On the Round dialog, enter 2 for the number of decimal places and then select OK.

    Screenshot showing the Round dialog with 2 for the number of decimal places and the OK button highlighted.

  11. Change the data type of the lpepPickupDatetime column from Date back to Date/Time.

    Screenshot showing the selection of the Date/Time data type for the lpepPickupDatetime column.

  12. Finally, expand the Query settings pane from the right side of the editor if it isn't already expanded, and rename the query from Merge to Output.

    Screenshot showing the renaming of the query from Merge to Output.
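
For reference, the merge and the steps that follow it in this section are roughly equivalent to the M sketch below. The query and column names assume the names used in this tutorial, and the editor generates the actual step code for you.

    let
        // Left outer join nyc_taxi to the discounts query on VendorID and pickup date
        Merged = Table.NestedJoin(
            nyc_taxi, {"VendorID", "lpepPickupDatetime"},
            #"Generated-NYC-Taxi-Green-Discounts", {"VendorID", "Date"},
            "Generated-NYC-Taxi-Green-Discounts", JoinKind.LeftOuter),
        // Keep only the Discount column from the nested table column
        Expanded = Table.ExpandTableColumn(Merged, "Generated-NYC-Taxi-Green-Discounts", {"Discount"}),
        // Compute the total after discount for paid trips
        AddedTotal = Table.AddColumn(Expanded, "TotalAfterDiscount",
            each if [totalAmount] > 0 then [totalAmount] * (1 - [Discount]) else [totalAmount],
            Currency.Type),
        // Round the new column to two decimal places
        Rounded = Table.TransformColumns(AddedTotal, {{"TotalAfterDiscount", each Number.Round(_, 2), Currency.Type}}),
        // Restore lpepPickupDatetime to the Date/Time type
        Output = Table.TransformColumnTypes(Rounded, {{"lpepPickupDatetime", type datetime}})
    in
        Output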

Load the output query to a table in the Lakehouse

With the output query now fully prepared and with data ready to output, we can define the output destination for the query.

  1. Select the Output merge query created previously. Then, on the Home tab of the editor, select Add data destination from the Query group, and choose Lakehouse as the destination.

    Screenshot showing the Add data destination button with Lakehouse highlighted.

  2. On the Connect to data destination dialog, your connection should already be selected. Select Next to continue.

  3. On the Choose destination target dialog, browse to the Lakehouse where you wish to load the data and name the new table nyc_taxi_with_discounts, then select Next again.

    Screenshot showing the Choose destination target dialog with Table name nyc_taxi_with_discounts.

  4. On the Choose destination settings dialog, leave the default Replace update method, double check that your columns are mapped correctly, and select Save settings.

    Screenshot showing the Choose destination settings dialog with the Save settings button highlighted.

  5. Back in the main editor window, confirm that you see your output destination on the Query settings pane for the Output table, and then select Publish.

    Important

    When the first Dataflow Gen2 is created in a workspace, Lakehouse and Warehouse items are provisioned along with their related SQL analytics endpoint and semantic models. These items are shared by all dataflows in the workspace, are required for Dataflow Gen2 to operate, shouldn't be deleted, and aren't intended to be used directly by users. The items are an implementation detail of Dataflow Gen2. They aren't visible in the workspace, but might be accessible in other experiences such as the Notebook, SQL analytics endpoint, Lakehouse, and Warehouse experiences. You can recognize these items by the `DataflowsStaging` prefix in their names.

  6. (Optional) On the workspace page, you can rename your dataflow by selecting the ellipsis to the right of the dataflow name that appears after you select the row, and choosing Properties.

    Screenshot showing the Properties option selected on the menu for a dataflow where it can be renamed.

  7. Select the refresh icon for the dataflow after selecting its row, and when complete, you should see your new Lakehouse table created as configured in the Data destination settings.

    Screenshot showing the selection of the refresh button to refresh the dataflow.

  8. Check your Lakehouse to view the new table loaded there.

In this second module of our end-to-end tutorial for your first data integration using Data Factory in Microsoft Fabric, you learned how to:

  • Create a new Dataflow Gen2.
  • Import and transform sample data.
  • Import and transform text/CSV data.
  • Merge data from both data sources into a new query.
  • Transform data and generate new columns in a query.
  • Configure an output destination for a query.
  • Rename and refresh your new dataflow.

Continue to the next section now to integrate your data pipeline.