Quickstart: Create your first dataflow to get and transform data
Dataflows are a self-service, cloud-based, data preparation technology. In this article, you'll create your first dataflow, get data for your dataflow, then transform the data and publish the dataflow.
Microsoft Fabric is currently in PREVIEW. This information relates to a prerelease product that may be substantially modified before it's released. Microsoft makes no warranties, expressed or implied, with respect to the information provided here. Refer to Azure Data Factory documentation for the service in Azure.
The following prerequisites are required before you start:
- A Microsoft Fabric tenant account with an active subscription. Create a free account.
- Make sure you have a Microsoft Fabric enabled Workspace: Create a workspace.
Create a dataflow
In this section, you'll be creating your first dataflow.
Switch to the Data factory or Power BI experience.
Navigate to your Microsoft Fabric workspace.
Select New, and then select Dataflow Gen2.
Let's get some data! In this example, you'll be getting data from an OData service. Use the following steps to get data in your dataflow.
In the dataflow editor, select Get data and then select More.
Select OData as the data source.
Enter the URL
https://services.odata.org/v4/northwind/northwind.svc/, and then select Next.
Select the Orders and Customers tables, and then select Create.
You can learn more about the get data experience and functionality at Getting data overview.
Apply transformations and publish
You have now loaded your data into your first dataflow, congratulations! Now it's time to apply a couple of transformations in order to bring this data into the desired shape.
You'll be doing this task from the Power Query editor. You can find a detailed overview of the Power Query editor at The Power Query user interface.
Follow these steps to apply transformations and publish:
Make sure you've enabled the diagram view using the options under the View tab in the Power Query editor ribbon. Also ensure that the Data Profiling tools are enabled by navigating to Home > Options > Global Options.
Within the Orders table, you'll calculate the total number of orders per customer. To achieve this goal, select the CustomerID column in the data preview and then select Group By under the Transform tab in the ribbon.
You'll perform a count of rows as the aggregation within Group By. You can learn more about Group By capabilities at Grouping or summarizing rows.
After grouping data in the Orders table, we'll obtain a two-column table with CustomerID and Count as the columns.
Next, you want to combine data from the Customers table with the Count of Orders per customer. To combine data, select the Customers query in the Diagram View and use the "⋮" menu to access the Merge queries as new transformation.
Configure the Merge operation as shown in the following screenshot by selecting CustomerID as the matching column in both tables. Then select Ok.
Screenshot of the Merge window, with the Left table for merge set to the Customers table and the Right table for merge set to the Orders table. The CustomerID column is selected for both the Customers and Orders tables. Also, the Join Kind is set to Left outer. All other selections are set to their default value.
Upon performing the Merge queries as new operation, you'll obtain a new query with all columns from the Customers table and one column with nested data from the Orders table.
In this example, you're only interested in a subset of columns in the Customers table. You'll select those columns by using the schema view. Enable the schema view within the toggle button on the bottom-right corner of the dataflows editor.
The schema view provides a focused view into a table’s schema information, including column names and data types. Schema view has a set of schema tools available through a contextual ribbon tab. In this scenario, you'll select the CustomerID, CompanyName, and Orders (2) columns, then select the Remove columns button, and then select Remove other columns in the Schema tools tab.
The Orders (2) column contains nested information resulting from the merge operation you performed a few steps ago. Now, switch back to the data view by selecting the Show data view button next to the Show schema view button in the bottom-right corner of the UI. Then use the Expand Column transformation in the Orders (2) column header to select the Count column.
Screenshot of the dataflows editor with the Show data view button in the bottom-right corner, the expand column icon in the Orders (2) column, and the Count column selected in the expand column window.
As the final operation, you want to rank your customers based on their number of orders. Select the Count column and then select the Rank column button under the Add Column tab in the ribbon.
Keep the default settings in Rank Column. Then select OK to apply this transformation.
Now rename the resulting query as Ranked Customers using the Query settings pane on the right side of the screen.
You've finished transforming and combining your data. So, you'll now configure its output destination settings. Select Choose data destination at the bottom of the Query settings pane.
For this step, you can configure an output to your Azure SQL database if you have one available, or skip this step if you don't. Within this experience, you're able to configure the destination server, database, and table for your query results, in addition to the update method (Append or Replace).
Your dataflow is now ready to be published. Review the queries in the diagram view, and then select Publish.
You'll now be returned to the workspace. A spinner icon next to your dataflow name indicates publishing is in progress. Once the publishing completes, your dataflow is ready to refresh!
When the first Dataflow Gen2 is created in a workspace, Lakehouse and Warehouse items will be provisioned along with their related SQL Endpoint and Datasets and seen in the below image. These items are shared by all dataflows in the workspace and are required for Dataflows Gen2 to operate, should not be deleted, and should not be used outside of dataflows experience. The items are an implementation detail of Dataflow Gen2 and while currently visible, they will be hidden in the future.
In your workspace, select the Schedule Refresh icon.
Turn on the scheduled refresh, select "Add another time", and configure the refresh as shown in the following screenshot.
Screenshot of the scheduled refresh options, with scheduled refresh turned on, the refresh frequency set to Daily, the Time zone set to coordinated universal time, and the Time set to 4:00 AM. The on button, the Add another time selection, the dataflow owner, and the apply button are all emphasized.
Clean up resources
If you're not going to continue to use this dataflow, delete the dataflow using the following steps:
Navigate to your Microsoft Fabric workspace.
Select the vertical ellipsis next to the name of your dataflow and then select Delete.
Select Delete to confirm the deletion of your dataflow.
The dataflow in this sample shows you how to load and transform data in Dataflow Gen2. You learned how to:
- Create a Dataflow Gen2.
- Transform data.
- Configure destination settings for transformed data.
- Run and schedule your data pipeline.
Advance to the next article to learn how to create your first data pipeline.