Load data to Lakehouse using partition in a Data pipeline
The partition feature in Lakehouse table as destination offers the capability to load data to Lakehouse table with partitions. The partitions are generated in Lakehouse destination, and then benefit the downstream jobs or consumption.
This tutorial helps you learn how to load data to Lakehouse using partition in a Data pipeline. As an example, you will try to load sample dataset into Lakehouse using one or multiple partition columns by taking the following steps. The sample dataset Public Holidays is used as sample data.
Prerequisite
- Make sure you have a Project Microsoft Fabric enabled Workspace: Create a workspace.
Create a Data pipeline
Navigate to Power BI.
Select the Power BI icon in the bottom left of the screen, then select Data factory to open homepage of Data Factory.
Navigate to your Microsoft Fabric workspace. If you created a new workspace in the prior Prerequisites section, use this one.
Select Data pipeline and then input a pipeline name to create a new pipeline.
Load data to Lakehouse using partition columns
Open your Data pipeline and add a copy activity by selecting Add pipeline activity -> Copy data. Under Source, select Sample dataset, and select Browse, then select Public Holidays.
Under Destination tab, select Workspace in Data store type, then select Lakehouse in Workspace data store type, specify your Lakehouse or select + New to create a new Lakehouse. Choose Table in Root folder and specify your table name.
Expand Advanced, in Table action, select Overwrite, and then select Enable partition, under Partition columns, select Add column, and choose the column you want to use as the partition column. You can choose to use a single column or multiple columns as the partition column.
If you use a single column, countryOrRegion (string type) is selected as an example in this tutorial. The data will be partitioned by different column values.
Note
The partition column that can be selected should be string, integer, boolean and datetime type. Columns of other data types are not displayed in the drop-down list.
If you use multiple partition columns, add one more column and select isPaidTimeOff which is boolean type as an example. Then run the pipeline. The logic is that the table is partitioned by the first added column values firstly, and then the partitioned data continue to be partitioned by the second added column values.
Tip
You can drag columns to change the sequence of columns, and the partition sequence will also change.
Select the Run and select Save and run to run the pipeline.
After the pipeline runs successfully, go to your Lakehouse. Find the table that you copied. Right-click the table name and select View files.
For one partition column (countryOrRegion), the table is partitioned to different folders by country or region names. The special character in column name is encoded, and you may see the file name is different from column values when you view files in Lakehouse.
For multiple partition columns, you will find the table is partitioned into different folders by country or region names.
Select one folder, for example contryOrRegion=United States. The table partitioned by the country or region name is partitioned again by the added second column isPaidTimeOff’s value:
True
orFalse
or__HIVE_DEFAULT_PARTITION__
(represents empty value in Sample dataset).Similarly, if you add three columns to partition the table, you will get the second level folder partitioned by the third column added.
Next steps
Next, advance to learn more about copy from Azure Blob Storage to Lakehouse.
Feedback
https://aka.ms/ContentUserFeedback.
Coming soon: Throughout 2024 we will be phasing out GitHub Issues as the feedback mechanism for content and replacing it with a new feedback system. For more information see:Submit and view feedback for