Power BI usage scenarios: Advanced data preparation
Note
This article forms part of the Power BI implementation planning series of articles. This series focuses primarily on the Power BI experience within Microsoft Fabric. For an introduction to the series, see Power BI implementation planning.
Data preparation (sometimes referred to as ETL, which is an acronym for Extract, Transform, and Load) activities often involve significant effort. The time, skill, and effort involved with collecting, cleaning, combining, and enriching data depend on the quality and structure of the source data.
Investing time and effort in centralized data preparation helps to:
- Enhance reusability and gain maximum value from data preparation efforts.
- Improve the ability to provide consistent data to multiple teams.
- Reduce the level of effort required by other content creators.
- Achieve scale and performance.
The advanced data preparation usage scenario expands on the self-service data preparation scenario. Advanced data preparation is about increasing dataflow reuse by multiple users across various teams and for various use cases.
Separate workspaces, organized by dataflow purpose, are helpful when dataflow output is provided to multiple semantic model creators, especially when they are on different teams in the organization. Separate workspaces are also helpful for managing security roles when the people who create and manage dataflows are different from the people who consume them.
Note
The advanced data preparation scenario is the second of the data preparation scenarios. This scenario builds upon what can be done with centralized dataflows as described in the self-service data preparation scenario.
The advanced data preparation scenario is one of the self-service BI scenarios. However, a centralized team member can use the techniques in a similar way to what's described in the managed self-service BI scenario. For a complete list of the self-service scenarios, see the Power BI usage scenarios article.
For brevity, some aspects described in the content collaboration and delivery scenario articles aren't covered in this article. For complete coverage, read those articles first.
Scenario diagram
Tip
We recommend that you review the self-service data preparation usage scenario if you're not familiar with it. The advanced data preparation scenario builds upon it.
The focus of this advanced data preparation scenario is on:
- The use of separate dataflows based on purpose: staging, transformation, or final. We recommend using composable building blocks to obtain greater reuse, in various combinations, to support specific user requirements. Composable building blocks are described later in this article.
- The use of separate workspaces that support dataflow creators or dataflow consumers. Data modelers, who consume dataflows, might be on different teams and/or have different use cases.
- The use of linked tables (also known as linked entities), computed tables (also known as computed entities), and the enhanced compute engine.
Note
Sometimes the terms semantic model and data model are used interchangeably. Generally, from a Power BI service perspective, it's referred to as a semantic model. From a development perspective, it's referred to as a data model (or model for short). In this article, both terms have the same meaning. Similarly, a semantic model creator and a data modeler have the same meaning.
The following diagram depicts a high-level overview of the most common user actions and Power BI components that support the advanced data preparation scenario.
Tip
We encourage you to download the scenario diagram if you'd like to embed it in your presentation, documentation, or blog post—or print it out as a wall poster. Because it's a Scalable Vector Graphics (SVG) image, you can scale it up or down without any loss of quality.
The scenario diagram depicts the following user actions, tools, and features:
Item | Description |
---|---|
1 | The dataflow creator develops a collection of tables within a dataflow. For a dataflow that's intended for reuse, it's common (but not required) that the creator belongs to a centralized team that supports users across organizational boundaries (such as IT, enterprise BI, or the Center of Excellence). |
2 | The dataflow connects to data from one or more data sources. |
3 | Some data sources may require an On-premises data gateway or VNet gateway for data refresh, like those that reside within a private organizational network. These gateways are used both for authoring the dataflow in Power Query Online and refreshing the dataflow. |
4 | All of the workspaces involved have their license mode set to Fabric capacity, Premium capacity, Premium Per User, or Embedded. These license modes allow for the use of linked tables and computed tables across workspaces, which are required in this scenario. |
5 | Dataflow creators develop dataflows by using Power Query Online, which is a web-based version of Power Query. |
6 | A staging dataflow is created in a workspace that's dedicated to the centralized management of dataflows. A staging dataflow copies the raw data as-is from the source. Few, if any, transformations are applied. |
7 | A transformation dataflow (also known as a cleansed dataflow) is created in the same workspace. It sources data by using linked table(s) to the staging dataflow. Computed table(s) include transformation steps that prepare, cleanse, and reshape the data. |
8 | Dataflow creators have access to manage content in the workspace that's dedicated to the centralized management of dataflows. |
9 | One or more other workspaces exist that are intended to provide access to the final dataflow, which delivers production-ready data to data models. |
10 | The final dataflow is created in a workspace available to data modelers. It sources data by using linked table(s) to the transformation dataflow. Computed table(s) represent the prepared output that's visible to data modelers who are granted the workspace viewer role. |
11 | Semantic model creators (who consume the dataflow output) have viewer access to the workspace that contains the final dataflow output. Dataflow creators also have access to manage and publish content in the workspace (not depicted in the scenario diagram). |
12 | Semantic model creators use the final dataflow as a data source when developing a data model in Power BI Desktop. When ready, the semantic model creator publishes the Power BI Desktop file (.pbix) that contains the data model to the Power BI service (not depicted in the scenario diagram). |
13 | Fabric administrators manage settings in the Admin portal. |
14 | In the Admin portal, Power BI administrators can set up Azure connections to store dataflow data in their Azure Data Lake Storage Gen2 (ADLS Gen2) account. Settings include assigning a tenant-level storage account and enabling workspace-level storage permissions. |
15 | By default, dataflows store data by using internal storage that's managed by the Power BI service. Optionally, data output by the dataflow can be stored in the organization's ADLS Gen2 account. |
16 | Fabric administrators oversee and monitor activity in the Fabric portal. |
Key points
The following are some key points to emphasize about the advanced data preparation scenario.
Dataflows
A dataflow comprises a collection of tables (also known as entities). Each table is defined by a query, which contains the data preparation steps required to load the table with data. All work to create a dataflow is done in Power Query Online. You can create a dataflow in multiple products, including Power Apps, Dynamics 365 Customer Insights, and Power BI.
Note
You can't create dataflows in a personal workspace in the Power BI service.
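For illustration, here's a minimal sketch of the kind of Power Query (M) query that defines a dataflow table. The server, database, table, and column names are hypothetical placeholders.

```powerquery-m
// Minimal sketch of a dataflow table query (all source names are hypothetical).
let
    // Connect to the source database.
    Source = Sql.Database("sales-server.contoso.com", "SalesDB"),
    // Navigate to the table to load into the dataflow.
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Data preparation steps: keep only the required columns and rows.
    SelectedColumns = Table.SelectColumns(Orders, {"OrderID", "OrderDate", "CustomerID", "Amount"}),
    CompletedOrders = Table.SelectRows(SelectedColumns, each [Amount] <> null)
in
    CompletedOrders
```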
Types of dataflows
Use of composable building blocks is a design principle that allows you to manage, deploy, and secure system components, and then use them in various combinations. Creating modular, self-contained dataflows that are specific to a purpose is a best practice. They help to achieve data reuse and enterprise scale. Modular dataflows are also easier to manage and test.
Three types of dataflows are shown in the scenario diagram: staging dataflow, transformation dataflow, and final dataflow.
Staging dataflow
A staging dataflow (sometimes called a data extraction dataflow) copies raw data as-is from the source. Having the raw data extracted with minimal transformation means that downstream transformation dataflows (described next) can use the staging dataflow as their source; a brief sketch follows the list below. This modularity is useful when:
- Access to a data source is restricted to narrow time windows and/or to a few users.
- Temporal consistency is desired to ensure that all downstream dataflows (and related semantic models) deliver data that was extracted from the data source at the same time.
- Reducing the number of queries submitted to the data source is necessary due to source system restrictions or its limited ability to support analytical queries.
- A copy of the source data is useful for reconciliation processes and data quality verifications.
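Continuing the hypothetical example above, a staging dataflow table might be little more than a raw copy of the source:

```powerquery-m
// Staging table: extract the raw data as-is, with no cleansing or reshaping.
let
    Source = Sql.Database("sales-server.contoso.com", "SalesDB"),
    RawCustomers = Source{[Schema = "dbo", Item = "Customers"]}[Data]
in
    RawCustomers
```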
Transformation dataflow
A transformation dataflow (sometimes called a cleansed dataflow) sources its data from linked tables that connect to the staging dataflow. It's a best practice to separate out transformations from the data extraction process.
A transformation dataflow includes all the transformation steps required to prepare and restructure the data. However, there's still a focus on reusability at this layer to ensure the dataflow is suitable for multiple use cases and purposes.
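As a hedged sketch of this pattern, a transformation dataflow could contain a linked table that references the staging dataflow, plus a computed table that cleanses the linked data. Power Query Online generates the linked table query when you connect to another dataflow; the workspace and dataflow IDs below are placeholders, and the exact navigation steps can vary.

```powerquery-m
// Linked table: references the RawCustomers table in the staging dataflow.
// No data is duplicated; this query only points to the source dataflow.
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    Workspace = Workspaces{[workspaceId = "<staging-workspace-id>"]}[Data],
    Dataflow = Workspace{[dataflowId = "<staging-dataflow-id>"]}[Data],
    RawCustomers = Dataflow{[entity = "RawCustomers", version = ""]}[Data]
in
    RawCustomers
```

A computed table then references the linked table by name and applies the transformation steps. Its output is persisted when the dataflow refreshes.

```powerquery-m
// Computed table: cleanses the linked table (column names are hypothetical).
let
    Source = RawCustomers,
    // Remove rows that fail a basic data quality check.
    ValidRows = Table.SelectRows(Source, each [CustomerID] <> null),
    // Standardize text values.
    Cleansed = Table.TransformColumns(ValidRows, {{"CustomerName", Text.Trim, type text}})
in
    Cleansed
```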
Final dataflow
A final dataflow represents the prepared output. Some additional transformations might occur based on the use case and purpose. For analytics, a star schema table (dimension or fact) is the preferred design of the final dataflow.
Computed tables are visible to data modelers who are granted the workspace viewer role. This table type is described in the types of dataflow tables section below.
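As a minimal sketch, a computed table in the final dataflow might shape the cleansed data into a dimension. It assumes a linked table named Customers that references the transformation dataflow; the column names are hypothetical.

```powerquery-m
// Computed table: shape the cleansed data into a star schema dimension.
let
    Source = Customers,
    // Keep only the attributes the dimension needs.
    DimColumns = Table.SelectColumns(Source, {"CustomerID", "CustomerName", "Region", "Segment"}),
    // One row per customer.
    Deduplicated = Table.Distinct(DimColumns, {"CustomerID"}),
    // Add a surrogate key for the data model.
    DimCustomer = Table.AddIndexColumn(Deduplicated, "CustomerKey", 1, 1, Int64.Type)
in
    DimCustomer
```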
Note
Data lakes often have zones, like bronze, silver, and gold. The three types of dataflows represent a similar design pattern. To make the best possible data architecture decisions, give thought to who will maintain the data, the expected use of the data, and the skill level required by people accessing the data.
Workspaces for dataflows
If you were to create all dataflows in a single workspace, it would significantly limit the extent of reusability. Using a single workspace also limits the security options available when supporting multiple types of users across teams and/or for different use cases. We recommend using multiple workspaces. They provide better flexibility when you need to support self-service creators from various areas of the organization.
The two types of workspaces shown in the scenario diagram include:
- Workspace 1: It stores centrally managed dataflows (sometimes referred to as a backend workspace). It contains both the staging and transformation dataflows because they're managed by the same people. Dataflow creators are often from a centralized team, such as IT, BI, or the Center of Excellence. They should be assigned the workspace admin, member, or contributor role.
- Workspace 2: It stores and delivers the final dataflow output to consumers of the data (sometimes referred to as a user workspace). Semantic model creators are often self-service analysts, power users, or citizen data engineers. They should be assigned to the workspace viewer role because they only need to consume the output of the final dataflow. To support semantic model creators from various areas of the organization, you can create numerous workspaces like this one, based on use case and security needs.
Tip
We recommend reviewing ways to support semantic model creators as described in the self-service data preparation usage scenario. It's important to understand that semantic model creators can still use the full capabilities of Power Query within Power BI Desktop. They can choose to add query steps to further transform the dataflow data or merge the dataflow output with other sources.
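For example, a semantic model creator might merge the dataflow output with a source that the dataflow doesn't cover. The following sketch assumes two queries in Power BI Desktop: Customers (connected to the final dataflow) and RegionTargets (a hypothetical query over a local workbook).

```powerquery-m
// Merge final dataflow output with another source in Power BI Desktop.
let
    Source = Customers,
    // Left outer join on the shared key column.
    Merged = Table.NestedJoin(Source, {"CustomerID"}, RegionTargets, {"CustomerID"}, "Targets", JoinKind.LeftOuter),
    // Expand the column of interest from the joined table.
    Expanded = Table.ExpandTableColumn(Merged, "Targets", {"TargetAmount"})
in
    Expanded
```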
Types of dataflow tables
Three types of dataflow tables (also known as entities) are depicted in the scenario diagram.
- Standard table: Queries an external data source, such as a database. In the scenario diagram, standard tables are depicted in the staging dataflow.
- Linked table: References a table from another dataflow. A linked table doesn't duplicate the data. Rather, it allows the reuse of a standard table multiple times for multiple purposes. Linked tables aren't visible to workspace viewers since they inherit permissions from the original dataflow. In the scenario diagram, linked tables are depicted twice:
- In the transformation dataflow for accessing the data in the staging dataflow.
- In the final dataflow for accessing the data in the transformation dataflow.
- Computed table: Performs additional computations by using a different dataflow as its source. Computed tables allow customizing the output as needed for individual use cases. In the scenario diagram, computed tables are depicted twice:
- In the transformation dataflow for performing common transformations.
- In the final dataflow for delivering output to semantic model creators. Since computed tables persist the data again (after the dataflow refresh), data modelers can access the computed tables in the final dataflow. In this case, data modelers should be granted access with the workspace viewer role.
Note
There are many design techniques, patterns, and best practices that can take dataflows from self-service to enterprise-ready. Also, dataflows in a workspace that has its license mode set to Premium Per User, Premium capacity, or Fabric capacity can benefit from advanced features. Linked tables and computed tables are two advanced features that are essential for increasing the reusability of dataflows.
Enhanced compute engine
The enhanced compute engine is an advanced feature available with Power BI Premium.
Important
At times this article refers to Power BI Premium or its capacity subscriptions (P SKUs). Be aware that Microsoft is currently consolidating purchase options and retiring the Power BI Premium per capacity SKUs. New and existing customers should consider purchasing Fabric capacity subscriptions (F SKUs) instead.
For more information, see Important update coming to Power BI Premium licensing and Power BI Premium FAQ.
The enhanced compute engine improves the performance of linked tables (within the same workspace) that reference the dataflow. To get maximum benefit from the enhanced compute engine:
- Split out the staging and transformation dataflows.
- Use the same workspace to store the staging and transformation dataflows.
- Apply complex operations that can fold to the data source early in the query steps. Prioritizing foldable operations can help to achieve the best refresh performance (a sketch follows this list).
- Use incremental refresh to reduce refresh durations and resource consumption.
- Perform testing early and frequently during the development phase.
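To illustrate the folding guidance, here's a hedged sketch (hypothetical source and column names) that places foldable filter and column selection steps before a non-foldable step:

```powerquery-m
let
    Source = Sql.Database("sales-server.contoso.com", "SalesDB"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    // Foldable steps first: these typically translate into the source SQL query.
    RecentOrders = Table.SelectRows(Orders, each [OrderDate] >= #date(2024, 1, 1)),
    SelectedColumns = Table.SelectColumns(RecentOrders, {"OrderID", "OrderDate", "Amount"}),
    // Non-foldable step last: the index is computed by the mashup engine.
    Indexed = Table.AddIndexColumn(SelectedColumns, "RowID", 1, 1, Int64.Type)
in
    Indexed
```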
Dataflow and semantic model refresh
A dataflow is a source of data for semantic models. In most cases, multiple data refresh schedules are involved: one for each dataflow and one for each semantic model. Alternatively, it's possible to use DirectQuery from the semantic model to the dataflow, which requires Power BI Premium and the enhanced compute engine (not depicted in the scenario diagram).
Azure Data Lake Storage Gen2
An ADLS Gen2 account is a specific type of Azure storage account that has the hierarchical namespace enabled. ADLS Gen2 has performance, management, and security advantages for operating analytical workloads. By default, Power BI dataflows use internal storage, which is a built-in data lake account managed by the Power BI service. Optionally, organizations can bring their own data lake by connecting to an ADLS Gen2 account in their organization.
Here are some advantages of using your own data lake:
- Users (or processes) can directly access the dataflow data stored in the data lake. That's helpful when dataflow reuse occurs beyond Power BI. For example, Azure Data Factory could access the dataflow data (a sketch follows this list).
- Other tools or systems can manage the data in the data lake. In this case, Power BI could consume the data rather than manage it (not depicted in the scenario diagram).
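As a hedged sketch of direct access, a process could read the dataflow output (stored as CSV files with a model.json metadata file in a CDM folder) from the organization's ADLS Gen2 account. The storage account and path are hypothetical.

```powerquery-m
// Read dataflow output directly from the organization's ADLS Gen2 account.
let
    // Connect to the container that holds the dataflow's CDM folder.
    Source = AzureStorage.DataLake("https://contosodatalake.dfs.core.windows.net/powerbi"),
    // Locate the CSV file(s) for the table of interest.
    CustomerFiles = Table.SelectRows(Source, each Text.Contains([Name], "Customers")),
    FirstFile = CustomerFiles{0}[Content],
    // Parse the file content as CSV.
    Imported = Csv.Document(FirstFile)
in
    Imported
```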
When using linked tables or computed tables, make sure that each workspace is assigned to the same ADLS Gen2 storage account.
Note
Dataflow data in ADLS Gen2 is stored within a Power BI-specific container. This container is depicted in the self-service data preparation usage scenario diagram.
Admin portal settings
There are two important settings to manage in the Admin portal:
- Azure connections: The Azure connections section of the Admin portal includes a setting to set up a connection to an ADLS Gen2 account. This setting allows a Power BI administrator to bring the organization's own data lake to dataflows. Once configured, workspaces can use that data lake account for storage.
- Workspace-level storage: A Power BI administrator can set workspace-level storage permissions. When enabled, the setting allows workspace administrators to use a different storage account from the one set at the tenant level. Enabling this setting is helpful for decentralized business units that manage their own data lake in Azure.
Gateway setup
Typically, an On-premises data gateway is required for connecting to data sources that reside within a private organizational network or a virtual network.
A data gateway is required when:
- Authoring a dataflow in Power Query Online that connects to private organizational data.
- Refreshing a dataflow that connects to private organizational data.
Tip
Dataflows require a centralized data gateway in standard mode. A gateway in personal mode isn't supported when working with dataflows.
System oversight
The activity log records user activities that occur in the Power BI service. Power BI administrators can use the activity log data that's collected to perform auditing to help them understand usage patterns and adoption. The activity log is also valuable for supporting governance efforts, security audits, and compliance requirements. In the advanced data preparation scenario, the activity log data is helpful to track the management and use of dataflows.
Related content
For other useful scenarios to help you with Power BI implementation decisions, see the Power BI usage scenarios article.