Data profiling is the process of examining the data available in different data sources and collecting statistics and information about that data. Data profiling helps you assess the quality of your data against a defined set of goals. If data is of poor quality, or managed in structures that can't be integrated to meet the needs of the
enterprise, business processes and decision-making suffer. Data profiling allows you to understand the trustworthiness and quality of your data, which is a prerequisite for
making data-driven decisions that boost revenue and foster growth.
Prerequisites
To run and schedule data quality assessment scans, users must be assigned the data quality steward role.
Currently, the Microsoft Purview account can be set to allow public access or managed vNet access so that data quality scans can run.
Fabric data estate in OneLake, including shortcut and mirrored data estates. Data profiling is supported only for Lakehouse delta tables and parquet files.
Mirrored data estate: Azure Cosmos DB, Snowflake, Azure SQL
Shortcut data estate: AWS S3, GCS, ADLS Gen2, and Dataverse
Azure Synapse serverless and data warehouse
Azure Databricks Unity Catalog
Snowflake
Google BigQuery (preview)
Iceberg data in ADLS Gen2, Microsoft Fabric Lakehouse, AWS S3, and GCP GCS
Important
Data quality for Parquet files is designed to support:
A directory with Parquet part files. For example: ./Sales/{Parquet Part Files}. The fully qualified name must follow https://(storage account).dfs.core.windows.net/(container)/path/path2/{SparkPartitions}. Make sure there are no {n} patterns in the directory/subdirectory structure; it must be a direct FQN leading to {SparkPartitions}.
A directory with partitioned Parquet files, partitioned by columns within the dataset, such as sales data partitioned by year and month. For example: ./Sales/{Year=2018}/{Month=Dec}/{Parquet Part Files}.
Both of these scenarios, which present a consistent Parquet dataset schema, are supported. Limitation: arbitrary hierarchies of directories containing Parquet files aren't supported; we recommend presenting data in one of the two structures above. A minimal sketch of producing both layouts follows this list.
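To make the two supported layouts concrete, here's a minimal PySpark sketch that writes a dataset both ways. The storage account, container, table, and partition columns (Year, Month) are placeholders for illustration, not values from this article.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-layouts").getOrCreate()

# Hypothetical sales dataset that includes Year and Month columns.
sales = spark.read.table("sales")

base = "abfss://container@storageaccount.dfs.core.windows.net"

# Layout 1: a single directory of Parquet part files -> ./Sales/{Parquet Part Files}
sales.write.mode("overwrite").parquet(f"{base}/Sales")

# Layout 2: column-partitioned Parquet -> ./SalesPartitioned/Year=2018/Month=Dec/{Parquet Part Files}
sales.write.mode("overwrite").partitionBy("Year", "Month").parquet(f"{base}/SalesPartitioned")
```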
Supported authentication methods
Currently, Microsoft Purview can only run data quality scans using Managed Identity as the authentication option. Data quality services run on Apache Spark 3.4 and Delta Lake 2.4. For more information about supported regions, see the data quality overview.
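Purview acquires the managed identity internally; the following Python sketch only illustrates the managed-identity pattern itself (authenticating without keys or secrets), using the azure-identity and azure-storage-file-datalake packages against a hypothetical ADLS Gen2 account. It isn't Purview's internal code.

```python
from azure.identity import ManagedIdentityCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate as the environment's managed identity; no stored credentials.
credential = ManagedIdentityCredential()

# Hypothetical ADLS Gen2 account and container a scan might read from.
service = DataLakeServiceClient(
    account_url="https://storageaccount.dfs.core.windows.net",
    credential=credential,
)

fs = service.get_file_system_client("container")
for path in fs.get_paths(path="Sales"):
    print(path.name)
```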
Important
If the schema is updated on the data source, you must rerun the data map scan before running data profiling. You can import the schema from the data quality overview page using the schema import feature. If your data source runs on a managed vNet or behind a private endpoint, the schema import feature isn't supported.
vNet isn't supported for Azure Databricks, Google BigQuery, and Snowflake.
In the current version, you can profile 50 columns per batch. If your data asset has more than 50 columns, profile the remaining columns in additional batches.
If a column contains only distinct values (for example, a unique identifier), we recommend not profiling that column; a column whose values are all distinct can't form a normal distribution. A sketch of selecting and batching columns follows this note.
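As a small pandas sketch of both notes, assuming a hypothetical sales.parquet dataset: skip near-unique columns, then split the remaining columns into 50-column batches. The cutoff value is an illustrative assumption, not a documented threshold.

```python
import pandas as pd

BATCH_SIZE = 50             # current per-batch profiling limit
UNIQUE_RATIO_CUTOFF = 0.99  # assumed cutoff for treating a column as all-distinct

df = pd.read_parquet("sales.parquet")  # hypothetical dataset

# Skip columns whose values are (almost) all distinct, such as IDs.
profile_cols = [
    c for c in df.columns
    if df[c].nunique(dropna=True) / max(len(df), 1) < UNIQUE_RATIO_CUTOFF
]

# Split what's left into 50-column batches.
batches = [profile_cols[i:i + BATCH_SIZE] for i in range(0, len(profile_cols), BATCH_SIZE)]
for n, batch in enumerate(batches, start=1):
    print(f"batch {n}: {len(batch)} columns")
```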
1. From Microsoft Purview Unified Catalog, select Health Management, then select Data quality.
2. Select a governance domain from the list.
3. Select a data product to profile a data asset linked to that product.
4. Select a data asset to open the data quality Overview page for profiling.
5. Select the Profile button to run a profiling job for the selected data asset.
6. The AI recommendation engine suggests potentially important columns to profile. You can deselect recommended columns and/or select more columns to profile.
7. Once you've selected the relevant columns, select Run Profile.
8. When the job is complete, select the Profile tab from the left menu of the asset's data quality page to browse the profiling results and statistical snapshot. There might be several pages of profile results, depending on how many columns your data asset has.
9. Browse the profiling results and statistical measures for each column; the sketch after these steps shows the kind of per-column statistics a profile typically surfaces.
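For context, here is a pandas sketch of the kind of per-column statistics a profile surfaces (null counts, distinct counts, min/max/mean/standard deviation). The dataset name is hypothetical, and the real snapshot is computed by Purview's Spark-based service, not by this code.

```python
import pandas as pd

df = pd.read_parquet("sales.parquet")  # hypothetical dataset

# Per-column statistics similar to a profiling snapshot.
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_count": df.isna().sum(),
    "distinct_count": df.nunique(dropna=True),
})

# Numeric measures only apply to numeric columns.
numeric = df.select_dtypes("number")
profile.loc[numeric.columns, "min"] = numeric.min()
profile.loc[numeric.columns, "max"] = numeric.max()
profile.loc[numeric.columns, "mean"] = numeric.mean()
profile.loc[numeric.columns, "std_dev"] = numeric.std()

print(profile)
```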
Related content
This training module guides you through building a complete master data management and data governance stack end to end with Microsoft Purview and CluedIn, including golden records, deduplication, data lineage, and data quality strategies.
Get an overview of data quality rules in Microsoft Purview Unified Catalog, and how you can use them to increase the quality and trustworthiness of your data.
This article gives an overview of how data quality stewards can monitor data quality profiling and scanning jobs in the Microsoft Purview Unified Catalog.
This article provides information about how to manage data quality for an organization's critical data elements in the Microsoft Purview Unified Catalog.