Data quality scans evaluate your data assets against their applied data quality rules and produce a score. Your data stewards can use that score to assess data health and address any issues that might be lowering the quality of your data.
Prerequisites
To run and schedule data quality assessment scans, your users must be in the data quality steward role.
Currently, the Microsoft Purview account can be set to allow public access or managed virtual network (VNet) access so that data quality scans can run.
Fabric data estate in OneLake includes shortcut and mirrored data estates. Data quality scanning is supported only for Lakehouse delta tables and parquet files; a quick read-check sketch follows the list below.
Mirrored data estates: Azure Cosmos DB, Snowflake, Azure SQL
Shortcut data estates: AWS S3, GCS, ADLS Gen2, and Dataverse
Azure Synapse serverless and data warehouse
Azure Databricks Unity Catalog
Snowflake
Google BigQuery (preview)
Iceberg data in ADLS Gen2, Microsoft Fabric Lakehouse, AWS S3, and GCP GCS
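Because scanning only supports Lakehouse delta tables and parquet files, it can help to confirm an asset reads cleanly before scanning it. The following is a minimal PySpark sketch, assuming a hypothetical workspace, lakehouse, and table name; only the general OneLake abfss path pattern is standard.

```python
# Minimal sketch: verify a Lakehouse delta table reads cleanly before scanning.
# The workspace, lakehouse, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-precheck").getOrCreate()

# OneLake paths follow abfss://<workspace>@onelake.dfs.fabric.microsoft.com/...
table_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Tables/sales"
)

df = spark.read.format("delta").load(table_path)
df.printSchema()           # confirm the schema matches what the data map recorded
print(df.count(), "rows")  # quick sanity check that the table is populated
```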
Important
Data quality for Parquet files is designed to support:
A directory with parquet part files. For example: ./Sales/{Parquet Part Files}. The fully qualified name must follow https://(storage account).dfs.core.windows.net/(container)/path/path2/{SparkPartitions}. Make sure there are no {n} patterns in the directory or subdirectory structure; the fully qualified name must lead directly to {SparkPartitions}.
A directory with partitioned parquet files, partitioned by columns within the dataset, such as sales data partitioned by year and month. For example: ./Sales/{Year=2018}/{Month=Dec}/{Parquet Part Files}.
Both of these scenarios, which present a consistent parquet dataset schema, are supported. Limitation: the scan isn't designed to support arbitrary hierarchies of directories containing parquet files, so we recommend presenting data in one of the two structures above, as sketched below.
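To make the two supported layouts concrete, here is a minimal PySpark sketch that writes both. The storage paths and partition columns (Year, Month) are illustrative assumptions, not values mandated by the service.

```python
# Minimal sketch: produce the two supported parquet layouts with PySpark.
# Storage paths and column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-parquet-layouts").getOrCreate()
sales = spark.read.format("delta").load("/data/sales_delta")  # any source DataFrame

# Layout 1: a flat directory of parquet part files -> ./Sales/{Parquet Part Files}
sales.write.mode("overwrite").parquet(
    "abfss://container@account.dfs.core.windows.net/Sales"
)

# Layout 2: column-partitioned files -> ./Sales/Year=2018/Month=Dec/{Parquet Part Files}
(sales.write.mode("overwrite")
      .partitionBy("Year", "Month")
      .parquet("abfss://container@account.dfs.core.windows.net/SalesPartitioned"))
```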
Supported authentication methods
Currently, Microsoft Purview can only run data quality scans using Managed Identity as the authentication option. Data quality services run on Apache Spark 3.4 and Delta Lake 2.4. For more information about supported regions, see the data quality overview.
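Because the scan authenticates with the Purview account's managed identity, that identity needs read access to the source. As a rough permission check, the sketch below lists a few paths using the azure-identity and azure-storage-file-datalake packages; the account and container names are placeholders, and a failure here usually points to a missing role assignment such as Storage Blob Data Reader.

```python
# Minimal sketch: check that a managed identity can read the storage account
# holding the data to be scanned. Account and container names are placeholders.
from azure.identity import ManagedIdentityCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ManagedIdentityCredential()  # only resolves on Azure-hosted compute
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=credential,
)

# Listing paths fails fast if the identity lacks read access on the container.
fs = service.get_file_system_client("mycontainer")
for path in fs.get_paths(path="Sales", max_results=5):
    print(path.name)
```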
Important
If the schema is updated on the data source, you must rerun the data map scan before running a data quality scan. You can also use the schema import feature from the Data quality overview page.
Schema import isn't supported for data sources running on a managed virtual network or behind a private endpoint.
Managed virtual networks aren't supported for Azure Databricks, Google BigQuery, and Snowflake.
From Microsoft Purview Unified Catalog, select Health Management, then select Data quality.
Select a governance domain from the list.
Select a data product to assess the data quality of the data assets linked to that product.
Select View detail page, which takes you to the data product's data quality overview page. You can browse the existing data quality rules and add new rules by selecting Rules, and browse the schema of the data asset by selecting Schema.
Browse the rules that are already added to the scan for the selected assets, and toggle them on or off in the Status column.
Run the quality scan by selecting Run quality scan on the overview page.
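If you'd rather trigger scans from a pipeline than from the portal, the sketch below shows the general shape of such a call. The endpoint URL and request body are hypothetical placeholders, not a published Purview API; only the azure-identity token flow is a real call. Check the official REST reference before relying on this.

```python
# Hypothetical sketch: trigger a data quality scan from code.
# The endpoint URL and JSON body are illustrative placeholders, not a
# documented Purview API; verify against the official REST reference.
import requests
from azure.identity import DefaultAzureCredential

token = DefaultAzureCredential().get_token("https://purview.azure.net/.default")

resp = requests.post(
    "https://myaccount.purview.azure.com/dataquality/scans/run",  # placeholder
    headers={"Authorization": f"Bearer {token.token}"},
    json={"dataProductId": "<data-product-guid>"},                # placeholder
    timeout=30,
)
resp.raise_for_status()
print("Scan requested:", resp.status_code)
```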
Although you can run data quality scans on demand by selecting the Run quality scan button, in production scenarios the source data is typically updated continuously, so you'll want to monitor its data quality regularly to detect any issues. To manage these recurring quality scans, you can automate the scanning process.
From Microsoft Purview Unified Catalog, select Health Management, then select Data quality.
Select a governance domain from the list.
Select Manage, then select Scheduled scans.
Fill out the form on the Create scheduled scan page. Add a name and description for the source you're scheduling.
Select Continue.
On the Scope tab, select individual data products and assets, or all data products and data assets in the entire governance domain.
Select Continue.
Set a schedule based on your preferences and select Continue.
On the Review tab, select Save (or Save and run to test immediately) to complete scheduling the data quality assessment scan.
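Conceptually, the scheduled scan form captures a name, a scope, and a recurrence. The sketch below models that as a plain Python dictionary; the field names are illustrative assumptions, not a documented Purview payload schema.

```python
# Hypothetical sketch: the shape of information a scheduled scan captures.
# Field names are illustrative, not a documented Purview schema.
scheduled_scan = {
    "name": "daily-sales-dq-scan",
    "description": "Daily data quality scan for the Sales data product",
    "scope": {
        "governanceDomain": "Finance",   # or scope to the entire domain
        "dataProducts": ["Sales"],       # individual products and assets
    },
    "recurrence": {
        "frequency": "Day",              # for example Day, Week, or Month
        "interval": 1,
        "startTime": "2024-01-01T06:00:00Z",
    },
}

print(f"{scheduled_scan['name']} runs every "
      f"{scheduled_scan['recurrence']['interval']} day(s)")
```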
When you remove a data asset from a data product, if that asset has a data quality score, you must first delete the data quality score and then remove the asset from the data product.
Deleting data quality history removes the profile history, the data quality scan history, and the data quality rules, but data quality actions aren't deleted.
Follow the steps below to delete previous data quality scans:
From Microsoft Purview Unified Catalog, select the Health Management menu, then the Data quality submenu.
Select a governance domain from the list.
Select the ellipsis (...) at the top right of the page.
Select Delete data quality data to delete the history of data quality runs.
Note
We recommend only using Delete data quality data for test runs, errored data quality runs, or if you're removing a data asset from a data product.
We store up to 50 snapshots of data quality profiling and data quality assessment history. If you want to delete a specific snapshot, select the desired history run and select the delete icon.
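To illustrate the retention behavior, here is a minimal sketch of the keep-the-newest-50-snapshots rule applied client-side; the snapshot records are illustrative dictionaries, not an actual Purview data structure.

```python
# Minimal sketch: the 50-snapshot retention cap described above.
# Snapshot records are illustrative dicts with an ISO-8601 runTime field.
from operator import itemgetter

MAX_SNAPSHOTS = 50

def prune(snapshots):
    """Return (kept, dropped) after applying the retention cap."""
    ordered = sorted(snapshots, key=itemgetter("runTime"), reverse=True)
    return ordered[:MAX_SNAPSHOTS], ordered[MAX_SNAPSHOTS:]

# Example: with 53 snapshots, the 3 oldest are dropped.
history = ([{"runTime": f"2024-01-{d:02d}T06:00:00Z"} for d in range(1, 32)]
           + [{"runTime": f"2024-02-{d:02d}T06:00:00Z"} for d in range(1, 23)])
kept, dropped = prune(history)
print(len(kept), "kept,", len(dropped), "dropped")
```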