Set up data source connections for data quality assessment

Data source connections set up the authentication needed to profile your data for a statistical snapshot, or to scan your data for data quality anomalies and scoring.

Setting up data source connections is the fourth step in the data quality life cycle for a data asset. Previous steps are:

  1. Assign user(s) data quality steward permissions in your data catalog to use all data quality features.
  2. Register and scan a data source in your Microsoft Purview Data Map.
  3. Add your data asset to a data product.

Prerequisites

  1. To create connections to data assets, your users must be in the data quality steward role.
  2. You must have the Azure Owner or User Access Administrator role on the Azure resources.

Here are workarounds if you don't want to grant the Azure Resource Owner role or User Access Administrator role to all data quality stewards:

Workaround 1: Your IT admin, who has the Azure Resource Owner role or User Access Administrator role, can create a data source connection for Data Quality (DQ). This is a one-time configuration task. The IT admin needs the Data Quality Steward role only temporarily, to create the DQ connection. After the DQ connection is configured, the Data Quality Steward role can be removed from the IT admin, who has no further use for it.

Workaround 2: Your company can grant the Azure Resource Owner role to one or two data stewards who are accountable and responsible for creating data source connections for Data Quality assessment and data profiling.

Supported multi-cloud data sources

  • Azure Data Lake Storage Gen2
    • File Types: Delta Parquet and Parquet
  • Azure SQL Database
  • Fabric data estate in OneLake, including shortcut and mirroring data estates. Data Quality scanning is supported only for Lakehouse Delta tables and Parquet files.
    • Mirroring data estate: Azure Cosmos DB, Snowflake, Azure SQL
    • Shortcut data estate: AWS S3, GCS, ADLS Gen2
  • Azure Synapse serverless and data warehouse
  • Azure Databricks Unity Catalog
  • Snowflake
  • Google BigQuery (Private Preview)

Currently, Microsoft Purview can run data quality scans only with Managed Identity as the authentication option. Data Quality services run on Apache Spark 3.4 and Delta Lake 2.4.
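
As a quick way to reason about the list above, the following minimal Python sketch encodes the supported sources and file formats as a lookup table. The dictionary keys and the `is_supported` helper are hypothetical names chosen for illustration; they are not Microsoft Purview API identifiers.

```python
# Supported sources taken from the list above. Where the article restricts
# file formats, the allowed formats are given as a set; None means the
# article states no file-format restriction for that source.
# Keys are illustrative labels, not official API identifiers.
SUPPORTED_SOURCES = {
    "Azure Data Lake Storage Gen2": {"delta", "parquet"},
    "Azure SQL Database": None,
    "Fabric OneLake": {"delta", "parquet"},
    "Azure Synapse": None,
    "Azure Databricks Unity Catalog": None,
    "Snowflake": None,
    "Google BigQuery": None,  # listed as private preview
}


def is_supported(source, file_format=None):
    """Return True if the source (and format, if given) appears in the list."""
    if source not in SUPPORTED_SOURCES:
        return False
    formats = SUPPORTED_SOURCES[source]
    if formats is None or file_format is None:
        return True
    return file_format.lower() in formats


print(is_supported("Azure Data Lake Storage Gen2", "Parquet"))
print(is_supported("Azure Data Lake Storage Gen2", "csv"))
```

A check like this could run before anyone attempts to create a connection, so that unsupported sources or formats are caught early.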

Important

To access these sources, set your Azure Storage sources to have an open firewall or to Allow Trusted Azure Services, or use private endpoints and a data quality managed virtual network.
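
One possible way to apply the Allow Trusted Azure Services option is through the Azure CLI. The sketch below only builds the command line and does not run it. The `trusted_services_command` helper, the storage account name, and the resource group name are hypothetical placeholders; confirm the firewall settings against your organization's network policy before applying them.

```python
import shlex


def trusted_services_command(account_name, resource_group):
    """Build (but do not run) an Azure CLI call that denies traffic by
    default while letting trusted Azure services bypass the firewall."""
    return [
        "az", "storage", "account", "update",
        "--name", account_name,
        "--resource-group", resource_group,
        "--default-action", "Deny",
        "--bypass", "AzureServices",
    ]


# Placeholder names, assumed for illustration only.
command = trusted_services_command("examplestorage", "example-rg")
print(shlex.join(command))
```

Building the argument list separately from executing it makes the command easy to review or log before an administrator runs it.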

Set up a data source connection

  1. From Microsoft Purview Data Catalog, select the Health Management menu, and then the Data quality submenu.

  2. Select a governance domain from the list.

  3. Select the Manage button, and then select Connections from the menu to open the Connections page.

    Screenshot of the connections page in Microsoft Purview Data Quality.

  4. Select the New tab to create a new connection for the data products and data assets of your governance domain.

    Screenshot of the set up connection page in Microsoft Purview Data Quality.

  5. In the right panel, enter the following information:

    • Display name
    • Description
  6. Select Source type, and select one of the data sources.

  7. Depending on the data source, enter the access details.

  8. Test the connection. If the test succeeds, select Submit to complete the connection setup.

Tip

You can also create a connection to your resources using private endpoints and a Microsoft Purview Data Quality managed virtual network. For more information, see the managed virtual network article.

Grant Microsoft Purview permissions on the source

Now that the connection is created, your Microsoft Purview managed identity needs permissions on your data sources before it can scan them.
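
As an illustration of what granting such a permission can look like for an Azure Storage source, the following Python sketch assembles an Azure CLI role assignment command without running it. The helper names, the placeholder IDs, and the choice of the Storage Blob Data Reader role are assumptions made for this example; the role each source type actually requires should be taken from the source-specific guidance.

```python
import shlex


def storage_scope(subscription_id, resource_group, account_name):
    """Return the Azure resource ID of a storage account."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}"
    )


def role_assignment_command(principal_object_id, role, scope):
    """Build (but do not run) an Azure CLI role assignment for a managed identity."""
    return [
        "az", "role", "assignment", "create",
        "--assignee-object-id", principal_object_id,
        "--assignee-principal-type", "ServicePrincipal",
        "--role", role,
        "--scope", scope,
    ]


# All values below are placeholders, assumed for illustration only.
scope = storage_scope(
    "00000000-0000-0000-0000-000000000000", "example-rg", "examplestorage"
)
command = role_assignment_command(
    "11111111-1111-1111-1111-111111111111",  # object ID of the Purview managed identity
    "Storage Blob Data Reader",              # assumed role; verify for your source
    scope,
)
print(shlex.join(command))
```

Scoping the assignment to a single storage account rather than a whole subscription keeps the managed identity's access limited to the data it needs to scan.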

Next steps

  1. Configure and run data profiling for an asset in your data source.
  2. Set up data quality rules based on the profiling results, and apply them to your data asset.
  3. Configure and run a data quality scan on a data product to assess the quality of all supported assets in the data product.
  4. Review your scan results to evaluate your data product's current data quality.