Data quality for Databricks Unity Catalog databases
To use Unity Catalog, your Azure Databricks workspace must be enabled for Unity Catalog, which means that the workspace is attached to a Unity Catalog metastore. All new workspaces are enabled for Unity Catalog automatically upon creation, but older workspaces might require that an account admin enable Unity Catalog manually. Whether or not your workspace was enabled for Unity Catalog automatically, the following steps are also required to get started with Unity Catalog:
- Create catalogs and schemas to contain database objects like tables and volumes.
- Create managed storage locations to store the managed tables and volumes in these catalogs and schemas.
- Grant user access to catalogs, schemas, and database objects.
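The three steps above correspond to Databricks SQL statements. The following Python sketch only builds the statement strings; it assumes they would then be run in a Databricks notebook or SQL editor by a user with sufficient privileges. The function names and the catalog, schema, group, and storage path values are hypothetical placeholders, not values from this article.

```python
from typing import List, Optional


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def setup_statements(catalog: str, schema: str, principal: str,
                     managed_location: Optional[str] = None) -> List[str]:
    """Build SQL for: create catalog and schema, then grant read access.

    managed_location is an assumed cloud storage path; it is rejected if it
    contains a single quote, because this sketch does no string escaping.
    """
    cat = quote_ident(catalog)
    sch = f"{cat}.{quote_ident(schema)}"
    who = quote_ident(principal)

    create_catalog = f"CREATE CATALOG IF NOT EXISTS {cat}"
    if managed_location:
        if "'" in managed_location:
            raise ValueError("managed_location must not contain a single quote")
        create_catalog += f" MANAGED LOCATION '{managed_location}'"

    return [
        create_catalog,
        f"CREATE SCHEMA IF NOT EXISTS {sch}",
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {sch} TO {who}",
        f"GRANT SELECT ON SCHEMA {sch} TO {who}",
    ]


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for stmt in setup_statements("sales", "raw", "data-stewards"):
        print(stmt + ";")
```

Keeping the statements as data makes them easy to review before anything is executed against the metastore.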
Workspaces that are automatically enabled for Unity Catalog provision a workspace catalog with broad privileges granted to all workspace users. This catalog is a convenient starting point for trying out Unity Catalog.
For detailed setup instructions, see Set up and manage Unity Catalog.
When scanning Azure Databricks Unity Catalog, Microsoft Purview supports:
- Metastore
- Catalogs
- Schemas
- Tables including the columns
- Views including the columns
When setting up a scan, you can choose to scan the entire Unity Catalog or scope the scan to a subset of catalogs.
Configure a Data Map scan to catalog Databricks Unity Catalog data in Microsoft Purview
- Register an Azure Databricks workspace in Microsoft Purview
- Scan registered Azure Databricks workspace
- Enter a name for the scan
- Select Unity Catalog as the extraction method
- Connect via an integration runtime (Azure integration runtime, Managed VNet IR, or a Kubernetes-supported self-hosted integration runtime that you created)
- Select Access Token Authentication while creating a credential. For more information, see Credentials for source authentication in Microsoft Purview.
- Specify the HTTP path of the Databricks SQL Warehouse that Microsoft Purview will connect to in order to perform the scan
- On the Scope your scan page, select the catalogs you want to scan.
- Select a scan rule set for classification. You can choose between the system default, existing custom rule sets, or create a new rule set inline. Check the Classification article to learn more.
- For Scan trigger, choose whether to set up a schedule or run the scan once.
- Review your scan and select Save and Run.
- View your scans and scan runs to confirm that cataloging of your data has completed.
Once scanned, the data assets in Unity Catalog (UC) are available in the data catalog search. For more details about how to connect and manage Azure Databricks Unity Catalog in Microsoft Purview, follow this document.
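The SQL warehouse HTTP path entered during scan setup is easy to mistype. As a minimal sketch, the Python helper below checks a path against the commonly documented shape `/sql/1.0/warehouses/<warehouse-id>` before it is pasted into the scan form. The function name is hypothetical, and the accepted pattern (including the older `/sql/1.0/endpoints/<id>` form) is an assumption; confirm the actual value on the warehouse's connection details page.

```python
import re

# Assumed shape of a SQL warehouse HTTP path: /sql/1.0/warehouses/<id>.
# The older /sql/1.0/endpoints/<id> form is accepted as well.
_HTTP_PATH = re.compile(r"/sql/1\.0/(?:warehouses|endpoints)/([A-Za-z0-9]+)")


def warehouse_id_from_http_path(http_path: str) -> str:
    """Return the warehouse ID, or raise ValueError if the path looks wrong."""
    match = _HTTP_PATH.fullmatch(http_path.strip())
    if match is None:
        raise ValueError(f"Not a SQL warehouse HTTP path: {http_path!r}")
    return match.group(1)


if __name__ == "__main__":
    # Placeholder warehouse ID for illustration only.
    print(warehouse_id_from_http_path("/sql/1.0/warehouses/abc123def456"))
```

A cluster-style path (for example one beginning with `sql/protocolv1/`) is rejected by this check, which helps catch the case where a cluster path is copied instead of a SQL warehouse path.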
Important
- Select Access Token Authentication while creating a credential.
- Store the access token as a secret in your Azure Key Vault and connect the key vault to the connection manager.
- Make sure to grant the product (service) managed identity (MSI) read access to secrets in the key vault.
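A token stored in Azure Key Vault is addressed by a secret identifier built from the vault name, the secret name, and optionally a version. The Python sketch below assembles that identifier so the values can be sanity-checked before they are entered anywhere. It assumes the public Azure cloud DNS suffix `vault.azure.net`; the function name and the vault and secret names are hypothetical.

```python
import re
from typing import Optional

# Key Vault object names: 1-127 characters, letters, digits, and dashes.
_SECRET_NAME = re.compile(r"[A-Za-z0-9-]{1,127}")


def secret_identifier(vault_name: str, secret_name: str,
                      version: Optional[str] = None) -> str:
    """Build the secret URL; omit the version to refer to the latest one."""
    if not _SECRET_NAME.fullmatch(secret_name):
        raise ValueError(f"Invalid secret name: {secret_name!r}")
    # Assumes the public Azure cloud; sovereign clouds use other suffixes.
    url = f"https://{vault_name}.vault.azure.net/secrets/{secret_name}"
    return f"{url}/{version}" if version else url


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(secret_identifier("contoso-kv", "dbx-access-token"))
```

Note that this only formats the identifier. Reading the secret still requires that the calling identity has been granted read access to secrets in the vault.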
Set up a connection to Databricks Unity Catalog for data quality scans
At this point, the scanned asset is ready for cataloging and governance. Associate the scanned asset with a data product in a governance domain. On the Data Quality tab, add a new Azure Databricks connection and enter the database name manually.
Select Data quality > Governance Domain > Manage tab to create a connection.
Configure the connection on the connection page.
- Add a connection name and description
- Select the source type Azure Databricks
- Select the workspace URL
- Select Unity Catalog as the extraction method
- Select the HTTP path
- Select the Unity Catalog name
- Select the schema name
- Select the table name
- Select the authentication method: Access Token
- Add the Azure subscription
- Key vault connection
- Secret name
- Secret version
Test the connection.
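Before testing the connection, it can help to confirm that every field from the list above has a value. The Python sketch below models those fields as a plain dataclass and reports any that are empty. The class and field names are hypothetical and do not correspond to a Microsoft Purview API; treating only the description as optional is also an assumption.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DatabricksDqConnection:
    """Hypothetical checklist of the connection page fields."""
    name: str
    workspace_url: str
    http_path: str
    catalog: str
    schema: str
    table: str
    subscription: str
    key_vault_connection: str
    secret_name: str
    secret_version: str
    description: str = ""  # assumed to be the only optional field
    # Fixed choices for this source, per the steps above.
    source_type: str = "Azure Databricks"
    extraction_method: str = "Unity Catalog"
    auth_method: str = "Access Token"

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are empty or blank."""
        return [
            f.name for f in fields(self)
            if f.name != "description" and not getattr(self, f.name).strip()
        ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    conn = DatabricksDqConnection(
        name="dbx-dq", workspace_url="https://adb-123.4.azuredatabricks.net",
        http_path="", catalog="sales", schema="raw", table="orders",
        subscription="my-subscription", key_vault_connection="contoso-kv",
        secret_name="dbx-access-token", secret_version="abc123",
    )
    print(conn.missing_fields())
```

An empty result from `missing_fields()` means every required field in this checklist has a value; it does not verify that the values are correct, which is what the connection test is for.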
Important
- Data quality stewards need read-only access to Azure Databricks Unity Catalog to set up a data quality connection.
Profiling and data quality scanning for data in Azure Databricks Unity Catalog databases
After you complete the connection setup successfully, you can profile your data, create and apply rules, and run a data quality (DQ) scan of your data in Azure Databricks Unity Catalog databases. Follow the step-by-step guidelines described in the documents below: