When scanning Google BigQuery source, Microsoft Purview supports:
Extracting technical metadata including:
Projects and datasets
Tables including the columns
Views including the columns
Fetching static lineage on assets relationships among tables and views.
When setting up scan, you can choose to scan an entire Google BigQuery project. You can also scope the scan to a subset of datasets matching the given names, or name patterns.
Known limitations
Currently, Microsoft Purview only supports scanning Google BigQuery datasets in US multi-regional location. If the specified dataset is in other location for example, us-east1 or EU, you observe scan completes but no assets shown up in Microsoft Purview.
When object is deleted from the data source, currently the subsequent scan isn't automatically removed the corresponding asset in Microsoft Purview.
Configure datamap scan to catalog Google BigQuery data in Microsoft Purview
Register a Google BigQuery project
Open the Microsoft Purview, and select Data Map on the left navigation.
Select Register.
On Register sources, select Google BigQuery. Select Continue.
Enter a Name that the data source will be listed within the Catalog.
Enter the ProjectID. This should be a fully qualified project ID. For example, mydomain.com: myProject
Select a collection from the list.
Select Register.
Set up a datamap scan for Google BigQuery project
Make sure a self-hosted integration runtime is set up. If it isn't set up, use the steps mentioned in prerequisites
Navigate to Sources.
Select the registered BigQuery project.
Select + New scan.
Provide the below details:
Name: The name of the scan
Connect via integration runtime: Select the configured self-hosted integration runtime
Credential: While configuring BigQuery credential, make sure to:
Select Basic Authentication as the Authentication method
Provide the email ID of the service account in the User name field. For example, xyz@developer.gserviceaccount.com
Follow below steps to generate the private key. Copy the entire JSON key file and store it as the value of a Key Vault secret.
To create a new private key from Google's cloud platform:
In the navigation menu, select IAM (Identity Access Management), and select Admin --> Service Accounts --> Select a project -->
Select the email address of the service account that you want to create a key for.
Select the Keys tab.
Select the Add key drop-down menu, then select Create new key.
Choose JSON format.
Specify the path to the JDBC (Java Database Connectivity) driver location in your machine where self-host integration runtime is running. For example: D:\Drivers\GoogleBigQuery.
Specify a list of BigQuery datasets to import. For example, dataset1;dataset2. When the list is empty, all available datasets are imported.
Maximum memory (in GB) available on your VM (Virtual Machine) to be used by scanning processes. This is dependent on the size of Google BigQuery project to be scanned.
Select Test connection.
Select Continue.
Choose your scan trigger. You can set up a schedule or ran the scan once.
Review your scan and select Save and Run.
Once scanned, the data assets in Google BigQuery project will be available on the Unified Catalog search. For more details about how to connect and manage Google BigQuery in Microsoft Purview, follow this document.
Važno
Deleting your scan does not delete catalog assets created from previous scans.
Set up connection to Google BigQuery project for data quality scan
At this point we have the scanned asset ready for cataloging and governance. Associate the scanned assets to the Data Products in a Governance Domain to set up Data Quality scan.
Select Data quality > Governance Domain > Manage tab to create connection.
Configure connection
Add connection name and description
Select source type Google BigQuery
Add Project ID, Dataset name, and Table name
Select Service account private key
Add Azure subscription
Key vault connection
secret name
secret version
Test connection to make the data source connection is successfully configured.
Važno
Data Quality stewards need read only access to Google BigQuery to setup data quality connection.
vNet and Private end point is not supported for Google BigQuery data source yet for data quality scanning service.
Profiling and data quality scanning for data in Google BigQuery
After completed connection setup successfully, you can profile, create and apply rules, and run Data Quality scan of your data in Google BigQuery. Follow the step-by-step guideline described in below documents:
This training module guides you in how to build a complete master data management and data governance stack end to end with Microsoft Purview and CluedIn. It includes developing golden records, deduplication, data lineage, and data quality strategies.
Demonstrate understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a number of Azure services.
Get an overview of data quality rules in Microsoft Purview Unified Catalog, and how you can use them to increase the quality and trustworthiness of your data.