Data governance best practices

This article describes the need for data governance and shares best practices and strategies you can use to implement data governance across your organization.

Why is data governance important?

Data governance is the oversight that ensures data brings value and supports your business strategy. It encapsulates the policies and practices implemented to securely manage the data assets within an organization. As the amount and complexity of data grow, more and more organizations are looking to data governance to secure the following core business outcomes:

  • Consistent and high data quality as a foundation for analytics and machine learning.
  • Reduced time to insight.
  • Data democratization, that is, enabling everybody in an organization to make data-driven decisions.
  • Support for risk and compliance for industry regulations such as HIPAA, FedRAMP, GDPR, or CCPA.
  • Cost optimization, for example by preventing users from starting up large clusters and by creating guardrails for using expensive GPU instances.
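
As an illustration of the cost guardrails mentioned in the last item, the following Python sketch builds a policy definition in the style of a Databricks cluster policy and checks a requested cluster configuration against it locally. The specific node types, limits, and the `check_request` helper are assumptions made for this example; the checker only mimics the idea of policy enforcement and is not how Azure Databricks evaluates policies.

```python
import json

# Hypothetical policy definition in the style of a Databricks cluster policy.
# The node types and limits below are illustrative assumptions.
policy_definition = {
    # Only allow modest, non-GPU node types.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
    # Cap the size a cluster can autoscale to.
    "autoscale.max_workers": {"type": "range", "maxValue": 8},
    # Force idle clusters to shut down after 30 minutes.
    "autotermination_minutes": {"type": "fixed", "value": 30},
}


def check_request(request, policy):
    """Return a list of human-readable violations (empty if compliant).

    This is a local illustration only, not the platform's own evaluation.
    """
    problems = []
    for attr, rule in policy.items():
        value = request.get(attr)
        kind = rule["type"]
        if kind == "allowlist" and value not in rule["values"]:
            problems.append(f"{attr}={value!r} is not in the allowlist")
        elif kind == "range" and value is not None and value > rule["maxValue"]:
            problems.append(f"{attr}={value} exceeds maxValue {rule['maxValue']}")
        elif kind == "fixed" and value != rule["value"]:
            problems.append(f"{attr} must be {rule['value']}")
    return problems


# The definition is plain JSON, so it can be stored and reviewed like any config.
policy_json = json.dumps(policy_definition, indent=2)

gpu_request = {
    "node_type_id": "Standard_NC6s_v3",  # a GPU instance type
    "autoscale.max_workers": 20,
    "autotermination_minutes": 30,
}
small_request = {
    "node_type_id": "Standard_DS3_v2",
    "autoscale.max_workers": 4,
    "autotermination_minutes": 30,
}
```

In this sketch, `check_request(gpu_request, policy_definition)` reports two violations (the GPU node type and the worker count), while `small_request` passes.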

What does a good data governance solution look like?

Data-driven companies typically build their data architectures for analytics on the lakehouse. A data lakehouse is an architecture that enables efficient and secure data engineering, machine learning, data warehousing, and business intelligence directly on vast amounts of data stored in data lakes. Data governance for a data lakehouse provides the following key capabilities:

  • Unified catalog: A unified catalog stores all your data, ML models, and analytics artifacts, in addition to metadata for each data object. The unified catalog also blends in data from other catalogs such as an existing Hive metastore.
  • Unified data access controls: A single and unified permissions model across all data assets and all clouds. This includes attribute-based access control (ABAC) for personally identifiable information (PII).
  • Data isolation: Data isolation can be achieved at multiple levels (environment, storage location, and data objects of increasing granularity) without losing the ability to manage access and auditing centrally.
  • Data auditing: Data access is centrally audited with alerts and monitoring capabilities to promote accountability.
  • Data quality management: Robust data quality management with built-in quality controls, testing, monitoring, and enforcement to ensure accurate and useful data is available for downstream BI, analytics, and machine learning workloads.
  • Data lineage: End-to-end visibility into how data flows in the lakehouse from source to consumption.
  • Data discovery: Easy data discovery to enable data scientists, data analysts, and data engineers to quickly discover and reference relevant data and accelerate time to value.
  • Data sharing: Data can be shared across clouds and platforms.

Data governance and Azure Databricks

Azure Databricks provides centralized governance for data and AI with Unity Catalog and Delta Sharing.

  • Unity Catalog is a fine-grained governance solution for data and AI on the Databricks Lakehouse. It helps simplify security and governance of your data by providing a central place to administer and audit data access.
  • Delta Sharing is an open protocol developed by Databricks for secure data sharing with other organizations, or with other teams within your organization, regardless of which computing platforms they use.
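
To make the idea of centrally administered, fine-grained access more concrete, here is a minimal Python sketch that builds the Unity Catalog `GRANT` statements a group would typically need to read a single table. The catalog, schema, table, and group names, as well as the `build_read_grants` helper, are hypothetical and exist only for this illustration.

```python
def quote_ident(name):
    """Backtick-quote a SQL identifier, escaping any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_read_grants(catalog, schema, table, principal):
    """Build the GRANT statements that let `principal` read one table.

    Reading a table in Unity Catalog requires USE CATALOG on the catalog,
    USE SCHEMA on the schema, and SELECT on the table itself.
    """
    cat = quote_ident(catalog)
    sch = f"{cat}.{quote_ident(schema)}"
    tbl = f"{sch}.{quote_ident(table)}"
    who = quote_ident(principal)
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {sch} TO {who}",
        f"GRANT SELECT ON TABLE {tbl} TO {who}",
    ]


# Hypothetical names, chosen only for the example.
statements = build_read_grants("main", "sales", "orders", "data-analysts")

# In a Databricks notebook, where `spark` is predefined, each statement
# could then be run with: spark.sql(statement)
```

Generating the statements as plain strings keeps the sketch runnable anywhere; whether you issue grants from notebooks, SQL files, or infrastructure tooling is a design choice for your organization.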

For best practices on adopting Unity Catalog and Delta Sharing, see Unity Catalog best practices.

Legacy data governance solutions

  • Table access control is a legacy data governance model that lets you programmatically grant and revoke access to objects managed by your workspace’s built-in Hive metastore. Databricks recommends that you use Unity Catalog instead of table access control. Unity Catalog simplifies security and governance of your data by providing a central place to administer and audit data access across multiple workspaces in your account.

  • Azure Data Lake Storage credential passthrough (legacy) is also a legacy data governance feature. It allows you to authenticate automatically to Azure Storage from Azure Databricks clusters using the same Azure Active Directory identity that you use to log in to Azure Databricks. Databricks recommends that you use Unity Catalog instead.

Identity configuration

Every good data governance story starts with a strong identity foundation. To learn how to best configure identity in Azure Databricks, see Identity best practices.

Learn more

Here are some resources to help you build a comprehensive data governance solution that meets your organization’s needs:

  • Get started using Unity Catalog, for step-by-step instructions for setting up Unity Catalog for your organization.
  • The Databricks Security and Trust Center, which provides information about the ways in which security is built into every layer of the Databricks Lakehouse Platform.
  • Secret management, for information on how to use Databricks secrets to store your credentials and reference them in notebooks and jobs. You should never hard-code secrets or store them in plain text.
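
As a small illustration of the secret management advice above, the following Python sketch reads a credential at run time through `dbutils.secrets.get` instead of embedding it in the code. The scope name, key name, and the `make_fake_dbutils` stand-in are assumptions for this example; in a Databricks notebook `dbutils` is already available, and the stand-in exists only so the sketch can run elsewhere.

```python
from types import SimpleNamespace


def get_storage_key(dbutils, scope="governance-demo", key="storage-account-key"):
    """Fetch a credential from a secret scope rather than hard-coding it.

    The scope and key names are hypothetical defaults for this example.
    """
    return dbutils.secrets.get(scope=scope, key=key)


def make_fake_dbutils(secrets):
    """Local stand-in that mimics only the dbutils.secrets.get call.

    `secrets` maps (scope, key) tuples to values. Not needed in a notebook.
    """
    def get(scope, key):
        return secrets[(scope, key)]

    return SimpleNamespace(secrets=SimpleNamespace(get=get))


# Outside Databricks, exercise the function with the stand-in.
fake_dbutils = make_fake_dbutils(
    {("governance-demo", "storage-account-key"): "not-a-real-secret"}
)
storage_key = get_storage_key(fake_dbutils)
```

Passing `dbutils` in as a parameter keeps the function testable outside a notebook; inside one, you would call `get_storage_key(dbutils)` directly.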