Azure Databricks architecture in Microsoft Cloud for Sovereignty

Our reference architecture includes the following features:

Alignment with Sovereign Landing Zone

When you deploy Azure Databricks using a medallion architecture, you need to consider enforcement of Azure management policies.

The Sovereign Landing Zone, which is a variant of the Azure Landing Zone, has four potential management groups for application workloads. You can place subscriptions under Confidential Corp, Confidential Online, Corp, or Online.

For an Azure Databricks deployment, you need to create a Data Landing Zone subscription under the Corp management group. The Sovereignty Baseline policy doesn't allow you to deploy Azure Databricks to the Confidential Corp and Confidential Online management groups. Azure Databricks isn't on the list of services allowed in those management groups, because they require all parts of a service to use Azure Confidential Computing.
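The placement rule can be expressed as a small pre-deployment check. The following is a minimal sketch and not part of any Azure SDK or policy definition: the function name, constants, and return values are hypothetical. The only facts it encodes are the four management groups and the restriction described in this section.

```python
# Hypothetical pre-deployment check. These names are illustrative only;
# they are not an Azure API or an actual policy definition.

# The four management groups available for application workloads.
WORKLOAD_MANAGEMENT_GROUPS = {
    "Confidential Corp",
    "Confidential Online",
    "Corp",
    "Online",
}

# Groups that require every part of a service to use
# Azure Confidential Computing, which blocks Azure Databricks.
CONFIDENTIAL_GROUPS = {"Confidential Corp", "Confidential Online"}

# The group this reference architecture uses for the Data Landing Zone.
RECOMMENDED_GROUP = "Corp"


def check_databricks_placement(management_group: str) -> str:
    """Classify a target management group for an Azure Databricks deployment."""
    if management_group not in WORKLOAD_MANAGEMENT_GROUPS:
        raise ValueError(f"Unknown workload management group: {management_group!r}")
    if management_group in CONFIDENTIAL_GROUPS:
        return "blocked"
    if management_group == RECOMMENDED_GROUP:
        return "recommended"
    # This section gives no guidance for the remaining group.
    return "not covered"
```

For example, `check_databricks_placement("Confidential Corp")` returns `"blocked"`, while `check_databricks_placement("Corp")` returns `"recommended"`. A check like this could run in a deployment pipeline before any resources are created, but the actual enforcement still comes from the policies assigned to the management groups.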

Screenshot of the Azure Databricks data landing zone.

Design principles

This reference architecture focuses on Azure Databricks Lakehouse best practices for the networking, encryption, security, storage, and compute layers. These practices can help government and public sector organizations choose the relevant configuration options depending on their risk appetite.

Architecture dataflow

The following diagram shows the architecture components that are critical to deploy the Azure Databricks architecture within either a Sovereign Landing Zone or Azure Landing Zone, both with Sovereignty Baseline policies enabled.

The image shows Azure Databricks reference architecture data flow.

The key stages of the dataflow are as follows:

Ingest

Azure Event Hubs is a big-data streaming platform. As a platform as a service (PaaS), this event ingestion service is fully managed. Azure Event Hubs ingests raw streaming data into Azure Databricks.

Azure Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution to create, schedule, and orchestrate data transformation workflows. Azure Data Factory loads raw batch data into Azure Data Lake Storage Gen2.

The analytical platform ingests data from the disparate batch and streaming sources. Data scientists use this data for data preparation and exploration, and model preparation and training.

Process

Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from multiple sources. Azure Databricks cleans and transforms unstructured data sets. It combines processed data with structured data from operational databases or data warehouses. Azure Databricks also trains and deploys scalable machine learning and deep learning models.

Azure Databricks Serverless SQL warehouses provide on-demand elastic compute services used to run SQL commands on data objects in the SQL editor or interactive notebooks. You can create SQL warehouses using the UI, CLI, or REST API.
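As an illustration of the REST API route, the following sketch builds (but doesn't send) a request to create a serverless SQL warehouse. It assumes the Databricks SQL Warehouses endpoint `POST /api/2.0/sql/warehouses` and the request fields shown in the payload. The helper name, workspace URL, token, and sizing values are placeholders, so verify the field names and allowed values against the current API reference before use.

```python
import json
import urllib.request


def build_create_warehouse_request(
    workspace_url: str, token: str, name: str
) -> urllib.request.Request:
    """Build a request that would create a serverless SQL warehouse.

    Hypothetical helper: the sizing values below are illustrative
    assumptions, not recommendations from this architecture.
    """
    payload = {
        "name": name,
        "cluster_size": "Small",            # assumed size for the example
        "auto_stop_mins": 10,               # assumed idle timeout
        "enable_serverless_compute": True,  # request serverless compute
        "warehouse_type": "PRO",            # assumed to be required for serverless
    }
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/sql/warehouses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the request. Building the request separately from sending it keeps the payload easy to inspect and review, which can be useful where deployments are subject to policy checks.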

Services that work with the data connect to a single underlying data source to ensure consistency. For instance, you can run SQL queries on the data lake with Azure Databricks SQL Serverless service. This service:

  • Provides a query editor and catalog, the query history, basic dashboarding, and alerting.

  • Uses integrated security that includes row-level and column-level permissions.

  • Uses a Photon-powered Delta Engine to accelerate performance.

Store

Azure Databricks works well with a medallion architecture, which organizes data into the following layers:

  • Bronze: Holds raw data and history

  • Silver: Contains cleaned, filtered, and augmented data

  • Gold: Stores aggregated data used for business analytics

Azure Data Lake Storage Gen2 is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data can be structured, semi-structured, or unstructured, and typically comes from multiple heterogeneous sources such as logs, files, and media. This architecture uses two Data Lake Storage Gen2 instances, which store both batch and streaming data:

  • Data Lake Storage 1 houses the Bronze layer

  • Data Lake Storage 2 houses the Silver and Gold layers
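The Bronze, Silver, and Gold layering can be illustrated with a small, self-contained sketch. It uses plain Python lists in place of Delta tables, and the sample records, field names, and function names are all hypothetical. In an actual deployment, these steps would typically run as Spark jobs in Azure Databricks that read from and write to the two data lakes.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as ingested, including bad rows.
bronze = [
    {"sensor": "a", "reading": "21.5"},
    {"sensor": "a", "reading": "22.5"},
    {"sensor": "b", "reading": "bad-value"},
    {"sensor": "b", "reading": "18.0"},
    {"sensor": None, "reading": "19.0"},
]


def to_silver(rows):
    """Clean and filter: drop rows without a sensor or a numeric reading."""
    silver = []
    for row in rows:
        if row["sensor"] is None:
            continue
        try:
            reading = float(row["reading"])
        except ValueError:
            continue
        silver.append({"sensor": row["sensor"], "reading": reading})
    return silver


def to_gold(rows):
    """Aggregate: compute the average reading per sensor."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["sensor"]].append(row["reading"])
    return {sensor: sum(vals) / len(vals) for sensor, vals in readings.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
```

The Bronze list is never modified, so the raw history stays available for reprocessing if the cleaning rules change. Only the derived Silver and Gold results are recomputed.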

Delta Lake is a storage layer that uses an open file format. It runs on top of cloud storage such as Data Lake Storage Gen2. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data.

Serve

Microsoft Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis. Power BI generates analytical and historical reports and dashboards from the unified data platform. Microsoft Power BI has a built-in Azure Databricks connector for visualizing the underlying data.

Monitor and govern

The architecture uses various Azure services for collaboration, performance, reliability, governance, and security:

Microsoft Purview manages on-premises, multicloud, and software as a service (SaaS) data. This governance service maintains data landscape maps and provides automated data discovery, sensitive data classification, data lineage, and governance insights across the data estate.

Azure Databricks Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.

Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments for building, deploying, and collaborating on applications. Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.

Azure Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.

Microsoft Entra ID offers cloud-based identity and access management services and provides a way for users to sign in and access resources. Microsoft Entra ID provides single sign-on (SSO) for Azure Databricks users. Azure Databricks supports automated user provisioning with Microsoft Entra ID for creating new users, assigning each user an access level, and removing users to revoke their access.

Azure Monitor collects and analyzes data on environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs. By proactively identifying problems, this service maximizes performance and reliability.

Microsoft Cost Management provides financial governance services for Azure workloads. It helps you manage cloud spending. By using budgets and recommendations, this service helps you organize expenses and reduce costs.

Azure Private Link provides private connectivity between users and their Azure Databricks workspaces. It also provides private connectivity from clusters on the classic compute plane to the core services on the control plane within the Azure Databricks workspace infrastructure, and to Azure Data Lake Storage.

Note

Check for any additional policies that might affect the deployment of services. Deployment can also fail if you don't have permission to create Azure Private Link resources.