DR for Azure Data Platform - Architecture

Azure Synapse Analytics
Azure Machine Learning
Azure Cosmos DB
Azure Data Lake
Azure Event Hubs

Use case definition

To support this worked example, the fictitious firm “Contoso” will be used with an Azure Data Platform based upon Microsoft Reference Architectures.

Data Service - Component View

Contoso has implemented the following foundational Azure structure, which is a subset of the Enterprise Landing Zone.

Diagram that shows an example Enterprise Azure landing zone.

The numbers in the following descriptions correspond to the preceding diagram.

Contoso’s Azure Foundations - Workflow

  1. Enterprise Enrollment - Contoso’s top parent enterprise enrollment within Azure reflecting its commercial agreement with Microsoft, its organizational account structure and available Azure subscriptions. It provides the billing foundation for subscriptions and how the digital estate is administered
  2. Identity and Access Management – The components required to provide identity, authentication, resource access and authorization services across Contoso’s Azure footprint
  3. Management Group and Subscription Organization - A scalable group hierarchy aligned to the data platform’s core capabilities, allowing operationalization at scale using centrally managed security and governance where workloads have clear separation. Management groups provide a governance scope above subscriptions
  4. Management Subscription - A dedicated subscription for the various management-level functions required to support the data platform
  5. Connectivity Subscription - A dedicated subscription for the connectivity functions of the data platform enabling it to identify named services, determine secure routing and communication across and between internal and external services
  6. Landing Zone Subscription – One-to-many subscriptions for Azure native, online applications, internal and external facing workloads and resources
  7. DevOps Platform - The DevOps Platform that supports the Azure foundation & Data Platform. This platform contains the code base source control repository and CI/CD pipelines enabling automated deployments of IaC

Note

Many customers still retain a large IaaS footprint. To provide recovery capabilities across IaaS, the key component to add is Azure Site Recovery. Site Recovery orchestrates and automates the replication of Azure VMs between regions, of on-premises virtual machines and physical servers to Azure, and of on-premises machines to a secondary datacenter.

Within this foundational structure, Contoso has implemented the following elements to support its enterprise business intelligence needs, aligned to the guidance in Analytics end-to-end with Azure Synapse.

Diagram that shows architecture for a modern data platform using Azure data services. Contoso's data platform

Contoso’s Data Platform - Workflow

The workflow is read left to right, following the flow of data:

  • Data Sources - The sources or types of data that the data platform can consume from
  • Ingest - The Platform’s capability to ingest data from various sources of varying structure and speed. This design reflects a Lambda architecture
  • Store - The capability to securely store data at scale that has been ingested onto the platform
  • Process - The Platform’s capability to process data, making it “fit for purpose” for downstream processes like cleansing, standardizing and modeling. The pre-processing of data typically ensures that it's in a “position and a condition, ready for use”
  • Enrich - The capability to enhance data processed on the platform via statistical, Machine Learning or other modeling techniques or prebuilt Azure AI Services
  • Serve - The Platform’s capability to shape and present data for downstream consumption
  • Data Consumers - The individuals, applications or downstream processes that consume data from the platform’s various serving touchpoints
  • Discover and Govern - The Platform’s capabilities to govern the data it contains and ensure it's indexed, discoverable/searchable, well-described, with full lineage and is transparent to its end users and consuming processes.
  • Platform - The foundation upon which the platform is built, that is, Contoso’s Azure Foundations as described above.

Note

For many customers, the conceptual level of the Data Platform reference architecture used will align, but the physical implementation may vary. For example, ELT (extract, load, transform) processes may be performed through Azure Data Factory, and data modeling by Azure SQL server. To address this concern, the Stateless vs Stateful section below will provide guidance.

For the Data Platform, Contoso has selected the lowest recommended production service tiers for all components and has chosen to adopt a “Redeploy on Disaster” DR strategy based upon an operating cost-minimization approach.

The following sections will provide a baseline understanding of the DR process and levers available to customers to uplift this posture.

Azure service and component view

The following tables present a breakdown of each Azure service and component used across the Contoso – Data platform, with options for DR uplift.

Note

The sections below are organized by stateful vs stateless services.

Stateful Foundational Components

  • Microsoft Entra ID including role entitlements

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Premium P1
    • DR Uplift options: Microsoft Entra ID’s resiliency is part of its SaaS offering
  • Azure Key Vault

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Recovery Services Vault

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Default (GRS)
    • DR Uplift options: Enabling Cross Region Restore creates data restoration in the secondary, paired region
    • Notes
      • LRS and ZRS are also available, but selecting them requires changing the default setting
  • Azure DevOps

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: DevOps Services
    • DR Uplift options: DevOps service and data resiliency is part of its SaaS offering
    • Notes
      • DevOps Server as the on-premises offering will remain the customer’s responsibility for disaster recovery
      • If third-party services (for example, SonarCloud, JFrog Artifactory, or Jenkins build servers) are used, they'll remain the customer’s responsibility for recovery from a disaster
      • If IaaS VMs are used within the DevOps toolchain, they'll remain the customer’s responsibility for recovery from a disaster

Stateless Foundational Components

  • Subscriptions

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Management Groups

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Azure Monitor

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Cost Management

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Microsoft Defender for Cloud

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Azure DNS

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Single Zone - Public
    • DR Uplift options: N/A, DNS is highly available by design
  • Network Watcher

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A, Covered as part of the Azure Service
  • Virtual Networks, including Subnets, UDR & NSGs

    • Component Recovery Responsibility: Contoso
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: N/A
    • DR Uplift options: VNETs can be replicated into the secondary, paired region
  • Azure Firewall

    • Component Recovery Responsibility: Contoso
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Standard
    • DR Uplift options: Azure Firewall is highly available by design and can be created with Availability Zones for increased availability
  • Azure DDoS

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: DDoS Network Protection
    • DR Uplift options: N/A, covered as part of the Azure service
  • ExpressRoute Circuit

    • Component Recovery Responsibility: Contoso, connectivity partner and Microsoft
    • Workload/Configuration Recovery Responsibility: Connectivity partner and Microsoft
    • Contoso SKU selection: Standard
    • DR Uplift options:
    • Notes
      • ExpressRoute has built-in redundancy, with each circuit consisting of two connections to two Microsoft Enterprise edge routers (MSEEs) at an ExpressRoute Location from the connectivity provider/client's network edge
      • ExpressRoute premium circuit will enable access to all Azure regions globally
  • VPN Gateway

    • Component Recovery Responsibility: Contoso
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Single Zone - VpnGw1
    • DR Uplift options: A VPN Gateway can be deployed into an Availability Zone with the VpnGw#AZ SKUs to provide a zone redundant service
  • Azure Load Balancer

    • Component Recovery Responsibility: Contoso
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Standard
    • DR Uplift options:
    • Notes
      • Azure Traffic Manager is a DNS-based traffic load balancer. This service supports the distribution of traffic for public-facing applications across the global Azure regions. This solution will provide protection from a regional outage within a high availability design
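
The responsibility split in the foundational component lists above can be captured as data, which makes it easy to see what Contoso itself has to recover in a "Redeploy on Disaster" event. The following is a minimal sketch, not part of the reference architecture: the `Component` type, the `INVENTORY` list and the `customer_recovery_scope` function are hypothetical names, and only a handful of entries are transcribed from the lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    stateful: bool
    component_owner: str  # who recovers the service itself
    workload_owner: str   # who recovers the workload/configuration
    sku: str


# A few entries transcribed from the lists above (illustrative subset only).
INVENTORY = [
    Component("Microsoft Entra ID", True, "Microsoft", "Microsoft", "Premium P1"),
    Component("Recovery Services Vault", True, "Microsoft", "Microsoft", "Default (GRS)"),
    Component("Azure DNS", False, "Microsoft", "Microsoft", "Single Zone - Public"),
    Component("Virtual Networks", False, "Contoso", "Contoso", "N/A"),
    Component("Azure Firewall", False, "Contoso", "Contoso", "Standard"),
    Component("Azure DDoS", False, "Microsoft", "Contoso", "DDoS Network Protection"),
    Component("VPN Gateway", False, "Contoso", "Contoso", "Single Zone - VpnGw1"),
]


def customer_recovery_scope(inventory, customer="Contoso"):
    """Return names of components where the customer owns any part of recovery."""
    return [
        c.name
        for c in inventory
        if customer in (c.component_owner, c.workload_owner)
    ]


if __name__ == "__main__":
    for name in customer_recovery_scope(INVENTORY):
        print(name)
```

Keeping the inventory in source control alongside the IaC would let the same list drive a recovery runbook, although that is a design choice rather than something this guidance prescribes.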

Stateful Data platform-specific services

  • Storage Account: Azure Data Lake Gen2

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: LRS
    • DR Uplift options: Storage Accounts have a broad range of data redundancy options from primary region redundancy up to secondary region redundancy
    • Notes
      • GRS is recommended to uplift redundancy, providing a copy of the data in the paired region
  • Azure Event Hubs

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Standard
    • DR Uplift options: An event hub namespace can be created with availability zones enabled. This resiliency can be extended to cover a full region outage with Geo-disaster recovery
    • Notes
      • By design, Event Hubs geo-disaster recovery doesn't replicate data, so there are several considerations to keep in mind for failover and failback
  • Azure IoT Hubs

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Standard
    • DR Uplift options:
    • Notes
      • IoT Hub provides Microsoft-Initiated Failover and Manual Failover by replicating data to the paired region for each IoT hub
      • IoT Hub provides Intra-Region HA and will automatically use an availability zone if created in a predefined set of Azure regions
  • Azure Stream Analytics

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Standard
    • DR Uplift options: While Azure Stream Analytics is a fully managed PaaS offering, it doesn't provide automatic geo-failover. Geo-redundancy can be achieved by deploying identical Stream Analytics jobs in multiple Azure regions
  • Azure Machine Learning

  • Power BI

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Power BI Pro
    • DR Uplift options: N/A, Power BI’s resiliency is part of its SaaS offering
  • Azure Cosmos DB

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Single Region Write with Periodic backup
    • DR Uplift options:
      • Single-region accounts may lose availability following a regional outage. Resiliency can be uplifted by configuring a single write region with at least one additional (read) region and enabling Service-Managed failover
      • It's recommended that Azure Cosmos DB accounts used for production workloads enable automatic failover. Without this configuration, the account will lose write availability for the entire duration of the write region outage, as manual failover won't succeed due to lack of region connectivity
  • Azure Data Share

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: N/A
    • DR Uplift options: Azure Data Share’s resiliency can be uplifted by HA deployment into a secondary region
  • Microsoft Purview

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: N/A
    • DR Uplift options: N/A

Stateless Data platform-specific services

  • Azure Synapse: Pipelines

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Compute Optimized Gen2
    • DR Uplift options: N/A, Synapse resiliency is part of its SaaS offering using the automatic failover feature
    • Notes
      • If Self-Hosted Data Pipelines are used, they'll remain the customer’s responsibility for recovery from a disaster
  • Azure Synapse: Data Explorer Pools

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Compute Optimized, Small (4 cores)
    • DR Uplift options: N/A, Synapse resiliency is part of its SaaS offering
  • Azure Synapse: Spark Pools

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Compute Optimized, Small (4 cores)
    • DR Uplift options: N/A, Synapse resiliency is part of its SaaS offering
  • Azure Synapse: Serverless and Dedicated SQL Pools

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Contoso
    • Contoso SKU selection: Compute Optimized Gen2
    • DR Uplift options: N/A, Synapse resiliency is part of its SaaS offering
    • Notes
      • Azure Synapse Analytics automatically takes snapshots throughout the day to create restore points that are available for seven days
      • Azure Synapse Analytics performs a standard geo-backup once per day to a paired data center. The RPO for a geo-restore is 24 hours
      • If Self-Hosted Data Pipelines are used, they'll remain the customer’s responsibility for recovery from a disaster
  • Azure AI services (formerly Cognitive Services)

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Pay As You Go
    • DR Uplift options: N/A, the APIs for AI services are hosted by Microsoft-managed data centers
    • Notes
      • If AI services have been deployed via customer-deployed Docker containers, recovery remains the responsibility of the customer
  • Azure AI Search (formerly Cognitive Search)

    • Component Recovery Responsibility: Microsoft
    • Workload/Configuration Recovery Responsibility: Microsoft
    • Contoso SKU selection: Standard S1
    • DR Uplift options:
    • Notes
      • In AI Search, business continuity and disaster recovery are achieved through multiple AI Search services
      • There's no built-in mechanism for disaster recovery. If continuous service is required during a catastrophic failure, the recommendation is to have a second service in a different region and to implement a geo-replication strategy so that indexes are fully redundant across all services
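
The Synapse SQL pool notes above quote two figures: restore points that remain available for seven days, and a daily geo-backup with a 24-hour RPO. The sketch below turns those figures into a quick check of how much data a geo-restore would lose and whether a given restore point is still usable. It assumes nothing beyond those two figures; the function names and any timestamps passed in are hypothetical.

```python
from datetime import datetime, timedelta

RESTORE_POINT_RETENTION = timedelta(days=7)  # from the notes above
GEO_BACKUP_RPO = timedelta(hours=24)         # from the notes above


def data_loss_window(last_geo_backup: datetime, outage_start: datetime) -> timedelta:
    """Data written after the last geo-backup is lost on a geo-restore."""
    if outage_start < last_geo_backup:
        raise ValueError("outage_start precedes last_geo_backup")
    return outage_start - last_geo_backup


def restore_point_available(restore_point: datetime, now: datetime) -> bool:
    """A restore point is usable only while it is inside the retention period."""
    return timedelta(0) <= now - restore_point <= RESTORE_POINT_RETENTION


if __name__ == "__main__":
    window = data_loss_window(datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 20, 30))
    print(f"Data to replay after geo-restore: {window}")
    print(f"Within stated RPO: {window <= GEO_BACKUP_RPO}")
```

The loss window is the period that ingestion processes would need to replay after a geo-restore, so it is worth comparing against the business's tolerated RPO before settling on the default backup cadence.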

Stateful vs Stateless Components

The speed of innovation across the Microsoft product suite and Azure, in particular, means the component set that we've used for this worked example will quickly evolve. To future-proof against providing stale guidance and extend this guidance to components not explicitly covered in this document, the section below provides some instruction based upon the coarse-grain classification of state.

A component/service can be described as stateful if it's designed to remember preceding events or user interactions. Stateless means there's no record of previous interactions, and each request has to be handled based entirely on the information that comes with it.
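
The distinction can be illustrated with a small, generic example that is not tied to any Azure service. The names `to_celsius` and `RunningAverage` are invented for this sketch.

```python
def to_celsius(fahrenheit: float) -> float:
    """Stateless: the result depends only on the data in the request."""
    return (fahrenheit - 32) * 5 / 9


class RunningAverage:
    """Stateful: each result depends on every value seen before it."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def add(self, value: float) -> float:
        self._total += value
        self._count += 1
        return self._total / self._count
```

Redeploying `to_celsius` from source control loses nothing, whereas a freshly redeployed `RunningAverage` starts from zero until its earlier state is restored. That difference is what drives the two recovery paths described in this section.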

For a DR scenario that calls for redeployment:

  • Components/services that are “stateless”, like Azure Functions and Azure Data Factory pipelines, can be redeployed from source control with at least a smoke test to validate availability before being introduced into the broader system
  • Components/services that are “stateful”, like Azure SQL Database and storage accounts, require more attention
    • When procuring the component, a key decision will be selecting the data redundancy feature. This decision typically focuses on a trade-off between availability and durability with operating costs
    • Datastores will also need a data backup strategy. The data redundancy functionality of the underlying storage mitigates this risk for some designs, while others, like SQL databases, will need a separate backup process
    • If necessary, the component can be redeployed from source control with a validated configuration via a smoke-test
    • A redeployed datastore must have its dataset rehydrated. Rehydration can be accomplished through data redundancy (when available) or a backup dataset. When rehydration has been completed, it must be validated for accuracy and completeness
      • Depending on the nature of the backup process, the backup datasets may require validation before being applied. Backup process corruption/error may result in an earlier backup being used in place of the latest version available
    • Any delta between the component date/timestamp and the current date should be addressed by re-executing or replaying the data ingestion processes from that point forward
    • Once the component's dataset is up to date, it can be introduced into the broader system
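
The redeployment steps above can be sketched as a small orchestration routine. This is an illustrative outline only: every callable passed in (`redeploy`, `smoke_test`, `is_valid`, `rehydrate`, `dataset_is_complete`, `replay_from`) is a hypothetical hook that a team would implement against its own pipelines and datastores, and backups are represented as plain strings for simplicity.

```python
from typing import Callable, Sequence


def recover_stateless(redeploy: Callable[[], None],
                      smoke_test: Callable[[], bool]) -> None:
    """Redeploy from source control, then smoke test before rejoining the system."""
    redeploy()
    if not smoke_test():
        raise RuntimeError("smoke test failed; do not reintroduce the component")


def pick_backup(backups_newest_first: Sequence[str],
                is_valid: Callable[[str], bool]) -> str:
    """Return the newest backup that passes validation, falling back to older ones."""
    for backup in backups_newest_first:
        if is_valid(backup):
            return backup
    raise RuntimeError("no valid backup available")


def recover_stateful(redeploy: Callable[[], None],
                     smoke_test: Callable[[], bool],
                     backups_newest_first: Sequence[str],
                     is_valid: Callable[[str], bool],
                     rehydrate: Callable[[str], None],
                     dataset_is_complete: Callable[[], bool],
                     replay_from: Callable[[str], None]) -> str:
    """Redeploy, rehydrate from the newest valid backup, then replay the delta."""
    recover_stateless(redeploy, smoke_test)
    backup = pick_backup(backups_newest_first, is_valid)
    rehydrate(backup)
    if not dataset_is_complete():
        raise RuntimeError("rehydrated dataset failed validation")
    # Re-run ingestion from the chosen backup's point in time forward.
    replay_from(backup)
    return backup
```

Passing the steps in as callables keeps the ordering explicit (redeploy, smoke test, rehydrate, validate, replay) while leaving the service-specific work to each team. If the latest backup fails validation, `pick_backup` falls back to an earlier one, which widens the delta that `replay_from` has to cover.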

Other key services

This section contains HA/DR guidance for other key Azure Data components and services.

Next steps

Now that you've learned about the scenario's architecture, you can learn about the scenario details.