This reference architecture illustrates how to design infrastructure for highly available virtualized and containerized workloads in Remote Office/Branch Office (ROBO) scenarios.
The architecture incorporates the following capabilities:
- Azure Stack HCI (20H2). Azure Stack HCI is a hyper-converged infrastructure (HCI) cluster solution that hosts virtualized Windows and Linux workloads and their storage in a hybrid on-premises environment. The cluster can consist of between two and 16 physical nodes.
- File share witness. A file share witness is a Server Message Block (SMB) share that Failover Clustering uses as a vote in the cluster quorum. Starting with Windows Server 2019, it's possible to use a USB drive connected to a router for this purpose.
- Azure Arc. A cloud-based service that extends the Azure Resource Manager–based management model to non-Azure resources including virtual machines (VMs), Kubernetes clusters, and containerized databases.
- Azure Policy. A cloud-based service that evaluates Azure and on-premises resources through integration with Azure Arc by comparing properties to customizable business rules.
- Azure Monitor. A cloud-based service that maximizes the availability and performance of your applications and services by delivering a comprehensive solution for collecting, analyzing, and acting on telemetry from your cloud and on-premises environments.
- Microsoft Defender for Cloud. Microsoft Defender for Cloud is a unified infrastructure security management system that strengthens the security posture of your datacenters and provides advanced threat protection across your hybrid workloads, both in the cloud (whether they're in Azure or not) and on-premises.
- Azure Automation. Azure Automation delivers a cloud-based automation and configuration service that supports consistent management across your Azure and non-Azure environments.
- Change Tracking and Inventory. A feature of Azure Automation that tracks changes in Windows Server and Linux servers hosted in Azure, on-premises, and other cloud environments to help you pinpoint operational and environmental issues with software managed by the Distribution Package Manager.
- Update Management. A feature of Azure Automation that streamlines management of OS updates for Windows Server and Linux machines in Azure, in on-premises environments, and in other cloud environments.
- Azure Backup. The Azure Backup service provides simple, secure, and cost-effective solutions to back up your data and recover it from the Microsoft Azure cloud.
- Azure Site Recovery. A cloud-based service that helps ensure business continuity by keeping business apps and workloads running during outages. Site Recovery manages replication and failover of workloads running on both physical and virtual machines between their primary site and a secondary location.
- Azure File Sync. A cloud-based service that can synchronize and cache content of Azure file shares, by using Windows Servers across your Azure and non-Azure environments.
- Storage Replica. A Windows Server technology that enables replication of volumes between servers or clusters for disaster recovery.
Key technologies used to implement this architecture:
- Azure Site Recovery
- Azure Arc
- Azure Backup
- Azure Container Registry
- Azure Files
- Azure Monitor
- Azure Policy
- Microsoft Defender for Cloud
Potential use cases
Typical uses for this architecture include the following Remote Office/Branch Office (ROBO) scenarios:
- Implement highly available, container-based edge workloads and virtualized, business-essential applications in a cost-effective manner.
- Lower total cost of ownership (TCO) through Microsoft-certified solutions, cloud-based automation, centralized management, and centralized monitoring.
- Control and audit security and compliance by using virtualization-based protection, certified hardware, and cloud-based services.
The following recommendations apply for most scenarios. Follow these recommendations unless you have a specific requirement that overrides them.
Use Azure Stack HCI switchless interconnect and lightweight quorum for highly available and cost-effective ROBO infrastructure.
In ROBO scenarios, a primary business concern is minimizing costs. Yet many ROBO workloads are critical and have very little tolerance for downtime. Azure Stack HCI addresses both needs by combining resiliency with cost-effectiveness. Using Azure Stack HCI, you can apply the built-in resiliency of Storage Spaces Direct and Failover Clustering technologies to implement highly available compute, storage, and network infrastructure for containerized and virtualized ROBO workloads. For cost-effectiveness, you can use as few as two cluster nodes with only four disks and 64 gigabytes (GB) of memory per node. To further minimize costs, you can use switchless interconnects between nodes, thereby eliminating the need for redundant switch devices. To finalize cluster configuration, you can implement a file share witness simply by using a USB drive connected to a router that hosts uplinks from cluster nodes. For maximum resiliency on a 2-node cluster, you can configure Storage Spaces Direct volumes with either nested two-way mirror or nested mirror accelerated parity. Unlike traditional two-way mirroring, these options tolerate multiple simultaneous hardware failures without data loss.
With nested resiliency, a 2-node cluster and all of its volumes will remain online following a failure of a single node and a single disk on the surviving node.
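The difference between traditional and nested two-way mirroring can be illustrated with a small model. The following Python sketch is a simplification, not a description of how Storage Spaces Direct is implemented: it assumes a traditional two-way mirror keeps one copy of the data on each node, a nested two-way mirror keeps two copies on each node, and every failed disk destroys one distinct copy (the worst case). The `MirrorLayout` class and its method names are hypothetical. Nested mirror accelerated parity is left out because its efficiency depends on the drive configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorLayout:
    """Simplified model of a mirrored volume on a small cluster (illustrative only)."""
    name: str
    copies_per_node: int  # copies of each piece of data kept on every node
    nodes: int = 2

    @property
    def total_copies(self) -> int:
        return self.copies_per_node * self.nodes

    @property
    def capacity_efficiency(self) -> float:
        # Usable capacity as a fraction of raw capacity.
        return 1 / self.total_copies

    def survives(self, failed_nodes: int, failed_disks: int) -> bool:
        """True if at least one copy remains after the given failures.

        Worst case is assumed: every failed disk sits on a surviving node
        and held a distinct copy of the data.
        """
        remaining = self.copies_per_node * (self.nodes - failed_nodes) - failed_disks
        return remaining >= 1


two_way = MirrorLayout("two-way mirror", copies_per_node=1)
nested = MirrorLayout("nested two-way mirror", copies_per_node=2)

for layout in (two_way, nested):
    print(layout.name, layout.capacity_efficiency, layout.survives(1, 1))
```

Under these assumptions, the nested layout trades half of the usable capacity of a traditional two-way mirror for the ability to keep data available after a node failure followed by a disk failure on the surviving node.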
Fully integrate Azure Stack HCI deployments with Azure to minimize TCO in ROBO scenarios.
As part of the Azure Stack product family, Azure Stack HCI is inherently dependent on Azure. Therefore, to optimize features and support, you must register it within 30 days of deploying your first Azure Stack HCI cluster. This process generates a corresponding Azure Resource Manager resource, which effectively extends the Azure management plane to Azure Stack HCI and automatically enables Azure portal-based monitoring, support, and billing functionality.
To minimize Azure Stack HCI cluster and workload management overhead, you should also consider using the following Azure services, which provide the following capabilities:
- Azure Monitor. Collects telemetry generated by clusters and their VMs for monitoring, analytics, and alerting.
- Azure Automation, Update Management feature. Use for Azure Stack HCI VM automated patch deployment and reporting.
- Azure Automation, Change Tracking and Inventory feature. Track Azure Stack HCI VM configuration changes.
- Azure Automation DSC. Automate a desired state configuration of Azure Stack HCI VMs.
- Azure Backup. Manage the backup of Azure Stack HCI VMs and their workloads.
- Azure Site Recovery. Implement and orchestrate disaster recovery for Azure Stack HCI VMs.
- Azure File Sync. Synchronize and tier file shares that are hosted on Azure Stack HCI clusters.
- Azure Kubernetes Service (AKS). Implement container orchestration.
To further benefit from Azure capabilities, you can extend the scope of Azure Arc integration to the Azure Stack HCI virtualized and containerized workloads, by implementing the following functionality:
- Azure Arc enabled servers. Use for virtualized workloads that run on Azure Stack HCI VMs.
- Azure Arc enabled data services. Use for containerized Azure SQL Managed Instance or PostgreSQL Hyperscale that's running on AKS and hosted by Azure Stack HCI VMs.
AKS on Azure Stack HCI and Azure Arc enabled data services are in preview at the time of publishing this reference architecture.
With the scope of Azure Arc extended to Azure Stack HCI VMs, you'll be able to automate their configuration by using Azure VM extensions and evaluate their compliance with industry regulations and corporate standards by using Azure Policy.
Leverage Azure Stack HCI virtualization-based protection, certified hardware, and cloud-based services to enhance security and compliance posture in ROBO scenarios.
ROBO scenarios present unique security and compliance challenges. With limited or no local IT support and no dedicated datacenters, it's particularly important to protect ROBO workloads from both internal and external threats. Azure Stack HCI's capabilities and its integration with Azure services can address this problem.
Azure Stack HCI–certified hardware ensures built-in Secure Boot, Unified Extensible Firmware Interface (UEFI), and Trusted Platform Module (TPM) support. These technologies, combined with virtualization-based security (VBS), help protect security-sensitive workloads. BitLocker Drive Encryption allows you to encrypt Storage Spaces Direct volumes at rest while SMB encryption provides automatic encryption in transit, facilitating compliance with standards such as Federal Information Processing Standard 140-2 (FIPS 140-2) and Health Insurance Portability and Accountability Act (HIPAA).
In addition, you can onboard Azure Stack HCI VMs in Microsoft Defender for Cloud to activate cloud-based behavioral analytics, threat detection and remediation, alerting, and reporting. Similarly, by onboarding Azure Stack HCI VMs in Azure Arc, you gain the ability to use Azure Policy to evaluate their compliance with industry regulations and corporate standards.
The Microsoft Azure Well-Architected Framework is a set of guiding tenets that are followed in this reference architecture. The following considerations are framed in the context of these tenets.
Reliability ensures your application can meet the commitments you make to your customers. For more information, see Overview of the reliability pillar.
Reliability considerations include:
- Improved Storage Spaces Direct volume repair speed (also referred to as resync). Storage Spaces Direct provides automatic resync following events that affect availability of storage pool disks, such as shutting down a cluster node or a localized hardware failure. Azure Stack HCI implements an enhanced resync process that operates at much finer granularity than Windows Server 2019 and significantly reduces the resync operation time. This minimizes potential impact of multiple overlapping hardware failures.
- Failover Clustering witness selection. The lightweight, USB drive–based witness eliminates dependencies on reliable internet connectivity, which is required when using cloud witness-based configuration.
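The role of the witness vote can be shown with a simple majority calculation. The sketch below is a minimal illustration that assumes each node and the witness hold one vote and that quorum requires a strict majority of all configured votes. It deliberately ignores the dynamic vote adjustments that Failover Clustering can apply at run time, and the `has_quorum` function name is hypothetical.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """Strict majority: more than half of all configured votes must be online."""
    if not 0 <= votes_online <= total_votes:
        raise ValueError("votes_online must be between 0 and total_votes")
    return votes_online > total_votes // 2


# Two nodes plus a file share witness: three votes in total.
with_witness_one_node_down = has_quorum(votes_online=2, total_votes=3)

# Two nodes without a witness: losing either node leaves one vote out of two.
no_witness_one_node_down = has_quorum(votes_online=1, total_votes=2)

print(with_witness_one_node_down, no_witness_one_node_down)
```

In this simplified model, the witness is what lets a 2-node cluster keep a majority after a single node failure, which is why its availability should not depend on an unreliable link.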
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
Security considerations include:
- Azure Stack HCI basic security. Leverage Azure Stack HCI hardware components (such as Secure Boot, UEFI, and TPM) to build a secure foundation for Azure Stack HCI VM-level security, including Device Guard and Credential Guard. Use Windows Admin Center role-based access control to delegate management tasks by following the principle of least privilege.
- Azure Stack HCI advanced security. Apply Microsoft security baselines to Azure Stack HCI clusters and their Windows Server workloads by using Active Directory Domain Services (AD DS) with Group Policy. You can use Microsoft Advanced Threat Analytics (ATA) to detect and remediate cyber threats targeting AD DS domain controllers providing authentication services to Azure Stack HCI clusters and their Windows Server workloads.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
Cost optimization considerations include:
- Switchless vs switch-based cluster interconnects. The switchless interconnect topology consists of redundant connections between single-port or dual-port Remote Direct Memory Access (RDMA) adapters on each node (which forms a full mesh), with each node connected directly to every other node. While this is straightforward to implement in a 2-node cluster, larger clusters require additional network adapters in each node's hardware.
- Cloud-style billing model. Azure Stack HCI pricing follows the monthly subscription billing model, with a flat rate per physical processor core in an Azure Stack HCI cluster.
While there are no on-premises software licensing requirements for cluster nodes hosting the Azure Stack HCI infrastructure, Azure Stack HCI VMs might require individual OS licenses. Additional usage charges might also apply if you use other Azure services.
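Because the subscription charge is a flat rate per physical processor core, a rough monthly estimate is a single multiplication. The following sketch assumes all nodes have the same core count. The rate used here is a placeholder rather than a published price, and the function name is hypothetical. VM guest OS licenses and charges for other Azure services are not included.

```python
from decimal import Decimal

# Placeholder only: substitute the per-core monthly rate from current Azure pricing.
EXAMPLE_RATE_PER_CORE = Decimal("10.00")


def monthly_cluster_charge(nodes: int, cores_per_node: int, rate_per_core: Decimal) -> Decimal:
    """Estimate the monthly Azure Stack HCI subscription charge for one cluster.

    Assumes identical nodes and a flat rate per physical core.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be positive")
    return nodes * cores_per_node * rate_per_core


# A 2-node ROBO cluster with 16 physical cores per node (illustrative sizing).
estimate = monthly_cluster_charge(nodes=2, cores_per_node=16, rate_per_core=EXAMPLE_RATE_PER_CORE)
print(estimate)
```

Using `Decimal` rather than floating point keeps currency arithmetic exact, which matters once estimates are summed across many branch offices.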
Operational excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Overview of the operational excellence pillar.
Operational excellence considerations include:
- Simplified provisioning and management experience with Windows Admin Center. The Create Cluster Wizard in Windows Admin Center provides a wizard-driven interface that guides you through creating an Azure Stack HCI cluster. Similarly, Windows Admin Center simplifies the process of managing Azure Stack HCI VMs.
- Automation capabilities. Azure Stack HCI provides a wide range of automation capabilities, with OS updates combined with full-stack updates including firmware and drivers provided by Azure Stack HCI vendors and partners. With Cluster-Aware Updating (CAU), OS updates run unattended while Azure Stack HCI workloads remain online. This results in seamless transitions between cluster nodes that eliminate impact from post-patching reboots. Azure Stack HCI also offers support for automated cluster provisioning and VM management by using Windows PowerShell. You can run Windows PowerShell locally from one of the Azure Stack HCI servers or remotely from a management computer. Integration with Azure Automation and Azure Arc facilitates a wide range of additional automation scenarios for virtualized and containerized workloads.
- Decreased management complexity. Switchless interconnect eliminates the risk of switch device failures and the need for their configuration and management.
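The rolling pattern behind Cluster-Aware Updating can be sketched as a small simulation. This is not the CAU implementation or its API. It is a simplified model that drains one node at a time by moving its VMs to the least-loaded remaining node, updates the empty node, and then moves on. The function name, the placement dictionary, and the node and VM names are all hypothetical, and the sketch omits steps such as moving workloads back to their original node afterward.

```python
def rolling_update(placement: dict[str, list[str]]) -> list[str]:
    """Update nodes one at a time while every VM stays placed on some node.

    placement maps a node name to the VMs currently running on it and is
    modified in place. Returns a log of the actions taken.
    """
    log: list[str] = []
    for node in list(placement):
        others = [n for n in placement if n != node]
        if not others:
            raise ValueError("a rolling update needs at least two nodes")
        # Drain: move each VM to the currently least-loaded other node.
        for vm in list(placement[node]):
            target = min(others, key=lambda n: len(placement[n]))
            placement[node].remove(vm)
            placement[target].append(vm)
            log.append(f"move {vm}: {node} -> {target}")
        # The node is empty at this point, so a restart affects no workload.
        log.append(f"update and restart {node}")
        log.append(f"resume {node}")
    return log


cluster = {"node1": ["vm1", "vm2"], "node2": ["vm3"]}
for entry in rolling_update(cluster):
    print(entry)
```

The key property is that a node is only updated after it has been drained, so restarts never coincide with running workloads on that node.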
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. For more information, see Performance efficiency pillar overview.
Performance efficiency considerations include:
- Storage resiliency versus usage efficiency versus performance. Planning for Azure Stack HCI volumes involves identifying the optimal balance between resiliency, usage efficiency, and performance. The challenge results from the fact that maximizing one of these characteristics typically has a negative impact on at least one of the other two. For example, increasing resiliency reduces the usable capacity, while the resulting performance might vary depending on the resiliency type. In the case of nested two-way mirror volumes or nested mirror accelerated parity volumes, higher resiliency leads to lower capacity efficiency compared to traditional two-way mirroring. At the same time, the nested two-way mirror volume offers better performance than the nested mirror accelerated parity volume, but at the cost of lower usage efficiency.
- Storage Spaces Direct disk configuration. Storage Spaces Direct supports hard disk drives (HDDs), solid-state drives (SSDs), and NVMe drive types. The drive type directly impacts storage performance due to differences in performance characteristics between each type, and the caching mechanism, which is an integral part of Storage Spaces Direct configuration. Depending on the Azure Stack HCI workloads and budget constraints, you can choose to maximize performance, maximize capacity, or implement a drive configuration that provides balance between performance and capacity.
- Storage caching optimization. Storage Spaces Direct provides a built-in, persistent, real-time, read and write, server-side cache that maximizes storage performance. The cache should be sized and configured to accommodate the working set of your applications and workloads. In addition, Azure Stack HCI is compatible with the Cluster Shared Volume (CSV) in-memory read cache. Using system memory to cache reads can improve Hyper-V performance.
- Compute performance optimization. Azure Stack HCI offers support for graphics processing unit (GPU) acceleration, targeting high-performance AI/ML workloads that are geared towards edge scenarios.
- Networking performance optimization. As part of your design, be sure to include projected traffic bandwidth allocation when determining your optimal network hardware configuration. This includes provisions addressing switchless interconnect minimum bandwidth requirements.
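When you plan switchless interconnects, it helps to see how quickly a full mesh grows with the node count. The sketch below assumes every pair of nodes is joined by the same number of direct links, with two links per pair as an assumed default for redundancy. The function name and the default are illustrative, not a documented requirement.

```python
def full_mesh_requirements(nodes: int, links_per_pair: int = 2) -> tuple[int, int]:
    """Return (total direct links, adapter ports needed per node) for a full mesh.

    Every node connects directly to every other node, so there are
    nodes * (nodes - 1) / 2 node pairs.
    """
    if nodes < 2 or links_per_pair < 1:
        raise ValueError("need at least two nodes and one link per pair")
    pairs = nodes * (nodes - 1) // 2
    total_links = pairs * links_per_pair
    ports_per_node = (nodes - 1) * links_per_pair
    return total_links, ports_per_node


for n in (2, 3, 4):
    links, ports = full_mesh_requirements(n)
    print(f"{n} nodes: {links} links, {ports} ports per node")
```

The port count per node grows linearly and the link count grows quadratically, which is why switchless designs are most practical for the small clusters typical of ROBO deployments.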
Product documentation:
- About Site Recovery
- Azure Automation State Configuration overview
- Azure Kubernetes Service
- Azure Monitor overview
- Change Tracking and Inventory overview
- Manage registered servers with Azure File Sync
- Update Management overview
- What are Azure Arc-enabled data services?
- What is Azure Arc-enabled servers?
- What is the Azure Backup service?
Microsoft Learn modules:
- Configure Azure files and Azure File Sync
- Configure Azure Monitor
- Design your site recovery solution in Azure
- Introduction to Azure Arc enabled servers
- Introduction to Azure Arc-enabled data services
- Introduction to Azure Kubernetes Service
- Keep your virtual machines updated
- Protect your virtual machine settings with Azure Automation State Configuration
- Protect your virtual machines by using Azure Backup