Azure Well-Architected Framework perspective on Azure Stack HCI

Azure Stack HCI is a hyperconverged infrastructure (HCI) platform that provides local storage, network resources, and compute resources. You can use Azure Stack HCI to create and manage Windows and Linux virtual machines (VMs), Kubernetes clusters for containerized workloads, and other Azure Arc-enabled services. The platform uses Azure for streamlined deployment and management, which provides a unified and consistent management experience through Azure Arc. You can use Azure Stack HCI and Azure Arc capabilities to keep business systems and application data on-premises to address data sovereignty, regulation and compliance, and latency requirements.

This article assumes you have an understanding of hybrid systems and have working knowledge of Azure Stack HCI. The guidance in this article provides architectural recommendations that are mapped to the principles of the Azure Well-Architected Framework pillars.

Important

How to use this guide

Each section has a design checklist that presents architectural areas of concern along with design strategies localized to the technology scope.

Also included are recommendations on the technology capabilities that can help materialize those strategies. The recommendations don't represent an exhaustive list of all configurations available for Azure Stack HCI and its dependencies. Instead, they list the key recommendations mapped to the design perspectives. Use the recommendations to build your proof-of-concept or optimize your existing environments.

Foundational architecture that demonstrates the key recommendations:
Azure Stack HCI baseline reference architecture.

Technology scope

This review focuses on the interrelated decisions for the following Azure resources:

  • Azure Stack HCI (platform), version 23H2 and later
  • Azure Arc VMs (workload)

Note

This article covers the preceding scope and provides checklists and recommendations that are organized by platform architecture and workload architecture. Platform concerns are the responsibility of the platform administrators. Workload concerns are the responsibility of the workload operator and application developers. These roles and responsibilities are distinct and can be owned by separate teams or individuals. Keep that distinction in mind when you apply the guidance.

This guidance doesn't focus on specific resource types that you can deploy on Azure Stack HCI, such as Azure Arc VMs, Azure Kubernetes Service (AKS), and Azure Virtual Desktop. When you deploy these resource types on Azure Stack HCI, refer to the respective workload guidance to design solutions that meet your business requirements.

Reliability

The purpose of the Reliability pillar is to provide continued functionality by building enough resilience and the ability to recover fast from failures.

The Reliability design principles provide a high-level design strategy applied for individual components, system flows, and the system as a whole.

In hybrid cloud deployments, the goal is to reduce the effects of one component failure. Use these design checklists and configuration suggestions to lessen the impact of a component failure for workloads that you deploy on Azure Stack HCI.

It's important to distinguish between platform reliability and workload reliability. Workload reliability has a dependency on the platform. Application owners or developers must design applications that can deliver the defined reliability targets.

Design checklist

Start your design strategy based on the design review checklist for Reliability. Determine its relevance to your business requirements while keeping in mind the performance of Azure Stack HCI. Extend the strategy to include more approaches as needed.

  • (Azure Stack HCI platform architecture and workload architecture) Define workload reliability targets.

    • Set your service-level objectives (SLOs) so that you can evaluate availability targets. Calculate SLOs as a percentage, such as 99.9%, 99.95%, or 99.995%, that reflects workload uptime. Keep in mind that this calculation isn't based only on the platform metrics that the Azure Stack HCI cluster or workload emits. To get a comprehensive target measurement, also quantify and include factors such as expected downtime during releases, routine operations, supportability, and other workload-specific or organization-specific factors.

    • Microsoft-provided service-level agreements (SLAs) often influence SLO calculations. But Microsoft doesn't provide an SLA for the uptime and connectivity of Azure Stack HCI clusters or the deployed workload, because Microsoft doesn't control the customer datacenter reliability (such as power and cooling) or the people and processes that administer the platform.

  • (Azure Stack HCI platform architecture) Consider how performance and operations affect reliability.

    Degraded performance of the cluster or its dependencies can make the Azure Stack HCI platform unavailable. For example:

    • Without proper workload capacity planning, it's challenging to rightsize Azure Stack HCI clusters in the design phase, and a correctly sized cluster is a requirement for the workload to meet the desired reliability targets. Use the Azure Stack HCI sizer tool during cluster design. If you require highly available VMs, plan for at least N+1 nodes. For business-critical or mission-critical workloads where resiliency is paramount, consider an N+2 cluster size.

    • The reliability of the platform depends on how well the critical platform dependencies, such as physical disk types, perform. You must choose the right disk types for your requirements. For workloads that need low-latency and high-throughput storage, we recommend an all-flash (NVMe/SSD only) storage configuration. For general purpose compute, a hybrid storage (NVMe or SSDs for cache and HDDs for capacity) configuration might provide more storage space. But the tradeoff is that spinning disks have significantly lower performance if your workload exceeds the cache working set, and HDDs have a much lower mean time between failures (MTBF) than NVMe drives or SSDs.

      Performance Efficiency describes these examples in more detail.

    Improper Azure Stack HCI operations can affect patching and upgrades, testing, and consistency of deployments. Here are some examples:

    • If the Azure Stack HCI platform doesn't evolve with the latest hardware original equipment manufacturer (OEM) firmware, drivers, and innovations, the platform might not take advantage of the latest resiliency features. Apply hardware OEM driver and firmware updates regularly. For more information, see Azure Stack HCI solution catalog.

    • You must test the target environment for connectivity, hardware, and identity and access management before your deployment. Otherwise, you might deploy the Azure Stack HCI solution to an unstable environment, which can create reliability problems. You can use the environmental checker tool in standalone mode to detect problems, even before the cluster hardware is available.

      For operational guidance, see Operational Excellence.

  • (Azure Stack HCI platform architecture) Provide fault tolerance to the cluster and its infrastructure dependencies.

    • Storage design choices. For most deployments, the default option to "automatically create workload and infrastructure volumes" is sufficient. If you select the advanced option: "create required infrastructure volumes only", configure the appropriate volume fault tolerance within Storage Spaces Direct based on your workload requirements. These decisions influence the performance, capacity, and resiliency capabilities of the volumes. For example, a three-way mirror increases reliability and performance for clusters with three or more nodes. For more information, see Fault tolerance for storage efficiency and Create Storage Spaces Direct virtual disks and volumes.

    • Network architecture. Use a validated network topology to deploy Azure Stack HCI. Clusters with four or more physical nodes require the "storage switched" design. Clusters with two or three nodes can optionally use the "storage switchless" design. Regardless of the cluster size, we recommend that you use dual top of rack (ToR) switches for the management and compute intents (north and south uplinks) to provide increased fault tolerance. The dual ToR topology also provides resiliency during switch servicing (firmware update) operations. For more information, see Validated network topologies.

  • (Workload architecture) Build redundancy to provide resiliency.

    • Consider a workload that you deploy on a single Azure Stack HCI cluster as a locally redundant deployment. The cluster provides high availability at the platform level, but remember that the cluster is deployed in a single rack. Therefore, for business-critical or mission-critical use cases, we recommend that you deploy multiple instances of a workload or service across two or more separate Azure Stack HCI clusters, ideally in separate physical locations.

    • Use industry-standard high-availability patterns for workloads, for example a design that provides active/passive synchronous or asynchronous data replication (such as SQL Server Always On). Another example is an external network load balancing (NLB) technology that can route user requests across the multiple workload instances that run on Azure Stack HCI clusters that you deploy in separate physical locations. Consider using a partner external NLB device. Or evaluate the load balancing options that support traffic routing for hybrid and on-premises services, such as an Azure Application Gateway instance that uses Azure ExpressRoute or a VPN tunnel to connect to an on-premises service.

      For more information, see Deploy workloads instances across multiple Azure Stack HCI clusters.

  • (Workload architecture) Plan and test recoverability based on your workload recovery point objective (RPO) and recovery time objective (RTO) targets.

    Have a well-documented disaster recovery plan. Test the recovery steps regularly to ensure that your business continuity plans and processes are valid. Determine whether Azure Site Recovery is a viable choice for protecting VMs that run on Azure Stack HCI. For more information, see Protect VM workloads with Azure Site Recovery on Azure Stack HCI (preview).

  • (Workload architecture) Configure and regularly test workload backup and restore procedures.

    Business requirements for data recovery and retention drive the strategy for workload backups. A comprehensive strategy includes considerations for workload operating system (OS) and application persistent data, with the ability to restore individual (point-in-time) file-level and folder-level data. Configure the backup retention policies based on your data recovery and compliance requirements, which determine the number and age of available data recovery points. Explore Azure Backup as an option to enable host-level or VM guest-level backups for Azure Stack HCI. Review data protection solutions from Backup independent software vendor partners where relevant. For more information, see Azure Backup guidance and best practices and Azure Backup for Azure Stack HCI.
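
To make the SLO percentages from the first checklist item concrete, the following Python sketch converts an SLO into a yearly downtime budget and subtracts planned downtime, such as releases and routine operations. The function names and the 365-day year are assumptions made for illustration. They aren't part of any Azure tool.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year, 525,600 minutes


def downtime_budget_minutes(slo_percent: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime that an SLO percentage allows within the period."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return period_minutes * (100 - slo_percent) / 100


def remaining_budget_minutes(slo_percent: float, planned_minutes: float) -> float:
    """Return the budget left for unplanned outages after planned downtime
    (releases, routine operations) is subtracted from the yearly budget."""
    return downtime_budget_minutes(slo_percent) - planned_minutes


if __name__ == "__main__":
    for slo in (99.9, 99.95, 99.995):
        budget = downtime_budget_minutes(slo)
        print(f"SLO {slo}% allows about {budget:.1f} minutes of downtime per year")
```

A negative result from `remaining_budget_minutes` suggests that planned maintenance alone would exceed the target, which is a signal to revisit either the SLO or the operational procedures.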

Recommendations

Recommendation: Reserve the equivalent of one capacity disk worth of space per node within the Storage Spaces Direct storage pool. If you choose to create workload volumes after you deploy an Azure Stack HCI cluster (Advanced option: "create required infrastructure volumes only"), we recommend that you leave 5% to 10% of the total pool capacity unallocated in the storage pool.
Benefit: This reserved and unused, or free, capacity enables Storage Spaces Direct to repair "in-place" when a physical disk fails, which improves data resiliency and performance if a physical disk failure occurs.

Recommendation: Ensure that all physical nodes have network access to the list of required outbound HTTPS endpoints for Azure Stack HCI and Azure Arc.
Benefit: To reliably manage, monitor, and operate Azure Stack HCI clusters or workload resources, the required outbound network endpoints must be reachable, either directly or through a proxy server. A temporary interruption doesn't affect the running status of the workload but might affect manageability.

Recommendation: If you opt to create workload volumes (virtual disks) manually, use the most appropriate resiliency type to maximize workload resiliency and performance. For any user volumes that you create manually after you deploy the cluster, create a storage path for the volumes in Azure.
Benefit: The volume can store workload VM configuration files, VM virtual hard disks (VHDs), and VM images via the storage path. For Azure Stack HCI clusters with three or more nodes, consider using a three-way mirror to provide the highest resiliency and performance capabilities. We recommend that you use mirrored volumes for business-critical or mission-critical workloads.

Recommendation: Consider implementing workload anti-affinity rules to ensure that the VMs that host multiple instances of the same service run on separate physical hosts. This concept is similar to "availability sets" in Azure.
Benefit: Make all components redundant. For business-critical or mission-critical workloads, use multiple Azure Arc VMs or Kubernetes replica sets or pods to deploy multiple instances of your applications or services. This approach increases resiliency if an unplanned outage of a single physical node occurs.
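
The storage reserve guidance and the N+1 and N+2 node guidance from the Reliability checklist both come down to simple arithmetic. The following Python sketch shows one way to express them. The function names are hypothetical, and the node-loss check models memory only, ignoring host overhead, CPU, and storage. Treat it as a quick sanity check under those assumptions, not a replacement for the Azure Stack HCI sizer tool.

```python
def storage_reserve_tb(nodes: int, capacity_drive_tb: float) -> float:
    """Reserve the equivalent of one capacity drive per node in the pool."""
    return nodes * capacity_drive_tb


def unallocated_range_tb(pool_raw_tb: float) -> tuple[float, float]:
    """Return the 5% to 10% band of raw pool capacity to leave unallocated."""
    return (pool_raw_tb * 0.05, pool_raw_tb * 0.10)


def fits_after_node_loss(nodes: int, memory_per_node_gb: float,
                         workload_memory_gb: float, nodes_lost: int = 1) -> bool:
    """Check whether the workload memory still fits on the surviving nodes.

    nodes_lost=1 approximates an N+1 design and nodes_lost=2 an N+2 design.
    Simplification (assumption): only memory is modeled.
    """
    surviving = nodes - nodes_lost
    return surviving > 0 and surviving * memory_per_node_gb >= workload_memory_gb


if __name__ == "__main__":
    # Illustrative values only: 4 nodes, 7.68 TB capacity drives, 200 TB raw pool.
    print(storage_reserve_tb(4, 7.68))
    print(unallocated_range_tb(200.0))
    print(fits_after_node_loss(4, 512, 1400, nodes_lost=1))
    print(fits_after_node_loss(4, 512, 1400, nodes_lost=2))
```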

Security

The purpose of the Security pillar is to provide confidentiality, integrity, and availability guarantees to the workload.

The Security design principles provide a high-level design strategy for achieving those goals by applying approaches to the technical design of Azure Stack HCI.

Azure Stack HCI is a secure-by-default product that has more than 300 security settings enabled during the cloud deployment process. Default security settings provide a consistent security baseline to ensure that devices start in a known good state. And you can use drift protection controls to provide at-scale management.

Default security features in Azure Stack HCI include hardened OS security settings, Windows Defender Application Control, volume encryption via BitLocker, secret rotation, local built-in user accounts, and Microsoft Defender for Cloud. For more information, see Review security features.

Design checklist

Start your design strategy based on the design review checklist for Security. Identify vulnerabilities and controls to improve the security posture. Extend the strategy to include more approaches as needed.

  • (Azure Stack HCI platform architecture) Review the security baselines. Azure Stack HCI and security standards provide baseline guidance to strengthen the security posture of the platform and hosted workloads. If your workload must meet specific regulatory requirements, factor in the relevant security standards, such as Payment Card Industry Data Security Standard (PCI DSS) and Federal Information Processing Standard (FIPS) 140.

    Azure Stack HCI platform-provided default settings enable security features, including identity controls, network filtering, and encryption. These settings form a good security baseline for a newly provisioned Azure Stack HCI cluster. You can customize each setting based on your organizational security requirements.

    Make sure that you detect and protect against undesired security configuration drift.

  • (Azure Stack HCI platform architecture) Detect, prevent, and respond to threats. Continuously monitor the Azure Stack HCI environment and protect against existing and evolving threats.

    We recommend that you enable Defender for Cloud on Azure Stack HCI. Enable the basic Defender for Cloud plan (free tier) by using Defender Cloud Security Posture Management to monitor and identify steps that you can take to secure your Azure Stack HCI platform, along with other Azure and Azure Arc resources.

    To benefit from the enhanced security features, including security alerts for individual servers and Azure Arc VMs, enable Microsoft Defender for Servers on your Azure Stack HCI cluster nodes and Azure Arc VMs.

    • Use Defender for Cloud to measure the security posture of Azure Stack HCI nodes and workloads. Defender for Cloud provides a single pane of glass experience to help manage security compliance.

    • Use Defender for Servers to monitor the hosted VMs for threats and misconfigurations. You can also enable endpoint detection and response capabilities on Azure Stack HCI nodes.

    • Consider aggregating security and threat intelligence data from all sources into a centralized security information and event management (SIEM) solution, such as Microsoft Sentinel.

  • (Azure Stack HCI platform architecture and workload architecture) Create segmentation to contain the blast radius. There are several strategies to attain segmentation.

    • Identity. Keep roles and responsibilities for the platform and workload separate. Allow only authorized identities to carry out the specific operations that align with their designated roles. Azure Stack HCI platform administrators use both Azure and local domain credentials to do platform duties. Workload operators and application developers manage workload security. To simplify delegating permissions, use Azure Stack HCI built-in role-based access control (RBAC) roles, such as 'Azure Stack HCI Administrator' for platform administrators and 'Azure Stack HCI VM Contributor' or 'Azure Stack HCI VM Reader' for workload operators. For more information about specific built-in role actions, see Azure RBAC documentation for hybrid and multicloud roles.

    • Network. Isolate networks if needed. For example, you can provision multiple logical networks that use separate virtual local area networks (VLANs) and network address ranges. When you use this approach, ensure that the management network can reach each logical network and VLAN so that Azure Stack HCI cluster nodes can communicate with the VLAN networks through the ToR switches or gateways. This configuration is required for availability management of the workload, such as allowing infrastructure management agents to communicate with the workload guest OS.

    • For more information, see Recommendations for building a segmentation strategy.

  • (Azure Stack HCI platform architecture and workload architecture) Use a trusted identity provider to control access. We recommend Microsoft Entra ID for all authentication and authorization purposes. You can join a workload to an on-premises Windows Server Active Directory domain if required. Take advantage of features that support strong passwords, multifactor authentication, RBAC, and controls for the management of secrets.

  • (Azure Stack HCI platform architecture and workload architecture) Isolate, filter, and block network traffic. You might have a workload use case that requires virtual networks, microsegmentation via network security groups, network quality of service policies, or virtual appliance chaining so that you can bring in partner appliances for filtering. If you have such a workload, see software-defined network considerations for network reference patterns for a list of the supported features and capabilities that Network Controller provides.

  • (Workload architecture) Encrypt data to protect against tampering. Encrypt data in transit, data at rest, and data in use.

    • Data-at-rest encryption is enabled on data volumes that you create during deployment. These data volumes include both infrastructure volumes and workload volumes. For more information, see Manage BitLocker encryption.

    • Use trusted launch for Azure Arc VMs to improve security of Gen 2 VMs by using OS features of modern operating systems, such as Secure Boot, which can use a virtual Trusted Platform Module.

  • Operationalize secret management. Based on your organizational requirements, change the credentials that are associated with the deployment user identity for Azure Stack HCI. For more information, see Manage secrets rotation.

  • (Azure Stack HCI platform architecture) Enforce security controls. Use Azure Policy to audit and enforce built-in policies, such as "Application control policies should be consistently enforced" or "Encrypted volumes should be implemented". You can use these Azure policies to audit security settings and assess the compliance status of Azure Stack HCI. For examples of the available policies, see Azure policies.

  • (Workload architecture) Improve workload security posture with built-in policies. To assess Azure Arc VMs that run on Azure Stack HCI, you can apply built-in policies via the security benchmark, Azure Update Manager, or the Azure Policy guest configuration extension. You can use various policies to check the following conditions:

    • Log Analytics agent installation
    • Missing system updates and security patches
    • Vulnerability assessment and potential mitigations
    • Use of secure communication protocols

Recommendations

Recommendation: Use the security baseline and drift controls settings to apply and maintain security settings on cluster nodes.
Benefit: These configurations help to protect against unwanted changes and drift because they automatically refresh security settings every 90 minutes to enforce the intended security posture of Azure Stack HCI.

Recommendation: Use Windows Defender Application Control in Azure Stack HCI.
Benefit: Windows Defender Application Control reduces the attack surface of Azure Stack HCI. Use the Azure portal or PowerShell to view policy settings and control policy modes. Windows Defender Application Control policies help to control which drivers and apps are allowed to run on your system.

Recommendation: Enable volume encryption via BitLocker for data encryption-at-rest protection.
Benefit: BitLocker protects OS and data volumes by encrypting the cluster shared volumes that are created on the Azure Stack HCI cluster. BitLocker uses XTS-AES 256-bit encryption. We recommend that you keep the volume encryption default setting enabled during Azure Stack HCI cloud deployment for all data volumes.

Recommendation: Export BitLocker recovery keys to store them in a secure location that's external to the Azure Stack HCI cluster.
Benefit: You might need BitLocker keys during specific troubleshooting or recovery actions. We recommend that you export, save, and back up the encryption keys for OS and data volumes from each Azure Stack HCI cluster via the 'Get-AsRecoveryKeyInfo' PowerShell cmdlet. Save the keys in a secure external location, such as Azure Key Vault.

Recommendation: Use a SIEM solution to increase security monitoring and alerting capabilities. To do so, you can onboard Azure Arc-enabled servers (Azure Stack HCI platform nodes) to Microsoft Sentinel. Alternatively, if you use a different SIEM solution, configure syslog forwarding of security events to the chosen solution.
Benefit: Forward security event data by using Microsoft Sentinel or syslog forwarding to provide alerting and reporting capabilities through integration with a customer-managed SIEM solution.

Recommendation: Use Server Message Block (SMB) signing to enhance data-in-transit protection, which is enabled in the "default security settings."
Benefit: SMB signing allows you to digitally sign SMB traffic between an Azure Stack HCI platform and systems external to the platform (north-south traffic). Configure signing for external SMB traffic between the Azure Stack HCI platform and other systems to help prevent relay attacks.

Recommendation: Use the SMB encryption setting to enhance data-in-transit protection, which is enabled in the "default security settings."
Benefit: The SMB encryption for in-cluster traffic setting controls the encryption of traffic between physical nodes in the Azure Stack HCI cluster (east-west traffic) on your storage network.
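
The drift control recommendation describes settings that are periodically reset to a known baseline. The following Python sketch illustrates only the general idea of comparing current settings against a baseline and restoring the expected values. It isn't how Azure Stack HCI implements drift protection, and the setting names and values are hypothetical.

```python
def detect_drift(baseline: dict, current: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that differs
    from the baseline. A missing setting is reported with actual=None."""
    drift = {}
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual != expected:
            drift[name] = (expected, actual)
    return drift


def remediate(baseline: dict, current: dict) -> dict:
    """Return a copy of the current settings with drifted values reset."""
    fixed = dict(current)
    for name, (expected, _actual) in detect_drift(baseline, current).items():
        fixed[name] = expected
    return fixed


if __name__ == "__main__":
    # Hypothetical setting names, used for illustration only.
    baseline = {
        "smb_signing_required": True,
        "data_volume_encryption": True,
        "application_control_mode": "Enforced",
    }
    current = {
        "smb_signing_required": False,
        "data_volume_encryption": True,
    }
    print(detect_drift(baseline, current))
    print(remediate(baseline, current) == baseline)
```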

Cost Optimization

Cost Optimization focuses on detecting spend patterns, prioritizing investments in critical areas, and optimizing in others to meet the organization's budget while meeting business requirements.

The Cost Optimization design principles provide a high-level design strategy for achieving those goals and making tradeoffs as necessary in the technical design related to Azure Stack HCI and its environment.

Design checklist

Start your design strategy based on the design review checklist for Cost Optimization for investments. Fine-tune the design so that the workload is aligned with the budget that's allocated for the workload. Your design should use the right Azure capabilities, monitor investments, and find opportunities to optimize over time.

Azure Stack HCI incurs costs for hardware, software licensing, workloads, guest VMs (Windows Server or Linux) licensing, and other integrated cloud services, such as Azure Monitor and Defender for Cloud.

  • (Azure Stack HCI platform architecture and workload architecture) Estimate realistic costs as part of cost modeling. Use the Azure pricing calculator to select and configure services like Azure Stack HCI, Azure Arc, and AKS on Azure Stack HCI. Experiment with various configurations and payment options to model costs.

  • (Azure Stack HCI platform architecture and workload architecture) Optimize the cost of Azure Stack HCI hardware. Choose a hardware OEM partner that aligns with your business and commercial requirements. To explore the certified list of validated nodes, integrated systems, and premier solutions, see Azure Stack HCI solutions catalog. Communicate your workload characteristics, size, quantity, and performance with your hardware partner so that you can rightsize a cost-effective hardware solution for the Azure Stack HCI nodes and cluster size.

  • (Azure Stack HCI platform architecture) Optimize your licensing costs. Azure Stack HCI software is licensed and billed on a "per physical CPU core" basis. Use existing on-premises core licenses with Azure Hybrid Benefit to reduce licensing costs for Azure Stack HCI workloads, such as Azure Arc VMs that run Windows Server or SQL Server, AKS, and Azure Arc-enabled Azure SQL Managed Instance. For more information, see Azure Hybrid Benefit cost calculator.

  • (Azure Stack HCI platform architecture) Save on environment costs. Evaluate whether the following options can help optimize your resource usage.

    • Take advantage of discount programs that Microsoft offers. Consider using Azure Hybrid Benefit to reduce the cost to run Azure Stack HCI and Windows Server workloads. For more information, see Azure Hybrid Benefit for Azure Stack HCI.

    • Explore promotional offers. Take advantage of the Azure Stack HCI 60-day free trial after registration for initial proofs of concept or validations.

  • (Azure Stack HCI platform architecture) Save on operational costs.

    • Evaluate technology options for patching, updating, and other operations. Update Manager is free for Azure Stack HCI clusters that have Azure Hybrid Benefit and Azure Arc VM management enabled. For more information, see Update Manager FAQ and Update Manager pricing.

    • Evaluate costs related to observability. Set up alert rules and data collection rules (DCRs) to meet your monitoring and auditing needs. The amount of data that your workload ingests, processes, and retains directly influences costs. Optimize by using smart retention policies, limiting the number and frequency of alerts, and choosing the right storage tier for storing logs. For more information, see Cost Optimization guidance for Log Analytics.

  • (Workload architecture) Evaluate density over isolation. Use AKS on Azure Stack HCI to improve density and simplify workload management so that you can enable containerized applications to scale across multiple datacenter or edge locations. For more information, see AKS on Azure Stack HCI pricing.
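
The observability item in this checklist notes that the amount of data ingested directly influences cost. As a rough illustration, the following Python sketch estimates monthly ingestion volume and cost from a per-node daily volume. The function names are hypothetical, and the price per GB is a placeholder parameter, not a published rate. Take actual rates from the Azure pricing calculator.

```python
def monthly_ingestion_gb(nodes: int, gb_per_node_per_day: float,
                         days: int = 30) -> float:
    """Estimate the log volume ingested per month across all cluster nodes."""
    return nodes * gb_per_node_per_day * days


def monthly_ingestion_cost(nodes: int, gb_per_node_per_day: float,
                           price_per_gb: float, days: int = 30) -> float:
    """Estimate monthly ingestion cost. price_per_gb is a placeholder input."""
    return monthly_ingestion_gb(nodes, gb_per_node_per_day, days) * price_per_gb


if __name__ == "__main__":
    # Placeholder values: 4 nodes, price of 2.0 currency units per GB.
    before = monthly_ingestion_cost(4, 1.5, price_per_gb=2.0)
    after = monthly_ingestion_cost(4, 0.5, price_per_gb=2.0)
    print(f"Estimated saving from a tighter DCR: {before - after:.2f} per month")
```

Comparing the estimate before and after you narrow a data collection rule gives a quick sense of whether the tuning effort is worthwhile.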

Recommendations

Recommendation: Use Azure Hybrid Benefit for Azure Stack HCI if you have Windows Server Datacenter licenses with Software Assurance.
Benefit: With Azure Hybrid Benefit for Azure Stack HCI, you can maximize the value of your on-premises licenses and modernize your existing infrastructure to Azure Stack HCI at no additional cost.

Recommendation: Choose either the Windows Server subscription add-on or bring your own license to license and activate the Windows Server VMs and use them on Azure Stack HCI. For more information, see License Windows Server VMs on Azure Stack HCI.
Benefit: You can use any existing Windows Server licenses and activation methods. Optionally, you can enable the Windows Server subscription add-on, which is available only for Azure Stack HCI, to subscribe to Windows Server guest licenses through Azure. The add-on is charged for the total number of physical cores in the Azure Stack HCI cluster.

Recommendation: Use the Azure verification for VMs benefit extended to Azure Stack HCI so that supported Azure-exclusive workloads can work outside of the cloud. This benefit is enabled by default on Azure Stack HCI version 23H2 or later.
Benefit: Use this benefit so that the VMs can operate in other Azure environments and workloads can benefit from offers that are available only in Azure, such as Extended Security Updates enabled by Azure Arc.
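
Because both the Azure Stack HCI software and the Windows Server subscription add-on are charged per physical core, a small model can help you compare licensing options. The following Python sketch is a simplification under stated assumptions: the per-core rates are placeholder inputs, the `Cluster` type and function name are hypothetical, and Azure Hybrid Benefit is modeled as removing the per-core host charge. Verify the actual terms and rates for your agreement before you rely on the numbers.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    nodes: int
    cores_per_node: int

    @property
    def physical_cores(self) -> int:
        """Total physical cores, the unit that per-core charges are based on."""
        return self.nodes * self.cores_per_node


def monthly_license_cost(cluster: Cluster, host_fee_per_core: float,
                         ws_addon_per_core: float, *,
                         hybrid_benefit: bool = False,
                         ws_addon: bool = False) -> float:
    """Estimate monthly licensing cost. Both rates are placeholder inputs."""
    # Assumption: Azure Hybrid Benefit removes the per-core host charge.
    host = 0.0 if hybrid_benefit else cluster.physical_cores * host_fee_per_core
    # The add-on is charged for every physical core in the cluster.
    guests = cluster.physical_cores * ws_addon_per_core if ws_addon else 0.0
    return host + guests


if __name__ == "__main__":
    cluster = Cluster(nodes=4, cores_per_node=32)
    print(monthly_license_cost(cluster, 10.0, 20.0))
    print(monthly_license_cost(cluster, 10.0, 20.0, ws_addon=True))
    print(monthly_license_cost(cluster, 10.0, 20.0, hybrid_benefit=True))
```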

Operational Excellence

Operational Excellence primarily focuses on procedures for development practices, observability, and release management.

The Operational Excellence design principles provide a high-level design strategy for achieving those goals for the operational requirements of the workload.

Monitoring and diagnostics are crucial. You can use metrics to measure performance statistics and to troubleshoot and remediate problems quickly. For more information about how to troubleshoot problems, see Operational Excellence design principles and Collect diagnostic logs for Azure Stack HCI.

Design checklist

Start your design strategy based on the design review checklist for Operational Excellence for defining processes for observability, testing, and deployment related to Azure Stack HCI.

  • (Azure Stack HCI platform architecture) Increase supportability of Azure Stack HCI. Observability is enabled by default at the time of deployment. These capabilities enhance the supportability of the platform. Telemetry and diagnostic information is shared securely from the platform by using the AzureEdgeTelemetryAndDiagnostics extension, which is installed on all Azure Stack HCI cluster nodes by default. For more information, see Azure Stack HCI observability.

  • (Azure Stack HCI platform architecture) Use Azure services to reduce operational complexity and increase management scale. Azure Stack HCI is integrated with Azure to enable services such as Update Manager for patching the platform and Azure Monitor for monitoring and alerting. You can use Azure Arc and Azure Policy to manage security configuration and compliance auditing. Implement Defender for Cloud to help manage cyber threats and vulnerability. Use Azure as the control plane for these operational processes and procedures to help reduce complexity, improve efficiencies of scale, and improve management consistency.

  • (Workload architecture) Plan IP address network range requirements for workloads in advance. Azure Stack HCI provides a platform to deploy and manage virtualized or containerized workloads. Also consider the IP address requirements for logical networks that your workload uses. Review these resources:

  • (Workload configuration) Enable monitoring and alerting for workloads that you deploy on Azure Stack HCI. You can use Azure Monitor for virtual machines or VM Insights for Azure Arc VMs, or use Container Insights and managed Prometheus for AKS clusters.

    Evaluate whether you should use a centralized Log Analytics workspace for your workload. For an example of a shared log sink (data location), see Workload management and monitoring recommendations.

  • (Azure Stack HCI platform architecture) Use proper validation techniques for a safe deployment. Use the environmental checker tool in standalone mode to assess the readiness of the target environment before you deploy an Azure Stack HCI solution. This tool validates the proper configuration of required connectivity, hardware, Windows Server Active Directory, networks, and Azure Arc integration prerequisites.

  • (Azure Stack HCI platform architecture) Get current and stay current. Use the Azure Stack HCI solution catalog to stay current with the latest hardware OEM innovations for Azure Stack HCI cluster deployments. Consider using premium solutions to benefit from extra integration, turn-key deployment capabilities, and a simplified update experience.

    Use Update Manager to update the platform and manage the OS, core agents, and services, including solution extensions. Stay current, and consider using the "Enable automatic upgrade" setting where possible for extensions.
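
As a minimal sketch of the IP address planning item in the preceding checklist, the following Python snippet uses the standard-library `ipaddress` module to check that a planned workload logical network doesn't overlap an existing range and has enough usable addresses. The function name, the subnets, and the reserved-address count are hypothetical placeholders for illustration. They aren't values from the Azure Stack HCI documentation.

```python
import ipaddress


def check_logical_network(planned_cidr, existing_cidrs, required_ips, reserved_ips=1):
    """Return (ok, reason) for a planned workload logical network.

    reserved_ips is an assumed number of addresses set aside for a
    gateway or similar. Adjust it for your environment. The usable
    count assumes an IPv4 prefix of /30 or shorter.
    """
    planned = ipaddress.ip_network(planned_cidr)
    for cidr in existing_cidrs:
        if planned.overlaps(ipaddress.ip_network(cidr)):
            return False, f"{planned} overlaps {cidr}"
    # Exclude the network and broadcast addresses, then the reserved ones.
    usable = planned.num_addresses - 2 - reserved_ips
    if usable < required_ips:
        return False, f"{planned} has {usable} usable addresses, need {required_ips}"
    return True, f"{planned} has {usable} usable addresses"


# Hypothetical example: a management range already in use and a new workload range.
print(check_logical_network("10.0.2.0/24", ["10.0.1.0/24"], required_ips=120))
```

A check like this can run before deployment for each logical network that a workload uses, so that range conflicts surface during planning rather than after VMs are provisioned.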

Recommendations

Recommendation: Enable Monitor Insights on Azure Stack HCI clusters to enhance monitoring and alerting by using native Azure capabilities.

Insights can monitor key Azure Stack HCI features by using the cluster performance counters and event log channels that are collected by the DCR.

For certain hardware infrastructure, such as Dell APEX, you can visualize hardware events in real time.

For more information, see Feature workbooks.

Benefit: Azure manages Insights, so it's always up to date, it's scalable across multiple clusters, and it's highly customizable.

Insights provides access to default workbooks with basic metrics, along with specialized workbooks that are created for monitoring key features of Azure Stack HCI. This feature provides near real-time monitoring. You can create graphs and customized visualizations by using aggregation and the filter functionality. You can also configure custom alert rules.

The cost of Insights is based on the quantity of data ingested and the data retention settings of the Log Analytics workspace. When you enable Azure Stack HCI Insights, we recommend that you use the DCR created by the Insights creation experience. The prefix of the DCR name is AzureStackHCI-. It's configured to collect only the required data.

Recommendation: Set up alerts, and configure the alert processing rules based on your organizational requirements. Get notified of changes in health, metrics, logs, or other types of observability data.

- Health alerts
- Log alerts
- Metric alerts

For more information, see Recommended rules for metric alerts.

Benefit: Integrate Monitor alerts with Azure Stack HCI to get several key benefits at no extra cost. Get near real-time monitoring and customize alerts to notify the right team or admin for remediation.

You can collect a comprehensive list of metrics for compute, storage, and network resources in Azure Stack HCI. Perform advanced logic operations on your log data and evaluate metrics of your Azure Stack HCI system at regular intervals.

Recommendation: Use the update feature to integrate and manage various aspects of the Azure Stack HCI solution in one place. For more information, see About updates in Azure Stack HCI.

Benefit: The update orchestrator is installed during the initial Azure Stack HCI cluster deployment. This feature automates updates and management operations. To keep Azure Stack HCI in a supported state, make sure that you update your clusters on a regular cadence to move to new baseline builds when they become available. This method provides new capabilities and improvements to the platform.

For more information about release trains, the cadence of updates, and the support window of each baseline build, see Azure Stack HCI version 23H2 release information.

Recommendation: To help with hands-on skilling, labs, training events, product demos, or proof-of-concept projects, consider using Jumpstart HCIBox.

Benefit: Rapidly deploy Azure Stack HCI without the need for physical hardware by using a VM on Azure to deploy the solution. HCIBox supports Azure Stack HCI version 23H2 to enable rapid testing and evaluation of the latest capabilities of Azure edge products, such as native Azure Arc and AKS integration in a self-contained sandbox.

You can deploy this sandbox to an Azure subscription by using a VM that supports nested virtualization to emulate an Azure Stack HCI cluster inside an Azure VM. Get Azure Stack HCI features like the new cloud deployment feature with minimal manual effort.

For more information, see Microsoft Tech Community blog.

Performance Efficiency

Performance Efficiency is about maintaining user experience even when there's an increase in load by managing capacity. The strategy includes scaling resources, identifying and optimizing potential bottlenecks, and optimizing for peak performance.

The Performance Efficiency design principles provide a high-level design strategy for achieving those capacity goals against the expected usage.

Design checklist

Start your design strategy based on the design review checklist for Performance Efficiency. Define a baseline that's based on key indicators for Azure Stack HCI.

  • (Azure Stack HCI platform architecture) Use the Azure Stack HCI-validated hardware or integrated systems from OEM partner offerings. Consider using the premium solution builders in the Azure Stack HCI catalog to optimize the performance of your Azure Stack HCI environment.

  • (Azure Stack HCI platform storage architecture) Choose the right physical disk types for the Azure Stack HCI cluster nodes based on your workload performance and capacity requirements. For high-performance workloads that require low latency and high-throughput storage, consider using an all-flash (NVMe/SSD only) storage configuration. For general purpose compute or large storage capacity requirements, consider using hybrid storage (SSD or NVMe for cache tier and HDDs for capacity tier), which might provide increased storage capacity.

  • (Azure Stack HCI platform architecture) Use the Azure Stack HCI sizer tool during the cluster design (pre-deployment) phase. Azure Stack HCI clusters should be sized appropriately by using the workload capacity, performance, and resiliency requirements as inputs. The size determines the maximum number of physical nodes that can be offline simultaneously (cluster quorum) during planned (maintenance) or unplanned (power or hardware failure) events. For more information, see Cluster quorum overview.

  • (Azure Stack HCI platform architecture) Use all-flash (NVMe or SSD) based solutions for workloads that have high-performance or low-latency requirements. These workloads include but are not limited to highly transactional database technologies, production AKS clusters, or any mission-critical or business-critical workloads with low-latency or high-throughput storage requirements. Use all-flash deployments to maximize storage performance. All-NVMe or all-SSD configurations (especially at a very small scale) improve storage efficiency and maximize performance because no drives are used as a cache tier. For more information, see All-flash-based storage.

  • (Azure Stack HCI platform architecture) Establish a performance baseline for Azure Stack HCI cluster storage before you deploy production workloads. Use Insights to monitor Azure Stack HCI features and track the performance of a single Azure Stack HCI cluster or multiple clusters simultaneously.

  • (Azure Stack HCI platform architecture) Consider monitoring Resilient File System (ReFS) deduplication and compression after you enable Insights for the Azure Stack HCI cluster. Determine whether you should use this feature based on your workload storage usage and capacity requirements. This feature provides monitoring for ReFS deduplication and compression savings, performance impact, and jobs. For more information, see Monitor ReFS deduplication and compression.

    As a minimum requirement, plan to reserve one physical node's worth of capacity (N+1) across the cluster so that cluster nodes can be drained when Update Manager applies updates. Consider reserving two physical nodes' worth of capacity (N+2) for business-critical or mission-critical use cases.
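
The capacity reserve described above can be checked with simple arithmetic. The following Python sketch is illustrative only: the function name and the example figures are assumptions, and it treats capacity as a single dimension (for example, GB of memory) with identical nodes.

```python
def can_tolerate_offline_nodes(node_count, per_node_capacity, workload_demand, offline_nodes=1):
    """Return True if the workload still fits when `offline_nodes` nodes are down.

    Capacity and demand must use the same unit, such as GB of memory.
    offline_nodes=1 models an N+1 reserve and offline_nodes=2 models N+2.
    """
    if offline_nodes >= node_count:
        return False  # no nodes would remain to host the workload
    remaining_capacity = (node_count - offline_nodes) * per_node_capacity
    return workload_demand <= remaining_capacity


# Hypothetical cluster: 4 nodes with 512 GB of memory each, 1,400 GB of VM memory demand.
print(can_tolerate_offline_nodes(4, 512, 1400, offline_nodes=1))  # N+1 check
print(can_tolerate_offline_nodes(4, 512, 1400, offline_nodes=2))  # N+2 check
```

In this hypothetical case, three remaining nodes provide 1,536 GB, so the N+1 check passes, while two remaining nodes provide only 1,024 GB, so the N+2 check fails. Repeat the check for each constrained resource, such as vCPUs, memory, and storage.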

Recommendations

Recommendation: If you select the advanced option to "create infrastructure volumes only" during Azure Stack HCI cluster deployment, we recommend that you create the virtual disks by using mirroring when you create workload volumes for performance-intensive workloads.

Benefit: This recommendation benefits workloads that have strict latency requirements or that need high throughput with a mix of random read and write input/output operations per second (IOPS), such as SQL Server databases, Kubernetes clusters, or other performance-sensitive VMs. Deploy the workload VHDs on volumes that use mirroring to maximize performance and resiliency. Mirroring is faster than any other resiliency type.

Recommendation: Consider using DiskSpd to test workload storage performance capabilities of the Azure Stack HCI cluster.

You can also use VMFleet to generate load and measure the performance of a storage subsystem. Evaluate whether you should use VMFleet for measuring storage subsystem performance.

Benefit: Establish a baseline for Azure Stack HCI cluster performance before you deploy production workloads. DiskSpd allows administrators to test the storage performance of the cluster by using various command-line parameters. The main function of DiskSpd is to issue read and write operations and output performance metrics, such as latency, throughput, and IOPS.
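
To illustrate the kind of command-line parameters that DiskSpd accepts, the following Python sketch assembles an argument list for a mixed random read/write test. The helper name, the default values, and the target path are hypothetical. Choose values that reflect your own workload profile, and treat this example as a starting point rather than a recommended test configuration.

```python
def build_diskspd_args(target, block_size="4K", duration_s=60, threads=4,
                       outstanding_io=32, write_pct=30, file_size="10G"):
    """Build a DiskSpd argument list for a random read/write test."""
    if not 0 <= write_pct <= 100:
        raise ValueError("write_pct must be between 0 and 100")
    return [
        "diskspd.exe",
        f"-b{block_size}",      # block size of each I/O
        f"-d{duration_s}",      # test duration in seconds
        f"-t{threads}",         # threads per target file
        f"-o{outstanding_io}",  # outstanding I/O requests per thread
        "-r",                   # random access instead of sequential
        f"-w{write_pct}",       # percentage of write operations
        "-Sh",                  # disable software caching and hardware write caching
        "-L",                   # collect latency statistics
        f"-c{file_size}",       # create the test file with this size
        target,
    ]


# Hypothetical target path on a cluster shared volume.
args = build_diskspd_args(r"C:\ClusterStorage\Volume1\diskspd-test.dat")
print(" ".join(args))
```

On a cluster node where DiskSpd is available, you could pass the list to `subprocess.run(args, check=True)`. Run the same parameters before and after changes so that the latency, throughput, and IOPS results are comparable with your baseline.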

Tradeoffs

There are design tradeoffs with the approaches described in the pillar checklists. Here are some examples of advantages and drawbacks.

Building redundancy increases costs

  • Understand your workload's requirements up front, such as the workload RTO and RPO targets and storage performance requirements (IOPS and throughput), when you design and procure the hardware for an Azure Stack HCI solution. To deploy highly available workloads, we recommend a minimum of a three-node cluster, which enables three-way mirroring for workload volumes and data. For the compute resources, ensure that you deploy a minimum of N+1 physical nodes, which reserves a single node's worth of capacity in the cluster at all times. For business-critical or mission-critical workloads, consider reserving N+2 nodes' worth of capacity to provide increased resiliency. For example, if two nodes in the cluster are offline, the workload can remain online. This approach covers scenarios such as a node that runs the workload going offline during a planned update procedure, which results in two nodes being offline simultaneously.

  • For business-critical or mission-critical workloads, we recommend that you deploy two or more separate Azure Stack HCI clusters and deploy multiple instances of your workload services across the separate clusters. Use a workload design pattern that takes advantage of data replication and application load balancing technologies. For example, SQL Server Always On availability groups use synchronous or asynchronous database replication to achieve low RTO and RPO targets across separate clusters in different datacenters.

  • Consequently, an increase in workload resiliency and a decrease in RTO and RPO targets increase costs and require well-architected applications and operational rigor.

Providing scalability without effective workload planning increases costs

  • Incorrect cluster sizing can lead to insufficient capacity or reduced return on investment (ROI) if the hardware is overprovisioned. Both scenarios affect costs.

  • Increased capacity equals higher costs. During the Azure Stack HCI cluster design phase, adequate planning is required to rightsize the capabilities and number of cluster nodes based on workload capacity requirements. Therefore, you must understand the workload requirements (vCPUs, memory, storage, and number of VMs) and allow for some extra headroom in addition to projected growth. You can perform an add-node gesture when you use a "storage switched" design. But it can take a long time to get more hardware after your deployment, and an add-node gesture is more complex than sizing the cluster hardware and number of nodes (maximum 16 nodes) appropriately during the initial deployment.

  • There are disadvantages if you overprovision the node hardware specification and select the incorrect number of nodes (size of the cluster). For example, if the workload requirements are much smaller than the cluster's overall capacity and the hardware is underused throughout the hardware warranty period, the ROI value might decrease.
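
The sizing considerations above can be sketched as a rough node-count estimate. This Python example is a simplified illustration and not a substitute for the Azure Stack HCI sizer tool. The function names, the 20 percent headroom, and the per-node figures are assumptions. The per-node vCPU value stands for whatever vCPU capacity you plan for each node, and the model ignores storage resiliency overhead.

```python
import math

MAX_NODES = 16  # maximum cluster size stated in the text above


def nodes_required(demand, per_node_capacity, growth_headroom=0.2, reserve_nodes=1):
    """Estimate the node count for one resource, including headroom and an N+x reserve."""
    needed = math.ceil(demand * (1 + growth_headroom) / per_node_capacity)
    total = needed + reserve_nodes
    if total > MAX_NODES:
        raise ValueError(f"{total} nodes exceeds the {MAX_NODES}-node cluster maximum")
    return total


def size_cluster(requirements, node_spec, **kwargs):
    """Return the node count driven by the most constrained resource."""
    return max(
        nodes_required(requirements[name], node_spec[name], **kwargs)
        for name in requirements
    )


# Hypothetical workload totals and per-node hardware specification.
requirements = {"vcpus": 300, "memory_gb": 2000}
node_spec = {"vcpus": 128, "memory_gb": 768}
print(size_cluster(requirements, node_spec))
```

In this hypothetical case, memory is the constraining resource. If the estimate lands far below the capacity of the hardware that you plan to buy, that gap is the overprovisioning risk described above. If the estimate raises an error, consider larger nodes or more than one cluster.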

Azure policies

Azure provides an extensive set of built-in policies related to Azure Stack HCI and its dependencies. Some of the preceding recommendations can be audited through Azure Policy. For example, you can check whether:

  • Host and VM networking is protected.
  • Volumes are encrypted.
  • Application control policies are consistently enforced.
  • Secured-core requirements are met.

Review the Azure Stack HCI built-in policies. Defender for Cloud has new recommendations that show the compliance state for the built-in policies. For more information, see Built-in policies for Azure Security Center.

If your workload runs on Azure Arc VMs that you deploy on Azure Stack HCI, consider built-in policies, such as denying the creation or modification of Extended Security Updates licenses. For more information, see Built-in policy definitions for Azure Arc-enabled workloads.

Consider creating custom policies to provide extra governance for both the Azure Stack HCI resources and Azure Arc VMs that you deploy on an Azure Stack HCI cluster. For example:

  • Auditing Azure Stack HCI host registration with Azure
  • Ensuring that hosts run the latest OS version
  • Checking for required hardware components and network configurations
  • Verifying the enablement of necessary Azure services and security settings
  • Confirming the installation of required extensions
  • Assessing the deployment of Kubernetes clusters and AKS integration

Azure Advisor recommendations

Azure Advisor is a personalized cloud consultant that helps you follow best practices to optimize your Azure deployments. Here are some recommendations that can help you improve the reliability, security, cost effectiveness, performance, and operational excellence of your VMs.

Next steps