Reliability in Azure HDInsight on Azure Kubernetes Service

This article describes reliability support in Azure HDInsight on Azure Kubernetes Service (AKS), and covers both specific reliability recommendations and disaster recovery and business continuity. For a more detailed overview of reliability principles in Azure, see Azure reliability.

Reliability recommendations

This section contains recommendations for achieving resiliency and availability. Each recommendation falls into one of two categories:

  • Health items cover areas such as configuration items and the proper function of the major components that make up your Azure workload, such as Azure resource configuration settings, dependencies on other services, and so on.

  • Risk items cover areas such as availability and recovery requirements, testing, monitoring, deployment, and other items that, if left unresolved, increase the chances of problems in the environment.

Reliability recommendations priority matrix

Each recommendation is marked in accordance with the following priority matrix:

Priority    Description
High        Immediate fix needed.
Medium      Fix within 3-6 months.
Low         Needs to be reviewed.

Reliability recommendations summary

Category        Priority    Recommendation
Availability                Default and minimum virtual machine size recommendations
Availability                Auto Scale HDInsight on AKS Clusters
Monitoring                  How to integrate with Log Analytics
Monitoring                  Monitoring with Azure Managed Prometheus and Grafana
Security                    Use NSG to restrict traffic to HDInsight on AKS

Availability zone support

Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. Availability zones are designed so that if one zone fails, the remaining zones continue to support regional services, capacity, and high availability.

Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see Regions and availability zones.

Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways. They can be either zone redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see Recommendations for using availability zones and regions.

Azure HDInsight on AKS supports availability zones by using Azure Kubernetes Service's ability to create zone-redundant node pools. When you create a cluster pool or cluster, you can select the availability zones in which to deploy it. After the cluster pool or cluster is created, you can't change its availability zones.

Prerequisites

  • Availability zones are only supported for cluster pool version >= 1.2 and cluster version >= 1.2.1.

  • Azure HDInsight on AKS has a single default SKU, which supports availability zones (AZ) in any Azure region that has AZ support.

    The following regions don't support AZ:

    Americas    Europe           Middle East    Africa    Asia Pacific
    West US     Germany North
  • Some VM SKUs may not support all availability zones in a region. If you select one of those SKUs, your HDInsight on AKS cluster pool or cluster can't use the availability zones that the SKU doesn't support.
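
The prerequisites above can be expressed as a small pre-deployment check. The following Python sketch is illustrative only: the function names, the plain dotted version format, and the idea of passing in the zones a VM SKU supports are assumptions made for this example, not part of any HDInsight on AKS API. Only the minimum versions and the two region names come from this article.

```python
# Regions this article lists as lacking availability zone support.
REGIONS_WITHOUT_AZ = {"West US", "Germany North"}

# Minimum versions stated in the prerequisites.
MIN_POOL_VERSION = (1, 2)
MIN_CLUSTER_VERSION = (1, 2, 1)


def parse_version(text):
    """Turn a dotted version such as '1.2.1' into (1, 2, 1).

    Assumes a purely numeric dotted string; real version strings may differ.
    """
    return tuple(int(part) for part in text.split("."))


def at_least(version, minimum):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(version), len(minimum))
    padded_version = version + (0,) * (width - len(version))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_version >= padded_minimum


def az_prerequisite_issues(region, pool_version, cluster_version,
                           requested_zones, sku_zones):
    """Return a list of human-readable problems; an empty list means no issue found.

    sku_zones is a hypothetical input: the zones the chosen VM SKU supports
    in the region, obtained however you normally look that up.
    """
    issues = []
    if region in REGIONS_WITHOUT_AZ:
        issues.append(f"Region '{region}' doesn't support availability zones.")
    if not at_least(parse_version(pool_version), MIN_POOL_VERSION):
        issues.append("Cluster pool version must be 1.2 or later.")
    if not at_least(parse_version(cluster_version), MIN_CLUSTER_VERSION):
        issues.append("Cluster version must be 1.2.1 or later.")
    unsupported = sorted(set(requested_zones) - set(sku_zones))
    if unsupported:
        issues.append(f"VM SKU doesn't support zone(s): {', '.join(unsupported)}.")
    return issues


if __name__ == "__main__":
    for problem in az_prerequisite_issues(
        region="West US",
        pool_version="1.1",
        cluster_version="1.2.0",
        requested_zones={"1", "2", "3"},
        sku_zones={"1", "2"},
    ):
        print(problem)
```

Returning a list of problems rather than a single Boolean lets you report every unmet prerequisite at once before attempting a deployment.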

SLA improvements

There are no increased SLAs for Azure HDInsight on AKS clusters with availability zones enabled.

Create a resource with availability zone enabled

  • Cluster pools: You can select one or more availability zones during cluster pool creation after you select the region.

  • Clusters: You can select one or more availability zones during cluster creation.

Fault tolerance

To prepare for an availability zone failure, over-provision service capacity so that your cluster can tolerate losing the capacity of one availability zone and continue to function without degraded performance during a zone-wide outage. For instance, if you enable three availability zones, your cluster should tolerate the loss of one third of its nodes (rounded up to the nearest integer).
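
The over-provisioning arithmetic can be sketched in a few lines of Python. This is a rough planning aid, not a sizing tool: it assumes nodes are spread as evenly as possible across the enabled zones (an assumption of this example, not a guarantee stated in this article), and the function names are hypothetical.

```python
def nodes_lost_in_zone_outage(total_nodes, zones):
    """Worst-case nodes lost when one zone fails, assuming an even spread.

    Uses integer ceiling division, matching the 'round up' guidance.
    """
    return (total_nodes + zones - 1) // zones


def nodes_to_provision(required_nodes, zones):
    """Smallest node count that still leaves required_nodes after one zone fails."""
    if zones < 2:
        raise ValueError("At least two zones are needed to survive a zone outage.")
    total = required_nodes
    # Grow the cluster until the surviving nodes cover the required capacity.
    while total - nodes_lost_in_zone_outage(total, zones) < required_nodes:
        total += 1
    return total


if __name__ == "__main__":
    required = 6  # nodes the workload needs to run without degradation
    zones = 3
    total = nodes_to_provision(required, zones)
    lost = nodes_lost_in_zone_outage(total, zones)
    print(f"Provision {total} nodes; losing one zone removes up to {lost}.")
    # prints "Provision 9 nodes; losing one zone removes up to 3."
```

With three zones, a workload that needs 6 nodes would be provisioned with 9, so that the 6 remaining nodes still cover the required capacity after one zone is lost.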

Zone down experience

The Azure HDInsight on AKS service is zone redundant. During a zone-wide outage, expect degraded performance due to the drop in capacity. You can still create new cluster pools and clusters in the availability zones that aren't impacted, and existing clusters can continue to function with reduced capacity. Recommendations and best practices for individual open-source workloads are provided in the documentation.

Disaster recovery and business continuity

Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.

When it comes to DR, Microsoft uses the shared responsibility model. In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fail over from a failed region to another enabled region. For those services, you are responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR, and you can use these service-specific features to support fast recovery as you develop your DR plan.

The Azure HDInsight on AKS control plane service and databases are deployed across Azure regions. The Azure HDInsight on AKS instances and database instances in each region are isolated from those in other regions. When a region-level outage occurs, all the resources in that region are unavailable, including the resource provider (RP) of the Azure HDInsight on AKS control plane, the database of the Azure HDInsight on AKS control plane, and all customer clusters in the region. In this case, the only option is to wait for the regional outage to end. When the regional outage is fully recovered, the Azure HDInsight on AKS service comes back and all customer clusters return to normal. You may encounter problems due to data inconsistency after the outage and may need to apply a manual fix based on your application workloads.

Multi-region disaster recovery

Azure HDInsight on AKS currently doesn't support cross-region failover. Improving business continuity with cross-region high availability and disaster recovery requires architectural designs of higher complexity and higher cost. You can design your own solution to back up key data and job status across different regions.

Outage detection, notification, and management

  • Use Azure monitoring tools on HDInsight on AKS to detect abnormal behavior in the cluster and set corresponding alert notifications. You can enable Log Analytics in various ways and use managed Prometheus service with Azure Grafana dashboards for monitoring. For more information, see Azure Monitor integration.

  • Subscribe to Azure health alerts to be notified about service issues, planned maintenance, and health and security advisories for a subscription, service, or region. Health notifications that include the issue cause and the resolution ETA help you to better execute failovers and failbacks. For more information, see Manage service health and Azure Service Health documentation.

Single-region disaster recovery

Currently, Azure HDInsight on AKS has only one standard service offering, and clusters are created in a single-region geography. Customers are responsible for disaster recovery settings based on their application requirements.

Capacity and proactive disaster recovery resiliency

Azure HDInsight on AKS and its customers operate under the shared responsibility model, which means that the customer must address disaster recovery requirements for the services they deploy and control. To ensure that recovery is proactive, customers should always predeploy secondaries, because there's no guarantee of capacity at the time of impact for those who haven't preallocated.

Unlike HDInsight, the virtual machines used in HDInsight on AKS clusters require the same quota as Azure VMs. For more information, see Capacity planning.

To learn more about the items discussed in this article, see: