Reliability in Azure HDInsight on Azure Kubernetes Service

This article describes reliability support in Azure HDInsight on Azure Kubernetes Service (AKS), and covers both specific reliability recommendations and disaster recovery and business continuity. For a more detailed overview of reliability principles in Azure, see Azure reliability.

Reliability recommendations

This section contains recommendations for achieving resiliency and availability. Each recommendation falls into one of two categories:

  • Health items cover areas such as configuration items and the proper function of the major components that make up your Azure Workload, such as Azure Resource configuration settings, dependencies on other services, and so on.

  • Risk items cover areas such as availability and recovery requirements, testing, monitoring, deployment, and other items that, if left unresolved, increase the chances of problems in the environment.

Reliability recommendations priority matrix

Each recommendation is marked in accordance with the following priority matrix:

Image Priority Description
High Immediate fix needed.
Medium Fix within 3-6 months.
Low Needs to be reviewed.

Reliability recommendations summary

Category Priority Recommendation
Availability Default and minimum virtual machine size recommendations
Auto Scale HDInsight on AKS Clusters
Monitoring How to integrate with Log Analytics
Monitoring with Azure Managed Prometheus and Grafana
Security Use NSG to restrict traffic to HDInsight on AKS

Availability zone support

Azure availability zones are at least three physically separate groups of datacenters within each Azure region. Datacenters within each zone are equipped with independent power, cooling, and networking infrastructure. In the case of a local zone failure, availability zones are designed so that if the one zone is affected, regional services, capacity, and high availability are supported by the remaining two zones.

Failures can range from software and hardware failures to events such as earthquakes, floods, and fires. Tolerance to failures is achieved with redundancy and logical isolation of Azure services. For more detailed information on availability zones in Azure, see Regions and availability zones.

Azure availability zones-enabled services are designed to provide the right level of reliability and flexibility. They can be configured in two ways. They can be either zone redundant, with automatic replication across zones, or zonal, with instances pinned to a specific zone. You can also combine these approaches. For more information on zonal vs. zone-redundant architecture, see Recommendations for using availability zones and regions.

Currently, Azure HDInsight on AKS doesn't support availability zone in its service offerings.

Disaster recovery and business continuity

Disaster recovery (DR) is about recovering from high-impact events, such as natural disasters or failed deployments that result in downtime and data loss. Regardless of the cause, the best remedy for a disaster is a well-defined and tested DR plan and an application design that actively supports DR. Before you begin to think about creating your disaster recovery plan, see Recommendations for designing a disaster recovery strategy.

When it comes to DR, Microsoft uses the shared responsibility model. In a shared responsibility model, Microsoft ensures that the baseline infrastructure and platform services are available. At the same time, many Azure services don't automatically replicate data or fall back from a failed region to cross-replicate to another enabled region. For those services, you are responsible for setting up a disaster recovery plan that works for your workload. Most services that run on Azure platform as a service (PaaS) offerings provide features and guidance to support DR and you can use service-specific features to support fast recovery to help develop your DR plan.

Currently, Azure HDInsight on AKS CP(Control Plane) service and databases are deployed across regions of Azure. Among these regions, the Azure HDInsight on AKS instances and database instances are isolated. When an outage at region level occurs, one region is down. All the resources in this region, including the RP(Resource Provider) of Azure HDInsight on AKS CP, database of Azure HDInsight on AKS CP and all customer clusters in this region. In this case, we can only wait for the regional outage to end. When the outage is recovered, the Azure HDInsight on AKS service is back and all customer clusters are back, too. It's possible that there may be some problems due to data inconsistency after the outage and needs a manual fix.

Multi-region disaster recovery

Azure HDInsight on AKS currently doesn't support cross-region failover. Improving business continuity using cross region high availability disaster recovery requires architectural designs of higher complexity and higher cost. Customers may choose to design their own solution to back up key data and job status across different regions.

Outage detection, notification, and management

  • Use Azure monitoring tools on HDInsight on AKS to detect abnormal behavior in the cluster and set corresponding alert notifications. You can enable Log Analytics in various ways and use managed Prometheus service with Azure Grafana dashboards for monitoring. For more information, see Azure Monitor integration.

  • Subscribe to Azure health alerts to be notified about service issues, planned maintenance, health and security advisories for a subscription, service, or region. Health notifications that include the issue cause and resolute ETA help you to better execute failover and failbacks. For more information, see Manage service health and Azure Service Health documentation.

Single-region disaster recovery

Currently, Azure HDInsight on AKS only has one standard service offering and clusters are created in a single-region geography. Customers are responsible for diaster recovery.

Capacity and proactive disaster recovery resiliency

Azure HDInsight on AKS and its customers operate under the Shared responsibility model, which means that the customer must address DR for the service they deploy and control. To ensure that recovery is proactive, customers should always predeploy secondaries because there's no guarantee of capacity at time of impact for those who haven't preallocated.

Unlike the original version of HDInsight, the Virtual Machines used in HDInsight on AKS clusters require the same Quota as Azure VMs. For more information, see Capacity planning.

To learn more about the items discussed in this article, see: