Cluster design and operations

This article covers cluster configuration and network design. Learn how to future-proof scalability by automating infrastructure provisioning, which is the process of setting up the IT infrastructure that you need. Automated provisioning supports remote installation and sets up virtual environments. You'll also learn how to maintain high availability by planning for business continuity and disaster recovery.

Plan, train, and proof

As you get started, the checklist and Kubernetes resources below will help you plan the cluster design. By the end of this section, you'll be able to answer these questions:

  • Have you identified the networking design requirements for your cluster?
  • Do you have services with varying requirements? How many node pools are you going to use?


  • Identify network design considerations. Understand cluster network design considerations, compare network models, and choose the Kubernetes networking plug-in that fits your needs. For Azure Container Networking Interface (CNI) networking, size the address space from the number of nodes multiplied by the maximum pods per node (default of 30), and count one additional node, which is required during an upgrade. When choosing load balancer services, consider using an ingress controller if you have many services, so that you reduce the number of exposed endpoints. For Azure CNI, the service CIDR must be unique across the virtual network and all connected virtual networks to ensure appropriate routing.


  • Create multiple node pools. To support applications that have different compute or storage demands, you can configure your cluster with multiple node pools. For example, use additional node pools to provide GPUs for compute-intensive applications or access to high-performance SSD storage. For more information, see Create and manage multiple node pools for a cluster in Azure Kubernetes Service.

  • Decide on availability requirements. A minimum of two pods behind Azure Kubernetes Service ensures high availability of your application if there are pod failures or restarts. Use three or more pods to handle load during pod failures and restarts. For the cluster configuration, a minimum of two nodes in an availability set or virtual machine scale set is required to meet the service-level agreement of 99.95%. Use at least three pods to ensure pod scheduling during node failures and reboots.

    To provide a higher level of availability to your applications, clusters can be distributed across Availability Zones. These zones are physically separate datacenters within a given region. When the cluster components are distributed across multiple zones, your cluster can tolerate a failure in one of the zones. Your applications and management operations remain available even if an entire datacenter experiences an outage. For more information, see Create an Azure Kubernetes Service (AKS) cluster that uses Availability Zones.
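The Azure CNI address sizing described in the network design item above can be sketched as a short calculation. This is a minimal sketch, not an official sizing tool. It assumes each node consumes one IP address for itself plus one for each pod it can host, and that Azure reserves five addresses in every subnet. The function names and the 50-node example are hypothetical.

```python
def required_ip_addresses(node_count: int,
                          max_pods_per_node: int = 30,
                          upgrade_nodes: int = 1) -> int:
    """Estimate the IP addresses an Azure CNI cluster needs.

    Assumption: every node uses one address for itself and one for
    each pod it can host; `upgrade_nodes` covers the extra node that
    is added during an upgrade.
    """
    total_nodes = node_count + upgrade_nodes
    return total_nodes * (max_pods_per_node + 1)


def smallest_subnet_prefix(required: int, reserved: int = 5) -> int:
    """Return the longest IPv4 prefix whose subnet still fits `required`.

    Assumption: `reserved` addresses per subnet are unavailable
    (five is the count Azure reserves in each subnet).
    """
    for prefix in range(32, -1, -1):
        usable = 2 ** (32 - prefix) - reserved
        if usable >= required:
            return prefix
    raise ValueError("requirement does not fit in an IPv4 subnet")


if __name__ == "__main__":
    needed = required_ip_addresses(node_count=50)
    print(needed)                          # 1581
    print(smallest_subnet_prefix(needed))  # 21
```

With the default of 30 pods per node, a hypothetical 50-node cluster needs 51 × 31 = 1,581 addresses, which fits in a /21 subnet but not in a /22. Leave extra headroom if you expect to add node pools or raise the pods-per-node limit later.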

Go to production and apply infrastructure best practices

As you prepare the application for production, implement a minimum set of best practices. Use this checklist at this stage. By the end of this section, you'll be able to answer these questions:

  • Are you able to confidently redeploy the cluster infrastructure?
  • Have you applied resource quotas?


  • Automate cluster provisioning. With infrastructure as code, you can automate infrastructure provisioning to provide more resiliency during disasters and gain agility to quickly redeploy the infrastructure as needed. For more information, see Create a Kubernetes cluster with Azure Kubernetes Service using Terraform.

  • Plan for availability using pod disruption budgets. To maintain the availability of applications, define pod disruption budgets (PDBs) to ensure that a minimum number of pods are available in the cluster during hardware failures or cluster upgrades. To learn more, see Plan for availability using pod disruption budgets.

  • Enforce resource quotas on namespaces. Plan and apply resource quotas at the namespace level. Quotas can be set on compute resources, storage resources, and object count. For more information, see Enforce resource quotas.
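The pod disruption budget and resource quota items above could be expressed as Kubernetes manifests. The sketch below builds both as plain Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. The object names, the `team-a` namespace, the `app: web` label, and every quota value are hypothetical placeholders rather than recommendations. Note that Kubernetes honors a PDB for voluntary disruptions, such as node drains during an upgrade.

```python
import json


def pod_disruption_budget(name: str, namespace: str,
                          app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest.

    `min_available` is the number of matching pods that must stay
    running while nodes are drained.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def resource_quota(name: str, namespace: str, hard: dict) -> dict:
    """Build a v1 ResourceQuota manifest for one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"hard": dict(hard)},
    }


if __name__ == "__main__":
    # Hypothetical values: keep 2 of the "web" pods available.
    pdb = pod_disruption_budget("web-pdb", "team-a", "web", min_available=2)

    # Hypothetical limits covering the three quota types named above.
    quota = resource_quota("team-a-quota", "team-a", {
        "requests.cpu": "4",         # compute
        "requests.memory": "8Gi",    # compute
        "requests.storage": "100Gi", # storage
        "pods": "20",                # object count
    })

    for manifest in (pdb, quota):
        print(json.dumps(manifest, indent=2))
```

Each printed document can be saved to a file and applied with `kubectl apply -f <file>`. Keeping the manifests in code or in version control also supports the infrastructure-as-code goal from the first checklist item.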

Optimize and scale

Once the application is in production, how can you optimize your workflow and prepare your application and team to scale? Use the optimization and scaling checklist to prepare. By the end of this section, you'll be able to answer these questions:

  • Do you have a plan for business continuity and disaster recovery?
  • Can your cluster scale to meet application demands?
  • Are you able to monitor your cluster and application health and receive alerts?