Operations management considerations for Azure Kubernetes Service

Kubernetes is a relatively new, rapidly evolving technology with an impressive ecosystem. As a result, it can be challenging to manage and protect.

Operations baseline for AKS

You can work toward operational excellence and customer success by properly designing your Azure Kubernetes Service (AKS) solution with management and monitoring in mind.

Design considerations

Consider the following factors:

  • Be aware of AKS limits. Use multiple AKS instances to scale beyond those limits.
  • Be aware of ways to isolate workloads logically within a cluster and physically in separate clusters.
  • Be aware of ways to control resource consumption by workloads.
  • Be aware of ways to help Kubernetes understand the health of your workloads.
  • Be aware of the available virtual machine (VM) sizes and the trade-offs between them. Larger VMs can handle more load, and AKS reserves a smaller percentage of their resources, which leaves more capacity for your workloads. Smaller VMs are easier to replace when a node is unavailable during planned or unplanned maintenance. Also, be aware of node pools, whose VMs run in a scale set and which allow VMs of different sizes in the same cluster.
  • Be aware of ways to monitor and log AKS. Kubernetes consists of various components, and monitoring and logging should provide insight into its health, trends, and potential issues.
  • Be aware that Kubernetes and the applications running on it can generate many events. Building on monitoring and logging, alerts help you distinguish log entries kept for historical purposes from those that require immediate action.
  • Be aware of the updates and upgrades that you need to perform. At the Kubernetes level, there are major, minor, and patch versions. You should apply these updates to remain supported according to the upstream Kubernetes policy. At the worker host level, OS kernel patches might require a reboot, which you are responsible for performing, as you are for upgrades to new OS versions. In addition to manually upgrading a cluster, you can set an auto-upgrade channel on your cluster.
  • Be aware of the resource limitations of the cluster and of individual workloads.
  • Be aware of the differences between the horizontal pod autoscaler and the cluster autoscaler.
  • Consider securing traffic between pods using network policies and the Azure policies plug-in.
  • To help troubleshoot your application and services running on AKS, you might need to view the logs generated by control plane components. You might want to enable resource logs for AKS since logging is not enabled by default.
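
The point above about larger VMs leaving proportionally more capacity for workloads can be illustrated with a small calculation. The following is a minimal sketch under stated assumptions: the tier boundaries and rates in `ASSUMED_TIERS` follow the regressive kube-reserved memory rates that AKS has documented for older Kubernetes versions, and the function names are hypothetical. Treat the numbers as illustrative and check the current AKS documentation before relying on them.

```python
# Assumed regressive reservation tiers: (upper bound in GiB, fraction reserved).
# These values are an illustration, not something read from an AKS API.
ASSUMED_TIERS = [(4, 0.25), (8, 0.20), (16, 0.10), (128, 0.06), (float("inf"), 0.02)]


def reserved_memory_gib(total_gib, tiers=ASSUMED_TIERS):
    """Return the memory (GiB) reserved under a tiered, regressive rate."""
    reserved = 0.0
    lower = 0.0
    for upper, rate in tiers:
        if total_gib <= lower:
            break
        # Only the slice of memory that falls inside this tier is charged at its rate.
        reserved += (min(total_gib, upper) - lower) * rate
        lower = upper
    return reserved


def reserved_share(total_gib, tiers=ASSUMED_TIERS):
    """Return the reserved memory as a fraction of the node's total memory."""
    return reserved_memory_gib(total_gib, tiers) / total_gib


if __name__ == "__main__":
    for size in (8, 16, 64):
        print(f"{size} GiB node: {reserved_share(size):.1%} reserved")
```

Under these assumed tiers, an 8 GiB node gives up roughly 22 percent of its memory to the reservation, while a 64 GiB node gives up roughly 9 percent. The sketch ignores CPU reservations and eviction thresholds, which also reduce allocatable capacity.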

Recommendations

  • Understand AKS limits.

  • Use logical isolation at the namespace level to separate applications, teams, environments, and business units. For background, see the guidance on multitenancy and cluster isolation. Node pools can also help when you need nodes with different specifications, and with maintenance such as Kubernetes upgrades. For details, see the guidance on multiple node pools.

  • Plan and apply resource quotas at the namespace level. If pods don't define resource requests and limits, use policies to reject the deployment. This rule doesn't apply to kube-system pods, because not all of them define requests and limits. Monitor resource usage and adjust quotas as needed. For background, see the guidance on basic scheduler features.

  • Add health probes to your pods. Make sure that the containers in your pods define livenessProbe, readinessProbe, and startupProbe settings. For details, see the guidance on AKS health probes.

  • Use VM sizes big enough to contain multiple container instances, so you get the benefits of increased density, but not so big that your cluster can't handle the workload of a failing node.

  • Use a monitoring solution. Azure Monitor container insights is set up by default and provides easy access to many insights. You can use Prometheus integration if you want to drill deeper or have experience using Prometheus. If you run your own monitoring application on AKS, also use Azure Monitor to monitor that application.

  • Use Azure Monitor container insights metric alerts to provide notifications when things need direct action.

  • Use the automatic node pool scaling feature (the cluster autoscaler) together with the horizontal pod autoscaler to meet application demand and absorb peak-hour loads.

  • Use Azure Advisor to get best practice recommendations on cost, security, reliability, operational excellence, and performance. Also, use Microsoft Defender for Cloud to prevent and detect threats like image vulnerabilities.

  • Use Azure Arc-enabled Kubernetes to manage non-AKS Kubernetes clusters in Azure using Azure Policy, Defender for Cloud, GitOps, and so on.

  • Use pod requests and limits to manage the compute resources within an AKS cluster. Requests tell the Kubernetes scheduler how much CPU and memory to set aside for a pod when placing it on a node, and limits cap what the pod's containers can consume.
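
The recommendations above on resource requests, limits, and health probes can be checked before a manifest ever reaches the cluster. The following is a minimal sketch, assuming the pod spec has already been loaded into a Python dictionary (for example, from a YAML manifest). The function `find_gaps` and the sample spec are hypothetical; the field names (`resources.requests`, `resources.limits`, `livenessProbe`, `readinessProbe`, `startupProbe`) are standard Kubernetes container fields. In a real cluster, enforcement would more likely be done by an admission policy than by a script like this.

```python
PROBE_FIELDS = ("livenessProbe", "readinessProbe", "startupProbe")


def find_gaps(pod_spec):
    """Return a list of human-readable problems found in a pod spec dict."""
    problems = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        resources = container.get("resources", {})
        # Every container should state CPU and memory for both requests and limits.
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            for resource in ("cpu", "memory"):
                if resource not in values:
                    problems.append(f"{name}: missing resources.{section}.{resource}")
        # Every container should define all three probe types.
        for probe in PROBE_FIELDS:
            if probe not in container:
                problems.append(f"{name}: missing {probe}")
    return problems


# Hypothetical example: resources are complete, but one probe is missing.
EXAMPLE_SPEC = {
    "containers": [
        {
            "name": "web",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"cpu": "500m", "memory": "512Mi"},
            },
            "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            "readinessProbe": {"httpGet": {"path": "/ready", "port": 8080}},
        }
    ]
}

if __name__ == "__main__":
    for problem in find_gaps(EXAMPLE_SPEC):
        print(problem)
```

For the sample spec, the only reported gap is the missing startupProbe on the `web` container. A check like this could run in a build pipeline so that incomplete manifests are caught before a cluster policy rejects them.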

Business continuity/disaster recovery to protect and recover AKS

Your organization needs to design suitable Azure Kubernetes Service (AKS) platform-level capabilities to meet its specific requirements. The application services that run on the platform have requirements for recovery time objective (RTO) and recovery point objective (RPO). There are multiple considerations to address for AKS disaster recovery. Your first step is to define a service-level agreement (SLA) for your infrastructure and application. Review the SLA for Azure Kubernetes Service (AKS), and see its SLA details section for information about monthly uptime calculations.
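
When you define an SLA, it helps to translate an availability percentage into a concrete downtime budget. The following is a minimal sketch of the generic uptime arithmetic, assuming a 30-day month and using hypothetical function names. The official SLA terms define exactly how downtime is measured, so treat this only as a planning aid.

```python
def monthly_uptime_percent(total_minutes, downtime_minutes):
    """Uptime as a percentage of the total minutes in the period."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def downtime_budget_minutes(sla_percent, days_in_month=30):
    """Minutes of downtime per month that still meet the given SLA percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.5, 99.9, 99.95):
        print(f"{sla}% allows about {downtime_budget_minutes(sla):.1f} minutes per month")
```

In a 30-day month, a 99.9 percent target leaves a little over 43 minutes of downtime, which is a useful number to compare against your RTO.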

Design considerations

Consider the following factors:

  • The AKS cluster should use multiple nodes in a node pool to provide the minimum level of availability for your application.

  • Set pod requests and limits. Setting these limits lets Kubernetes:

    • Efficiently give CPU and memory resources to the pods.
    • Have higher container density on a node.

    Limits can also increase reliability and reduce costs through better use of hardware.

  • AKS suitability for Availability Zones or availability sets.

    • Choose a region that supports Availability Zones.
    • Availability Zones can only be set when the node pool is created and can't be changed later. Multizone support only applies to node pools.
    • For complete zonal benefit, all service dependencies must also support zones. If a dependent service doesn't support zones, a zone failure could cause that service to fail.
    • Run multiple AKS clusters in different paired regions for higher availability beyond what Availability Zones can achieve. If an Azure resource supports geo-redundancy, provide the location where the redundant service has its secondary region.
  • You should know the guidelines for disaster recovery in AKS. Then consider whether they apply to the AKS clusters that you use for Azure Dev Spaces.

  • Consistently create backups for applications and data.

    • A stateless service can be replicated efficiently.
    • If you need to store state in the cluster, back up the data frequently to the paired region. Keep in mind that properly storing state in the cluster can be complicated.
  • Cluster update and maintenance.

    • Always keep your cluster up to date.
    • Be aware of the release and deprecation process.
    • Plan your updates and maintenance.
  • Network connectivity if a failover occurs.

    • Choose a traffic router that can distribute traffic across zones or regions, depending on your requirement. This architecture deploys Azure Load Balancer because it can distribute non-web traffic across zones.
    • If you need to distribute traffic across regions, consider using Azure Front Door.
  • Planned and unplanned failovers.

    • When setting up each Azure service, choose features that support disaster recovery. For example, this architecture enables Azure Container Registry for geo-replication. You can still pull images from the replicated region if a region goes down.
  • Maintain engineering DevOps capabilities to reach service-level goals.

  • Determine whether you need a financially backed SLA for your Kubernetes API server endpoint.
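
The considerations above about running multiple nodes and spreading them across Availability Zones can be made concrete with a worst-case estimate. The following is a minimal sketch with hypothetical function names. It assumes equally sized nodes that are spread as evenly as possible across zones, which might not match the actual placement in your cluster.

```python
def nodes_per_zone(node_count, zone_count):
    """Spread nodes as evenly as possible; returns a list of per-zone counts."""
    base, extra = divmod(node_count, zone_count)
    return [base + 1 if i < extra else base for i in range(zone_count)]


def surviving_share_after_zone_loss(node_count, zone_count):
    """Fraction of nodes left in the worst case, when the fullest zone fails."""
    worst_loss = max(nodes_per_zone(node_count, zone_count))
    return (node_count - worst_loss) / node_count


if __name__ == "__main__":
    for nodes in (3, 4, 6):
        share = surviving_share_after_zone_loss(nodes, 3)
        print(f"{nodes} nodes over 3 zones: {share:.0%} of nodes survive a zone failure")
```

With four nodes over three zones, one zone holds two nodes, so losing that zone removes half the capacity. An estimate like this helps you decide how much headroom the remaining nodes need to absorb the rescheduled pods.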

Design recommendations

The following are best practices for your design:

  • Use three nodes for the system node pool. For the user node pool, start with at least two nodes. If you need higher availability, set up more nodes.

  • Isolate your application from the system services by placing it in a separate node pool. This way, Kubernetes services run on dedicated nodes and don't compete with other services. Use tags, labels, and taints to identify the node pool on which to schedule your workload.

  • Regular upkeep of your cluster, for example, making timely updates, is crucial for reliability. Be mindful of the support window for Kubernetes versions on AKS and plan your updates. Also, monitoring the health of the pods through probes is recommended.

  • Whenever possible, remove service state from inside containers. Instead, use an Azure platform as a service (PaaS) that supports multiregion replication.

  • Specify pod resources. It's highly recommended that deployments specify pod resource requirements so that the scheduler can place each pod appropriately. Reliability degrades when pods can't be scheduled.

  • Set up multiple replicas in the deployment to handle disruptions like hardware failures. For planned events like updates and upgrades, a disruption budget can ensure the required number of pod replicas exist to handle the expected application load.

  • Your applications might use Azure Storage for their data. Because your applications are spread across multiple AKS clusters in different regions, you must keep the storage synced. Here are two common ways to replicate storage:

    • Infrastructure-based asynchronous replication
    • Application-based asynchronous replication
  • Estimate pod limits. Test and establish a baseline. Start with equal values for requests and limits. Then, gradually tune those values until you've established a threshold that can cause instability in the cluster. Pod limits can be specified in your deployment manifests.

    The built-in features provide a solution to the complex task of handling failures and disruptions in a service architecture. These configurations help simplify both design and deployment automation. When an organization has defined a standard for the SLA, RTO, and RPO, it can use services built into Kubernetes and Azure to achieve its business goals.

  • Set pod disruption budgets. This setting limits how many replicas in a deployment can be taken down at the same time during an update or upgrade event.

  • Enforce resource quotas on the service namespaces. The resource quota on a namespace ensures pod requests and limits are properly set on a deployment.

    • Setting resource quotas at the cluster level can cause problems when deploying partner services that don't have proper requests and limits.
  • Store your container images in Azure Container Registry and geo-replicate the registry to each AKS region.

  • Use the Uptime SLA to enable a financially backed, higher SLA for all clusters hosting production workloads. Uptime SLA guarantees 99.95% availability of the Kubernetes API server endpoint for clusters that use Availability Zones and 99.9% availability for clusters that don't use Availability Zones. Your nodes, node pools, and other resources are covered under their own SLAs. AKS offers a free tier with a service level objective (SLO) of 99.5% for its control plane components. Clusters without the Uptime SLA enabled shouldn't be used for production workloads.

  • Use multiple regions and peering locations for Azure ExpressRoute connectivity.

    If an outage affecting an Azure region or peering provider location occurs, a redundant hybrid network architecture can help ensure uninterrupted cross-premises connectivity.

  • Interconnect regions with global virtual network peering. If the clusters need to talk to each other, you can connect their virtual networks through virtual network peering. This technology provides high bandwidth across Microsoft's backbone network, even across different geographic regions.

  • Using a split TCP-based anycast protocol, Azure Front Door ensures that your end users promptly connect to the nearest Front Door point of presence. Other features of Azure Front Door include:

    • TLS termination
    • Custom domain
    • Web Application Firewall
    • URL rewrite
    • Session affinity

    Review the needs of your application traffic to learn which solution is the most suitable.
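
Returning to the pod disruption budget recommendation earlier in this list, the following minimal sketch shows the arithmetic behind a budget. The function names are hypothetical, and only absolute pod counts are handled, not percentages. The sketch mirrors the general idea that voluntary evictions are allowed only while enough healthy replicas remain.

```python
def allowed_disruptions_min_available(healthy_pods, min_available):
    """Pods that can be evicted while keeping at least min_available healthy."""
    return max(0, healthy_pods - min_available)


def allowed_disruptions_max_unavailable(desired_pods, healthy_pods, max_unavailable):
    """Pods that can be evicted when at most max_unavailable may be down."""
    required_healthy = desired_pods - max_unavailable
    return max(0, healthy_pods - required_healthy)


if __name__ == "__main__":
    # Five replicas, all healthy, and a budget that keeps four available.
    print(allowed_disruptions_min_available(5, 4))
    # One replica is already unhealthy, so no further eviction is allowed.
    print(allowed_disruptions_min_available(4, 4))
```

With five healthy replicas and a budget that requires four, a node drain during an upgrade can evict one pod at a time. If a replica is already unhealthy, the budget blocks further voluntary evictions until the workload recovers.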