Ensuring control plane resiliency with Operator Nexus Service
The Nexus service is engineered to uphold control plane resiliency across various compute rack configurations.
In instances with three or more compute racks, Operator Nexus keeps three Kubernetes control plane (KCP) nodes active and also maintains one spare node. Where possible, these nodes are placed on different racks to preserve control plane resiliency.
Tip
The Kubernetes control plane is a set of components that manage the state of a Kubernetes cluster, schedule workloads, and respond to cluster events. It includes the API server, etcd storage, scheduler, and controller managers.
The remaining management nodes host various operators that run the platform software, along with other components that provide supporting capabilities for monitoring, storage, and networking.
During runtime upgrades, Operator Nexus upgrades the control plane nodes sequentially, which preserves resiliency throughout the upgrade process.
Three compute racks:
Rack 1 | Rack 2 | Rack 3 |
---|---|---|
KCP | KCP | KCP |
KCP-spare | MGMT | MGMT |
Four or more compute racks:
Rack 1 | Rack 2 | Rack 3 | Rack 4 |
---|---|---|---|
KCP | KCP | KCP | KCP-spare |
MGMT | MGMT | MGMT | MGMT |
In instances with fewer than three compute racks, Operator Nexus maintains an active control plane node and, if available, a spare control plane node. For example, a two-rack configuration has one active Kubernetes control plane (KCP) node and one spare node.
Two compute racks:
Rack 1 | Rack 2 |
---|---|
KCP | KCP-spare |
MGMT | MGMT |
Single compute rack:
Operator Nexus supports control plane resiliency in single-rack configurations by running three management nodes within the rack. For example, a single-rack configuration with three management servers provides three active control plane nodes, which preserves resiliency within the rack.
Rack 1 |
---|
KCP |
KCP |
KCP |
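The rack configurations in the tables above can be summarized as a small lookup. The following Python sketch is illustrative only: `ControlPlaneCounts` and `control_plane_counts` are hypothetical names, not part of any Operator Nexus API, and the numbers are taken directly from the tables and text in this article.

```python
from typing import NamedTuple


class ControlPlaneCounts(NamedTuple):
    """Hypothetical summary type: active and spare KCP nodes per instance."""
    active: int
    spare: int


def control_plane_counts(compute_racks: int) -> ControlPlaneCounts:
    """Return the KCP node counts this article describes for a rack count."""
    if compute_racks < 1:
        raise ValueError("an instance needs at least one compute rack")
    if compute_racks == 1:
        # Single rack with three management servers: three active KCP nodes.
        return ControlPlaneCounts(active=3, spare=0)
    if compute_racks == 2:
        # Two racks: one active KCP node and one spare.
        return ControlPlaneCounts(active=1, spare=1)
    # Three or more racks: three active KCP nodes plus one spare.
    return ControlPlaneCounts(active=3, spare=1)


for racks in (1, 2, 3, 4):
    print(racks, control_plane_counts(racks))
```

The sketch covers node counts only. Placement across racks follows the tables, and the single-rack case assumes the three management servers described in this article.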
In disaster situations where the control plane loses quorum, the Kubernetes API is impacted across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and to communicate across racks.
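To make "loses quorum" concrete, the sketch below computes how many control plane members can fail before quorum is lost. It assumes the usual majority rule used by etcd (more than half of the members must be available). The function names are hypothetical and exist only for illustration.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum_size(members)


# Compare a single active control plane node with three active nodes.
for members in (1, 3):
    print(
        f"{members} active: quorum={quorum_size(members)}, "
        f"tolerated failures={tolerated_failures(members)}"
    )
```

Under this assumption, three active control plane nodes tolerate the loss of one node, while a single active node tolerates none.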