Ensuring control plane resiliency with Operator Nexus Service

2025-06-03

The Nexus service is engineered to uphold control plane resiliency across various compute rack configurations.

Instances with three or more compute racks

Operator Nexus ensures the availability of three active Kubernetes control plane (KCP) nodes in instances with three or more compute racks. For configurations exceeding two compute racks, an extra spare node is also maintained. These nodes are strategically distributed across different racks to guarantee control plane resiliency, when possible.

Tip

The Kubernetes control plane is a set of components that manage the state of a Kubernetes cluster, schedule workloads, and respond to cluster events. It includes the API server, etcd storage, scheduler, and controller managers.

The remaining management nodes contain various operators which run the platform software and other components performing support capabilities for monitoring, storage, and networking.

During runtime upgrades, Operator Nexus implements a sequential upgrade of the control plane nodes which preserves resiliency throughout the upgrade process.

Three compute racks:

KCP = Kubernetes Control Plane Node MGMT = Management Node Pool Node

Rack 1	Rack 2	Rack 3
KCP	KCP	KCP
KCP-spare	MGMT	MGMT

Four or more compute racks:

Rack 1	Rack 2	Rack 3	Rack 4
KCP	KCP	KCP	KCP-spare
MGMT	MGMT	MGMT	MGMT

Instances with less than three compute racks

Operator Nexus maintains an active control plane node and, if available, a spare control plane instance. For instance, a two-rack configuration has one active Kubernetes Control Plane (KCP) node and one spare node.

Two compute racks:

Rack 1	Rack 2
KCP	KCP-spare
MGMT	MGMT

Single compute rack:

Operator Nexus supports control plane resiliency in single rack configurations by having three management nodes within the rack. For example, a single rack configuration with three management servers provides an equivalent number of active control planes to ensure resiliency within a rack.

Rack 1
KCP
KCP
KCP

Resiliency implications of lost quorum

In disaster situations when the control plane loses quorum, there are impacts to the Kubernetes API across the instance. This scenario can affect a workload's ability to read and write Custom Resources (CRs) and talk across racks.

Determining Control Plane Role

Troubleshooting failed Control Plane Quorum

Share via

Ensuring control plane resiliency with Operator Nexus Service

Instances with three or more compute racks

Instances with less than three compute racks

Resiliency implications of lost quorum

Related Links

Feedback

Additional resources