Troubleshoot control plane quorum loss

Follow the steps in this troubleshooting article when multiple control plane nodes are offline or unavailable.

Prerequisites

  • Install the latest version of the appropriate Azure CLI extensions.
  • Gather the following information:
    • Subscription ID
    • Cluster name and resource group
    • Bare-metal machine name
  • Ensure that you're signed in by using az login.
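
The prerequisites above can be sketched as a small shell function. This is a minimal sketch, not an official setup script: the function name prepare_cli is hypothetical, and it assumes that networkcloud is the Azure CLI extension that provides the az networkcloud commands used in this article.

```shell
# Hypothetical helper: sign in, select the subscription, and install or
# update the Azure CLI extension assumed to provide "az networkcloud".
prepare_cli() {
  subscription_id="$1"
  az login &&
    az account set --subscription "$subscription_id" &&
    az extension add --name networkcloud --upgrade
}

# Usage: prepare_cli "<Subscription_ID>"
```

Each command runs only if the previous one succeeds, so a failed sign-in stops the sequence before the subscription is changed.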

Symptoms

  • The Kubernetes API isn't available.
  • Multiple control plane nodes are offline or unavailable.

Procedure

  1. Identify the Azure Operator Nexus management nodes:

    • To identify the management nodes, run az networkcloud baremetalmachine list -g <ResourceGroup_Name>.

    • Sign in to the identified server.

    • Ensure that the ironic-conductor service is present on this node by running sudo crictl ps -a | grep -i ironic-conductor. Here's example output:

      testuser@<servername> [ ~ ]$ sudo crictl ps -a |grep -i ironic-conductor
      <id>       <id>       6 hours ago       Running       ironic-conductor       0       <id>
      
  2. Determine the integrated Dell remote access controller (iDRAC) IP of the server:

    • Run the command az networkcloud cluster list -g <ResourceGroup_Name>.

    • The output of the command is JSON. The iDRAC IP is the host portion of the bmcConnectionString value. Here's example output:

      {
              "bmcConnectionString": "redfish+https://xx.xx.xx.xx/redfish/v1/Systems/System.Embedded.1",
              "bmcCredentials": {
                "username": "<username>"
              },
              "bmcMacAddress": "<bmcMacAddress>",
              "bootMacAddress": "<bootMacAddress>",
              "machineDetails": "extraDetails",
              "machineName": "<machineName>",
              "rackSlot": <rackSlot>,
              "serialNumber": "<serialNumber>"
      },
      
  3. Access the iDRAC graphical user interface (GUI) by entering the iDRAC IP in your browser, and then shut down the affected management servers.

    Screenshot that shows an iDRAC GUI and the button to perform a graceful shutdown.

  4. When all affected management servers are down, turn on the servers by using the iDRAC GUI.

    Screenshot that shows an iDRAC GUI and the button to perform the power command.

The servers should now be restored.
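
For step 2, the iDRAC address can be extracted from a bmcConnectionString value rather than read from the JSON by eye. The following is a minimal sketch that uses only shell parameter expansion. The function name bmc_host and the sample address 10.0.0.5 are illustrative assumptions, not values from a real cluster.

```shell
# Hypothetical helper: print the host portion of a BMC connection string
# such as "redfish+https://<ip>/redfish/v1/Systems/System.Embedded.1".
bmc_host() {
  rest="${1#*://}"            # drop the scheme, up to and including "://"
  printf '%s\n' "${rest%%/*}" # drop the path, from the first "/" onward
}

# Sample value for illustration only.
bmc_host "redfish+https://10.0.0.5/redfish/v1/Systems/System.Embedded.1"
# prints 10.0.0.5
```

Paste the bmcConnectionString value from your own output in place of the sample string, and then use the printed address in the browser in step 3.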