Edit

Share via


Run emergency bare metal actions outside of Azure using nexusctl

This article describes the nexusctl utility, which can be used in break-glass (emergency) situations to run simple actions on BareMetal Machines (BMM) without using the Azure console or command-line interface (CLI).

Caution

Don't perform any action against control or management plane servers without first consulting with Microsoft support personnel, doing so could affect the integrity of the Operator Nexus Cluster.

Important

Multiple disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected. This check is done to maintain the integrity of the Nexus Cluster instance and avoid multiple KCP nodes become nonoperational at once due to simultaneous disruptive actions. Rejected disruptive action commands can be due to either already running against another KCP node or if the full KCP isn't available. If multiple nodes become nonoperational, it breaks the healthy quorum threshold of the Kubernetes Control Plane.

The actions listed are considered disruptive to BareMetal Machines (BMM):

  • Power off a BMM
  • Restart a BMM
  • Make a BMM unschedulable (cordon with evacuate, drains the node)
  • Reimage a BMM
  • Replace a BMM

Leaving only the nondisruptive actions:

  • Start a BMM
  • Make a BMM unschedulable (cordon without evacuate, doesn't drain node)
  • Make a BMM schedulable (uncordon)

Warning

When the BMM is provisioned and has joined the Cluster, only use the Az CLI BMM commands to change the powerState. The nexusctl command should only be used for BMM that are not part of the Nexus Cluster (have not been provisioned), or the access to the server with the Az CLI is not possible.

Scenarios that may need the use of nexusctl:

  • If the BMM is not is not joined to the cluster, the only method would be to reboot or power on/off using nexusctl.
  • A BMM that has issues such as being hung up during boot up
  • A firmware issue during the deployment (where the BMM is stuck in the IPA bootloader)

Prerequisites

  • A BareMetalMachineKeySet must be available to allow ssh access to the bare metal machines. The user must have superuser privilege level.
  • The platform Kubernetes must be up and running on site.

Overview

nexusctl is a stand-alone program that can be run using nc-toolbox from an ssh session on any control-plane node. Since nexusctl is contained in the nc-toolbox-breakglass container image and isn't installed directly on the host, it must be run with a command-line like:

sudo nc-toolbox nc-toolbox-breakglass nexusctl <command> [subcommand] [options]

(nc-toolbox must always be run as root or with sudo.)

Like most other command-line programs, the --help option can be used with any command or subcommand to see more information:

sudo nc-toolbox nc-toolbox-breakglass nexusctl --help
sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal --help
sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --help

etc.

Note

There is no bulk execution against multiple machines. Commands are executed on a machine by machine basis.

Power off a bare metal machine

Important

Powering off a KCP node using nexusctl is considered disruptive. If the KCP is provisioned and part of the Nexus Cluster, doing a power-off action with nexusctl could affect the integrity of the Operator Nexus Cluster.

A single bare metal machine can be powered off by connecting to a control-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --name <machine name>

If the command is accepted, nexusctl responds with another command line that can be used to view the status of the long-running operation. Prefix this command with sudo nc-toolbox nc-toolbox-breakglass, as follows:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --status --name <machine name> --operation-id <operation-id>

The status is blank until the operation completes and reaches either a "succeeded" or "failed" state. While it's blank, assume that the operation is still in progress.

Start a bare metal machine

A single bare metal machine can be started by connecting to a control-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal start --name <machine name>

If the command is accepted, nexusctl responds with another command line that can be used to view the status of the long-running operation. Prefix this command with sudo nc-toolbox nc-toolbox-breakglass, as follows:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal start --status --name <machine name> --operation-id <operation-id>

The status is blank until the operation completes and reaches either a "succeeded" or "failed" state. While it's blank, assume that the operation is still in progress.

Unmanage a bare metal machine (set to unmanaged state)

A single bare metal machine can be switched to an unmanaged state by connecting to a control-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal unmanage --name <machine name>

While in an unmanaged state, no actions are permitted for that machine, except for returning it to a managed state (see next section). This function can be used to keep a bare metal machine powered off if it's in a rebooting crash loop.

unmanage isn't a long-running command, so there's no associated command to check operation status.

Manage a bare metal machine (set to managed state)

A single bare metal machine can be switched to a managed state by connecting to a control-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal manage --name <machine name>

manage isn't a long-running command, so there's no associated command to check operation status.