Run emergency bare metal actions outside of Azure using nexusctl

This article describes the nexusctl utility, which can be used in break-glass (emergency) situations to run simple actions on bare metal machines without using the Azure portal or command-line interface (CLI).

Caution

Do not perform any action against management servers without first consulting with Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.

Important

Disruptive command requests against a Kubernetes Control Plane (KCP) node are rejected if another disruptive action is already running against a different KCP node or if the full KCP isn't available. This check maintains the integrity of the Nexus instance and ensures that simultaneous disruptive actions don't take multiple KCP nodes out of operation at once. If multiple nodes become non-operational, the Kubernetes Control Plane falls below its healthy quorum threshold.

Powering off a KCP node is the only nexusctl action considered disruptive in the context of this check.

Prerequisites

  • A BareMetalMachineKeySet must be available to allow ssh access to the bare metal machines. The user must have superuser privileges.
  • The platform Kubernetes must be up and running on site.

Overview

nexusctl is a stand-alone program that can be run using nc-toolbox from an ssh session on any control-plane or management-plane node. Because nexusctl is contained in the nc-toolbox-breakglass container image and isn't installed directly on the host, it must be run with a command line of the form:

sudo nc-toolbox nc-toolbox-breakglass nexusctl <command> [subcommand] [options]

(nc-toolbox must always be run as root or with sudo.)

As with most command-line programs, the --help option can be used with any command or subcommand to show more information:

sudo nc-toolbox nc-toolbox-breakglass nexusctl --help
sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal --help
sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --help

The same pattern applies to every other command and subcommand.
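
Because every invocation repeats the same prefix, a small shell function can reduce typing during an incident. The following is a minimal sketch and is not part of nexusctl: the function name nexusctl_bg and the NEXUSCTL_DRY_RUN variable are hypothetical and exist only for illustration. When NEXUSCTL_DRY_RUN is set to 1, the function prints the command line it would run instead of running it.

```shell
# Hypothetical convenience wrapper (not shipped with nexusctl).
# It prepends the fixed "sudo nc-toolbox nc-toolbox-breakglass nexusctl" prefix.
nexusctl_bg() {
    if [ "${NEXUSCTL_DRY_RUN:-0}" = "1" ]; then
        # Dry run: show the full command line without executing anything.
        printf 'sudo nc-toolbox nc-toolbox-breakglass nexusctl %s\n' "$*"
    else
        sudo nc-toolbox nc-toolbox-breakglass nexusctl "$@"
    fi
}

# Example: print (without running) the help command for the baremetal subcommand.
NEXUSCTL_DRY_RUN=1 nexusctl_bg baremetal --help
```

Printing the command first gives the operator a chance to review it before anything is executed, which may be useful in an emergency.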

Note

There is no bulk execution against multiple machines. Each command runs against one machine at a time.
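
Because each command targets a single machine, acting on several machines means issuing one command per machine. As a rough sketch of that pattern, the function below only prints one start command per machine name so that each line can be reviewed and run individually. The function name print_start_commands and the machine names machine-a and machine-b are placeholders invented for this example.

```shell
# Hypothetical helper: print (do not run) one start command per machine.
# nexusctl itself accepts a single --name per invocation.
print_start_commands() {
    for name in "$@"; do
        printf 'sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal start --name %s\n' "$name"
    done
}

# Placeholder machine names; substitute the real bare metal machine names.
print_start_commands machine-a machine-b
```

The sketch deliberately prints rather than executes, since the checks on disruptive actions described earlier are applied to each request separately.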

Power off a bare metal machine

A single bare metal machine can be powered off by connecting to a control-plane or management-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --name <machine name>

If the command is accepted, nexusctl responds with another command line that can be used to view the status of the long-running operation. Prefix this command with sudo nc-toolbox nc-toolbox-breakglass, as follows:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal power-off --status --name <machine name> --operation-id <operation-id>

The status is blank until the operation completes and reaches either a "succeeded" or "failed" state. While it's blank, assume that the operation is still in progress.
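
Since the status stays blank until the operation finishes, checking it usually means rerunning the status command until a result appears. The sketch below shows one way to automate that polling. It rests on assumptions that the source doesn't confirm: the function name poll_status is hypothetical, and the code assumes that the words "succeeded" or "failed" appear somewhere in the standard output of the status command. The exact output format of nexusctl may differ, so treat this as a starting point rather than a definitive implementation.

```shell
# Hypothetical helper: rerun a status command until its output mentions
# "succeeded" or "failed" (assumed wording), or until attempts run out.
# Usage: poll_status <interval-seconds> <max-attempts> <command> [args...]
poll_status() {
    interval=$1
    max_attempts=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Capture standard output only; a blank result means "still in progress".
        output=$("$@")
        case $(printf '%s' "$output" | tr '[:upper:]' '[:lower:]') in
            *succeeded*) echo "succeeded"; return 0 ;;
            *failed*)    echo "failed"; return 1 ;;
        esac
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$interval"
        fi
        attempt=$((attempt + 1))
    done
    echo "timed out"
    return 2
}

# Example (requires a real system, so it is left commented out). Replace the
# placeholders with the values from the status command that nexusctl printed:
# poll_status 30 20 sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal \
#     power-off --status --name <machine name> --operation-id <operation-id>
```

The return code distinguishes the three outcomes (0 for succeeded, 1 for failed, 2 for timed out), so the function can be combined with other shell logic. The same approach would apply to the status command for the start action.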

Start a bare metal machine

A single bare metal machine can be started by connecting to a control-plane or management-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal start --name <machine name>

If the command is accepted, nexusctl responds with another command line that can be used to view the status of the long-running operation. Prefix this command with sudo nc-toolbox nc-toolbox-breakglass, as follows:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal start --status --name <machine name> --operation-id <operation-id>

The status is blank until the operation completes and reaches either a "succeeded" or "failed" state. While it's blank, assume that the operation is still in progress.

Unmanage a bare metal machine (set to unmanaged state)

A single bare metal machine can be switched to an unmanaged state by connecting to a control-plane or management-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal unmanage --name <machine name>

While a machine is in an unmanaged state, no actions are permitted for it except returning it to a managed state (see next section). This function can be used to keep a bare metal machine powered off if it's stuck in a crash loop that keeps rebooting it.

unmanage isn't a long-running command, so there's no associated command to check operation status.

Manage a bare metal machine (set to managed state)

A single bare metal machine can be switched to a managed state by connecting to a control-plane or management-plane node via ssh and running the command:

sudo nc-toolbox nc-toolbox-breakglass nexusctl baremetal manage --name <machine name>

manage isn't a long-running command, so there's no associated command to check operation status.