
Best practices for Bare Metal Machine operations

This article provides best practices for Bare Metal Machine (BMM) lifecycle management operations. The aim is to highlight common pitfalls and essential prerequisites.

Read important disclaimers

Caution

Don't perform any action against control or management plane servers without first consulting Microsoft support personnel. Doing so could affect the integrity of the Operator Nexus Cluster.

Important

Multiple disruptive command requests against Kubernetes Control Plane (KCP) nodes are rejected. This check maintains the integrity of the Nexus Cluster instance and prevents multiple KCP nodes from becoming nonoperational at once due to simultaneous disruptive actions. A disruptive command is rejected if a disruptive action is already running against another KCP node, or if the full set of KCP nodes isn't available. If multiple nodes become nonoperational, the healthy quorum threshold of the Kubernetes Control Plane is broken.

The following actions are considered disruptive to Bare Metal Machines (BMM):

  • Power off a BMM
  • Restart a BMM
  • Make a BMM unschedulable (cordon with evacuate, drains the node)
  • Reimage a BMM
  • Replace a BMM

The remaining actions are nondisruptive (a minimal sketch of the two cordon variants follows the list below):

  • Start a BMM
  • Make a BMM unschedulable (cordon without evacuate, doesn't drain the node)
  • Make a BMM schedulable (uncordon)
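
Here's that cordon sketch, assuming the CLI variables collected in the Prerequisites section that follows. The --evacuate flag controls whether workloads are drained from the node.

# Disruptive: drain workloads off the node before maintenance
az networkcloud baremetalmachine cordon \
  --evacuate "True" \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION

# Nondisruptive: mark the node unschedulable without draining existing workloads
az networkcloud baremetalmachine cordon \
  --evacuate "False" \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION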

Prerequisites

  1. Install the latest version of the appropriate CLI extensions.
  2. Request access to run the Azure Operator Nexus network fabric (NF) and network cloud CLI extension commands.
  3. Sign in to the Azure CLI and select the subscription where the cluster is deployed.
  4. Collect the following information:
    • Subscription ID (SUBSCRIPTION)
    • Cluster name (CLUSTER)
    • Resource group (CLUSTER_RG)
    • Managed resource group (CLUSTER_MRG), where the Bare Metal Machine (BMM) resources reside
    • BareMetal Machine Name (BMM_NAME) that requires lifecycle management operations
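
As a convenience, you can capture these values as shell variables before running the commands in this article. The values shown are placeholders; substitute your own.

# Placeholder values - replace with your environment's details
SUBSCRIPTION="<subscription-id>"
CLUSTER="<cluster-name>"
CLUSTER_RG="<cluster-resource-group>"
CLUSTER_MRG="<cluster-managed-resource-group>"
BMM_NAME="<bare-metal-machine-name>"

# Sign in and select the subscription
az login
az account set --subscription $SUBSCRIPTION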

Identify the best-fit corrective approach

Troubleshooting technical problems requires a systematic approach. One effective method is to start with the least invasive solution and, if necessary, work your way up to more complex and potentially disruptive measures. Keep in mind that these troubleshooting methods aren't always effective for every scenario, and other factors might require a different approach. For this reason, it's essential to understand the available options when troubleshooting a Bare Metal Machine failure so you can determine the most appropriate corrective action.

General advice while troubleshooting

  • Familiarize yourself with the relevant documentation, including troubleshooting guides and how-to articles. Always refer to the latest documentation to stay informed about best practices and updates.
  • Avoid repeated failed operations by first attempting to identify the root cause of the failure before retrying the operation. Perform retry attempts in incremental steps to isolate and address specific issues.
  • Wait for Azure CLI commands to run to completion and validate the state of the Bare Metal Machine resource before executing other steps.
  • Verify that the firmware and software versions are up-to-date before a new greenfield deployment to prevent compatibility issues between hardware and software versions. For more information about firmware compatibility, see Operator Nexus Platform Prerequisites.
  • Check that the iDRAC credentials are correct and that the Bare Metal Machine is powered on.

Look at general network connectivity health

Ensure stable network connectivity to avoid interruptions during the process. Ignoring network stability could cause operations to fail and leave a Bare Metal Machine in an error or degraded state.

A quick look at the Cluster resource's clusterConnectionStatus property serves as one indicator of network connectivity health.

az networkcloud cluster show \
  -g $CLUSTER_RG \
  -n $CLUSTER \
  --subscription $SUBSCRIPTION \
  --query "clusterConnectionStatus" \
  -o table

Result
---------
Connected

Take a deeper look at the Network Fabric resources by checking their statuses, alerts, and metrics; see the related Network Fabric articles for details.

Evaluate any Bare Metal Machine warnings or degraded conditions that could indicate the need to resolve hardware, network, or server configuration problems. For more information, see Troubleshoot Degraded Status Errors on Bare Metal Machines and Troubleshoot Bare Metal Machine Warning Status.
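
As a quick way to spot such conditions across the cluster, the following illustrative query lists each Bare Metal Machine's status fields (the selected fields are an assumption; adjust them to your needs):

az networkcloud baremetalmachine list \
  -g $CLUSTER_MRG \
  --subscription $SUBSCRIPTION \
  --query "[].{name:name, detailedStatus:detailedStatus, detailedStatusMessage:detailedStatusMessage, readyState:readyState}" \
  -o table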

Determine if firmware update jobs are running

Validate that there are no running firmware upgrade jobs through the BMC before initiating a replace or reimage operation. Interrupting an ongoing firmware upgrade can leave the Bare Metal Machine in an inconsistent state.

  • You can view the job queue in the iDRAC GUI, or use the run-read-command racadm jobqueue view to determine whether any firmware upgrade jobs are running.
  • For more information about the run-read-command feature, see Bare Metal Run-Read Execution.
az networkcloud baremetalmachine run-read-command \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION \
  --limit-time-seconds 60 \
  --commands "[{command:'nc-toolbox nc-toolbox-runread racadm jobqueue view'}]" \
  --output-directory .

Here's an example output from the racadm jobqueue view command that shows a firmware update in progress:

[Job ID=JID_833540920066]
Job Name=Firmware Update: iDRAC
Status=Downloading
Start Time= [Not Applicable]
Expiration Time= [Not Applicable]
Message= [RED001: Job in progress.]
Percent Complete= [50%]

Here's an example output from the racadm jobqueue view command showing a typical happy path, with no firmware update jobs in progress:

-------------------------JOB QUEUE------------------------
[Job ID=JID_429400224349]
Job Name=Configure: Import Server Configuration Profile
Status=Completed
Scheduled Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
Actual Start Time=[Tue, 25 Mar 2025 17:00:22]
Actual Completion Time=[Tue, 25 Mar 2025 17:00:32]
Message=[SYS053: Successfully imported and applied Server Configuration Profile.]
Percent Complete=[100]
----------------------------------------------------------
[Job ID=JID_429400338344]
Job Name=Export: Server Configuration Profile
Status=Completed
Scheduled Start Time=[Not Applicable]
Expiration Time=[Not Applicable]
Actual Start Time=[Tue, 25 Mar 2025 17:00:33]
Actual Completion Time=[Tue, 25 Mar 2025 17:00:58]
Message=[SYS043: Successfully exported Server Configuration Profile]
Percent Complete=[100]

Monitor progress using run-read-command

In version 2506.2 and later, you can monitor the progress of long-running Bare Metal Machine actions using a run-read-command.

  • Some long running actions such as Replace or Reimage are composed of multiple steps, for example, Hardware Validation, Deprovisioning, or Provisioning.
  • The following run-read-command shows how to view the different steps in each action, and the progress or status of each step including any potential errors.
  • This information is available on the BareMetalMachine Kubernetes resource during and after the action.
  • For more information about the run-read-command feature, see BareMetal Run-Read Execution.

Example run-read-command to view action progress on Bare Metal Machine rack2compute08:

az networkcloud baremetalmachine run-read-command \
  -g <ResourceGroup_Name> \
  -n <Control Node BMM Name> \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" \
  --output-directory .

Example output for a Replace action:

[
  {
    "correlationId": "961a6154-4342-4831-9693-27314671e6a7",
    "endTime": "2025-05-15T21:20:44Z",
    "startTime": "2025-05-15T20:16:19Z",
    "status": "Completed",
    "stepStates": [
      {
        "endTime": "2025-05-15T20:25:51Z",
        "name": "Hardware Validation",
        "startTime": "2025-05-15T20:16:19Z",
        "status": "Completed"
      },
      {
        "endTime": "2025-05-15T20:26:21Z",
        "name": "Deprovisioning",
        "startTime": "2025-05-15T20:25:51Z",
        "status": "Completed"
      },
      {
        "endTime": "2025-05-15T21:20:44Z",
        "name": "Provisioning",
        "startTime": "2025-05-15T20:26:21Z",
        "status": "Completed"
      }
    ],
    "type": "Microsoft.NetworkCloud/bareMetalMachines/replace"
  }
]

Best practices for a Bare Metal Machine reimage

The Bare Metal Machine (BMM) reimage action is explained in Bare Metal Machine Lifecycle Management Commands, and scenario procedures are described in Troubleshoot Azure Operator Nexus Server Problems.

Warning

Don't run more than one baremetalmachine replace or reimage command at the same time for the same Bare Metal Machine (BMM) resource. Executing a replace at the same time as a reimage leaves servers in a nonoperational state. Make sure any replace or reimage on the BMM completes fully before starting another one. Additionally, avoid executing a reimage action on a BMM that just completed a replace action unless a specific maintenance operation requires it.

You can restore the operating system runtime version on a Bare Metal Machine by executing the reimage operation. A Bare Metal Machine reimage can be both time-saving and reliable for resolving issues or restoring the operating system software to a known-good state. This process redeploys the runtime image on the target Bare Metal Machine and executes the steps to rejoin the cluster with the same identifiers. The reimage action is designed to interact with the operating system partition, leaving the virtual machines' local storage unchanged.
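
A minimal sketch of invoking the reimage action, assuming the variables from the Prerequisites section and that the node is already prepared as described in the referenced scenario procedures:

az networkcloud baremetalmachine reimage \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION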

Important

Avoid manual or automated changes to the Bare Metal Machine's file system (also known as "break glass" changes). The reimage action is required to restore Microsoft support, and any changes made to the Bare Metal Machine are lost when the node is restored to its expected state.

Preconditions and validations before a Bare Metal Machine reimage

Before initiating any reimage operation, confirm that the Bare Metal Machine is in its expected state and that no other disruptive operation is running against it.
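
An illustrative way to check the machine's current state before the reimage (the fields queried here are an assumption, not an exhaustive precondition list):

az networkcloud baremetalmachine show \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION \
  --query "{detailedStatus:detailedStatus, powerState:powerState, readyState:readyState, cordonStatus:cordonStatus, provisioningState:provisioningState}" \
  -o table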

Best practices for a Bare Metal Machine replace

The Bare Metal Machine replace action is explained in Bare Metal Machine Lifecycle Management Commands, and scenario procedures are described in Troubleshoot Azure Operator Nexus Server Problems.

Warning

Don't run more than one baremetalmachine replace or reimage command at the same time for the same Bare Metal Machine (BMM) resource. Executing a replace at the same time as a reimage leaves servers in a nonoperational state. Make sure any replace or reimage on the BMM completes fully before starting another one. Additionally, avoid executing a reimage action on a BMM that just completed a replace action unless a specific maintenance operation requires it.

Hardware failures are a normal occurrence over the life of a server, and component replacements might be necessary to restore functionality and ensure continued operation. A replace operation is typically executed after a hardware maintenance or repair event; the following sections describe which repairs require it. When multiple hardware components fail on the server, make the necessary repairs for all of them before executing a single Bare Metal Machine replace operation.

Important

With the 2024-07-01 GA API version, the RAID controller is reset during a Bare Metal Machine replace, wiping all data from the server's virtual disks. Baseboard Management Controller (BMC) virtual disk alerts triggered during the replace can be ignored unless there are additional physical disk or RAID controller alerts.

Preconditions and validations before a Bare Metal Machine replace

Before initiating any replace operation, confirm that all required hardware repairs are complete, that no firmware update jobs are running, and that no other disruptive operation is running against the machine.
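
Because hardware repairs can change the machine's identifiers (as described in the next sections), it can help to record the values currently registered on the Bare Metal Machine resource before the repair. This query is illustrative:

az networkcloud baremetalmachine show \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION \
  --query "{serialNumber:serialNumber, bootMacAddress:bootMacAddress, bmcMacAddress:bmcMacAddress, machineName:machineName}" \
  -o table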

Resolve hardware validation issues

When a Bare Metal Machine is marked with failed hardware validation, it might indicate that physical repairs are needed. It's crucial to identify and address these repairs before performing a Bare Metal Machine replace. A hardware validation process is invoked as part of the replace operation to ensure the physical host's integrity before deploying the OS image. The Bare Metal Machine can't provision successfully while it continues to have hardware validation failures, so it fails to complete the necessary setup steps to become operational and join the cluster. Ensure all hardware validation issues are cleared before the next replace action.

To understand hardware validation results, read through the article Troubleshoot Hardware Validation Failure.

Bare Metal Machine replace isn't required

Some repairs don't require a Bare Metal Machine replace. For example, a replace operation isn't required when you're performing a physical hot-swappable power supply repair, because the Bare Metal Machine host continues to function normally after the repair. However, if the Bare Metal Machine failed hardware validation, a Bare Metal Machine replace is required even after the hot-swappable repairs are done. Examine the Bare Metal Machine status messages to determine if hardware validation failures or other degraded conditions are present.

Other repairs of this type might be:

  • CPU
  • Dual In-Line Memory Module (DIMM)
  • Fan
  • Expansion board riser
  • Transceiver
  • Ethernet or fiber cable replacement

Bare Metal Machine replace is required

After components such as the motherboard or a Network Interface Card (NIC) are replaced, the Bare Metal Machine's MAC addresses change; however, the iDRAC IP address and hostname remain the same. Because motherboard changes result in MAC address changes, a Bare Metal Machine replace is required.
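
For illustration, here's a sketch of a replace command that supplies updated identifiers recorded after such a repair. All values are placeholders, and the authoritative parameters and procedure are in the Bare Metal Machine Lifecycle Management Commands article:

az networkcloud baremetalmachine replace \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION \
  --bmc-credentials password="<bmc-password>" username="<bmc-username>" \
  --bmc-mac-address "<new-bmc-mac-address>" \
  --boot-mac-address "<new-boot-mac-address>" \
  --machine-name "<os-hostname>" \
  --serial-number "<serial-number>"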

A replace operation is required to bring the Bare Metal Machine back into service when you're performing the following physical repairs:

  • Backplane
  • System board
  • SSD disk
  • PERC/RAID adapter
  • Mellanox Network Interface Card (NIC)
  • Broadcom embedded NIC

Check statuses after a Bare Metal Machine replace operation

After the Bare Metal Machine replace operation completes successfully, ensure that the provisioningState is Succeeded and the readyState is True. Only then should you proceed to execute the uncordon operation so that the Bare Metal Machine rejoins the workload-schedulable node pool.
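
A minimal sketch of checking these fields and then returning the node to service, assuming the variables from the Prerequisites section:

az networkcloud baremetalmachine show \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION \
  --query "{provisioningState:provisioningState, readyState:readyState, detailedStatus:detailedStatus}" \
  -o table

# Once the machine reports a healthy state, make it schedulable again
az networkcloud baremetalmachine uncordon \
  -g $CLUSTER_MRG \
  -n $BMM_NAME \
  --subscription $SUBSCRIPTION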

Request support

If you still have questions, contact support. For more information about Support plans, see Azure Support plans.
