Replace a network device in Azure Operator Nexus Network Fabric (NNF)

This article explains how to replace a faulty or underperforming network device in Azure Operator Nexus Network Fabric (NNF). It covers devices such as the Top of Rack (TOR) switch, Customer Edge (CE) switch, Network Packet Broker (NPB), and the Management Switch. The replacement is performed using the Return Material Authorization (RMA) process. This process is designed to minimize service disruption and safely reintegrate the new hardware into the fabric.

Scenarios for device replacement

Device replacement may be required in the following situations:

  • Inconsistent Performance (Flakiness): The device shows intermittent connectivity or performance degradation.

  • Hardware Failure: The device experiences critical hardware malfunctions that can't be fixed through standard troubleshooting.

  • Persistent Unreachability: The device is permanently unreachable despite repeated recovery attempts.

Prerequisites

To ensure a smooth and timely RMA process, verify the following prerequisites before you start the replacement:

  • Azure CLI is installed and properly configured

  • Permissions are granted to manage Microsoft.ManagedNetworkFabric resources

  • Replacement device is powered on and physically connected

  • Replacement device supports Zero Touch Provisioning (ZTP)

  • If the faulty device is rebooting continuously because of hardware issues, power it off before initiating the RMA process. Otherwise, the device disable action can fail.

  • Before initiating the RMA deployment, perform the following checks:

    • Interface Speed Validation

      • Confirm that the ma1 interface speed is set to 100 Mbps or higher.

      • If the speed is below 100 Mbps, update it accordingly to prevent delays or potential timeouts during the RMA process.

    • Device Storage Check

      • Ensure the device has a minimum of 3 GB of free space available.

      • This space is required to download and stage the necessary image files.
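The two pre-RMA checks above can be sketched as a small shell helper. This is only an illustration: `rma_precheck` is a hypothetical function, not part of the Azure CLI, and it assumes the operator has already read the ma1 interface speed (in Mbps) and the free storage (in whole GB) from the device by other means.

```shell
# Thresholds taken from the prerequisites above.
MIN_SPEED_MBPS=100
MIN_FREE_GB=3

# rma_precheck SPEED_MBPS FREE_GB
# Hypothetical helper: both values are collected from the device by the
# operator and passed in as whole numbers.
# Prints one line per failed check and returns non-zero if any check fails.
rma_precheck() {
  _rc=0
  if [ "$1" -lt "$MIN_SPEED_MBPS" ]; then
    echo "ma1 speed $1 Mbps is below ${MIN_SPEED_MBPS} Mbps"
    _rc=1
  fi
  if [ "$2" -lt "$MIN_FREE_GB" ]; then
    echo "free space $2 GB is below ${MIN_FREE_GB} GB"
    _rc=1
  fi
  return "$_rc"
}
```

For example, `rma_precheck 1000 5` succeeds silently, while `rma_precheck 10 5` reports the slow interface and returns a non-zero status.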

Device types supported

  • Customer Edge (CE)
  • Top of Rack (TOR)
  • Management Switch (Mgmt Switch)
  • Network Packet Broker (NPB)

Note

This workflow supports RMA for only one device at a time. Each POST action accepts input for a single device per request.

Steps to replace a device

Step 1: Disable administrative state

Use the following command to disable the administrative state of the device:

az networkfabric device update-admin-state \
  --state Disable \
  --resource-name "nf-device-name" \
  --resource-group "resource-group-name"

This action sets the following states:

  • Device Administrative State: Disabled

  • Fabric Administrative State: EnabledDegraded
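After each step, it can help to confirm that the device and the fabric reached the expected states before moving on. The sketch below assumes that `az networkfabric device show` and `az networkfabric fabric show` expose the state under the `administrativeState` property. That query path is an assumption, so confirm it against the output of your CLI extension version. The helper function names and the fabric name `nf-name` are hypothetical.

```shell
# get_device_admin_state DEVICE_NAME RESOURCE_GROUP
# Assumption: the state is exposed as "administrativeState" in the show output.
get_device_admin_state() {
  az networkfabric device show \
    --resource-name "$1" \
    --resource-group "$2" \
    --query administrativeState \
    --output tsv
}

# get_fabric_admin_state FABRIC_NAME RESOURCE_GROUP
get_fabric_admin_state() {
  az networkfabric fabric show \
    --resource-name "$1" \
    --resource-group "$2" \
    --query administrativeState \
    --output tsv
}

# expect_state LABEL ACTUAL EXPECTED
# Returns non-zero and prints a message to stderr when the states differ.
expect_state() {
  if [ "$2" = "$3" ]; then
    echo "$1: $2 (ok)"
  else
    echo "$1: expected $3, got $2" >&2
    return 1
  fi
}

# Example usage after Step 1 (placeholder names):
#   expect_state device "$(get_device_admin_state nf-device-name resource-group-name)" Disabled
#   expect_state fabric "$(get_fabric_admin_state nf-name resource-group-name)" EnabledDegraded
```

The same helpers can be reused after later steps by changing the expected values.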

Note

The service doesn't permit this action if any of the following operations are in progress at the fabric level:

  • Device upgrade
  • Configuration push
  • Secret or certificate updates
  • Administrative lock
  • Terminal Server (TS) reprovisioning

Step 2: Update the serial number

Execution conditions:

  • Device Administrative State must be Disabled
  • Fabric Administrative State must be EnabledDegraded

Once the replacement device is physically installed, update its serial number in the fabric resource:

az networkfabric device update \
  --serial-number "replacement-serial-number" \
  --resource-name "nf-device-name" \
  --resource-group "resource-group-name"

Error recovery guidance:

  • If RMA fails because of an incorrect serial number, you can update the serial number again without opening a support ticket.

  • If validation fails after device bootstrap, the system returns the status: Device Unable to Boot Up - Failed.

This action performs the following tasks:

  • Updates the serial number stored in the Azure Resource Manager (ARM) resource

  • Keeps the Device Administrative State as Disabled and the Fabric Administrative State as EnabledDegraded

Note

The expected serial number format is: <Manufacturer;Model;Hardware Version;Serial Number>
For example: "Arista;DCS-7280DR3-XX;12.05;ABC23XXXXXX"
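Because an incorrect serial number causes the RMA to fail, a quick local format check before running the update can save a round trip. The following is a minimal sketch: `is_valid_serial` is a hypothetical helper that verifies only that the value has exactly four non-empty, semicolon-separated fields. It doesn't verify that the values match the physical device.

```shell
# is_valid_serial "Manufacturer;Model;Hardware Version;Serial Number"
# Returns 0 when the value has exactly four non-empty fields separated by ";".
is_valid_serial() {
  case "$1" in
    *";"*";"*";"*";"*) return 1 ;;    # four or more separators: too many fields
    ";"*|*";;"*|*";")   return 1 ;;    # leading, doubled, or trailing ";": empty field
    *";"*";"*";"*)     return 0 ;;    # exactly three separators
    *)                 return 1 ;;    # too few fields
  esac
}
```

For example, `is_valid_serial "Arista;DCS-7280DR3-XX;12.05;ABC23XXXXXX"` succeeds, while passing only the bare serial number `"ABC23XXXXXX"` fails.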

Step 3: Ensure device is in ZTP Mode

Verify that the replacement device is in ZTP mode. If not, configure the device for ZTP before continuing.

Note

ZTP enables automatic configuration retrieval during the RMA process.

Step 4: Initiate RMA process

Initiate the RMA process using the following command:

az networkfabric device update-admin-state \
  --state RMA \
  --resource-name "nf-device-name" \
  --resource-group "resource-group-name"

During this operation:

  • The Network Fabric Controller pushes all required configuration files to the replacement device. If a transient failure occurs, retry the operation until it succeeds.

  • The device boots into its base configuration using the maintenance profile. This condition applies only to TOR and CE device types.

This action sets the following states:

  • Device Administrative State: UnderMaintenance

  • Fabric Administrative State: EnabledDegraded
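The advice to retry on transient failures can be sketched as a generic retry wrapper around the RMA command. The attempt count and delay are illustrative assumptions, and the `retry` and `start_rma` helpers are hypothetical. The wrapper retries on any non-zero exit status and makes no attempt to tell transient failures from permanent ones, so review the error output before relying on it.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  _max=$1
  _delay=$2
  shift 2
  _attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_attempt" -ge "$_max" ]; then
      echo "giving up after ${_attempt} attempt(s)" >&2
      return 1
    fi
    echo "attempt ${_attempt} failed; retrying in ${_delay}s" >&2
    sleep "$_delay"
    _attempt=$((_attempt + 1))
  done
}

# start_rma DEVICE_NAME RESOURCE_GROUP
# Wraps the Step 4 command shown above.
start_rma() {
  az networkfabric device update-admin-state \
    --state RMA \
    --resource-name "$1" \
    --resource-group "$2"
}

# Example usage (placeholder names; 3 attempts, 60 seconds apart):
#   retry 3 60 start_rma "nf-device-name" "resource-group-name"
```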

Step 5: Refresh configuration

This operation pushes the latest configuration to the device and applies to all device types. If a maintenance profile is already configured on the device (applicable to CE and TOR), it's retained during this operation.

az networkfabric device refresh-configuration \
  --resource-name "nf-device-name" \
  --resource-group "resource-group-name"

This action keeps the device and fabric in the following states:

  • Device Administrative State: UnderMaintenance

  • Fabric Administrative State: EnabledDegraded

Step 6: Enable administrative state

Once configuration is applied successfully, bring the device back into active service:

az networkfabric device update-admin-state \
  --state Enable \
  --resource-name "nf-device-name" \
  --resource-group "resource-group-name"

Once the device is fully healthy and synchronized with the fabric, this action sets the following states:

  • Device Administrative State: Enabled

  • Fabric Administrative State: Enabled

Note

If any other device in the fabric is in the Disabled state, the Fabric Administrative State remains EnabledDegraded.

Summary

The RMA workflow in Network Fabric ensures seamless device replacement with controlled state transitions and full configuration synchronization. This helps maintain service continuity and operational consistency across the network.
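The state transitions described in the steps above can be summarized as a simple lookup, which is handy when scripting checks between steps. This is only a sketch: `expected_states` and the step labels `UpdateSerial` and `RefreshConfiguration` are hypothetical names chosen for illustration, and the states are the ones listed in this article.

```shell
# expected_states STEP
# Prints "<device administrative state> <fabric administrative state>"
# as documented for each step of the RMA workflow.
expected_states() {
  case "$1" in
    Disable)              echo "Disabled EnabledDegraded" ;;
    UpdateSerial)         echo "Disabled EnabledDegraded" ;;
    RMA)                  echo "UnderMaintenance EnabledDegraded" ;;
    RefreshConfiguration) echo "UnderMaintenance EnabledDegraded" ;;
    # The fabric stays EnabledDegraded if another device is still Disabled.
    Enable)               echo "Enabled Enabled" ;;
    *) echo "unknown step: $1" >&2; return 1 ;;
  esac
}
```

For example, `expected_states RMA` prints `UnderMaintenance EnabledDegraded`.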

Permitted and non-permitted actions when the fabric is in the EnabledDegraded state

When the fabric is in the EnabledDegraded state, some operations are permitted, some are allowed but should be deferred or handled with caution, and others aren't permitted. The following tables list each category.

Permitted operations

| Operation Category | Examples (APIs / CLI) | Allowed? | Notes / Recommended Practice |
|---|---|---|---|
| READ (non-mutating) | GET/List, Show for Fabric / Devices / ISDs / Networks; metrics & health | Allowed | Safe to monitor state, validate results, and track onboarding |
| RMA Device Replacement Actions | Disable + Update Serial + RMA + Refresh Config + Enable | Allowed | Follow standard Replace Device guide steps |
| Commit (configuration apply) | Start / Monitor Commit (Commit Workflow v2) | Allowed | Configurations pushed to all devices except those in Disabled state |
| VALIDATE (pre-flight checks) | Validate configuration / dry-runs | Allowed | Useful to catch issues before commit |
| Administrative Lock / Unlock | Lock / Unlock fabric | Allowed | No restrictions in this state |

The following operations are technically allowed, but defer them if possible:

| Operation Category | Examples (APIs / CLI) | Allowed? | Notes / Recommended Practice |
|---|---|---|---|
| CREATE/UPDATE config (non-RMA config) | Add/Change ISDs, Networks, Route Policies, Prefs/vias, Taps, Communities | Technically allowed but defer if possible | Configuration won’t reach Disabled devices until RMA completes. Once the device RMA completes, the latest configuration is pushed to the device. |
| DELETE (fabric config) | Remove ISDs, Networks, Policies, Taps | Technically allowed but defer if possible | Disabled devices may retain removed config until re-enabled. |

Non-permitted operations

| Operation Category | Examples (APIs / CLI) | Allowed? | Notes / Recommended Practice |
|---|---|---|---|
| Upgrades | Fabric/Device runtime upgrades | Not permitted | Schedule upgrades after RMA completes and fabric returns to Enabled |
| Secret rotation | Geneva action | Not permitted | TS reprovisioning and device RMA are treated as mutually exclusive operations. If one is active, the other can't be initiated. |
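The permitted, deferred, and non-permitted categories above can be encoded as a guard that automation calls before acting on a fabric in the EnabledDegraded state. This is a minimal sketch: `degraded_policy` and its category keys are hypothetical names that map onto the table rows and aren't identifiers used by the service.

```shell
# degraded_policy CATEGORY
# Prints "allowed", "defer", or "denied" for a fabric in EnabledDegraded state.
# Category keys are illustrative labels for the rows in the tables above.
degraded_policy() {
  case "$1" in
    read|rma|commit|validate|lock) echo "allowed" ;;
    create-update|delete)          echo "defer" ;;
    upgrade|secret-rotation)       echo "denied" ;;
    *) echo "unknown category: $1" >&2; return 1 ;;
  esac
}

# Example usage:
#   if [ "$(degraded_policy upgrade)" = "denied" ]; then
#     echo "Schedule the upgrade after the RMA completes."
#   fi
```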