Azure Operator Nexus observability metrics

03/20/2024

In Operator Nexus Network Fabric (NNF), Ethernet monitoring is critical to maintaining network performance, ensuring availability, and addressing potential issues before they disrupt the fabric. Monitoring covers traffic analysis, device health, security, and details specific to individual Ethernet interfaces.

The following aspects of NNF devices are monitored:

Availability: Monitoring the connectivity of devices helps ensure that the network is available and helps prevent downtime.
Performance: Tracking metrics such as interface bandwidth utilization, packet loss, latency, and jitter lets us evaluate network performance and pinpoint bottlenecks.
Security: Monitoring helps identify suspicious activity, unauthorized access attempts, and potential security threats on the network.
Health: Monitoring device CPU, memory, temperature, fan, power supply status, and interface operational status lets us identify potential failures.

ACL state counters

State counters for Access Control Lists (ACLs) in a network device help you oversee and control network traffic. They report the number of packets that matched each ACL entry. These counters can be examined globally or per interface, and separately for incoming and outgoing traffic.

Metrics Category	Description/Usage	Collection interval	Measured unit
ACL (Access List) Matched Packets	The total count of network packets that match the criteria set by the current Access Control List (ACL) entry in a network device. This count helps in monitoring and managing network traffic.	5 min	Number of packets.
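
As a rough illustration of viewing the counters globally, per interface, and per direction, the following Python sketch aggregates a few samples. The `AclSample` record, its field names, the ACL entry names, and the sample values are all hypothetical; they aren't part of any Operator Nexus API.

```python
from collections import defaultdict
from typing import NamedTuple


class AclSample(NamedTuple):
    # Hypothetical record shape; the field names are assumptions for this sketch.
    interface: str
    direction: str  # "ingress" or "egress"
    acl_entry: str
    matched_packets: int


def summarize(samples):
    """Return matched-packet totals globally, per interface, and per direction."""
    total = 0
    per_interface = defaultdict(int)
    per_direction = defaultdict(int)
    for sample in samples:
        total += sample.matched_packets
        per_interface[sample.interface] += sample.matched_packets
        per_direction[sample.direction] += sample.matched_packets
    return total, dict(per_interface), dict(per_direction)


# Made-up sample data for illustration only.
samples = [
    AclSample("Ethernet1", "ingress", "deny-telnet", 40),
    AclSample("Ethernet1", "egress", "permit-bgp", 10),
    AclSample("Ethernet2", "ingress", "deny-telnet", 5),
]
total, by_interface, by_direction = summarize(samples)
print(total, by_interface, by_direction)
```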

BGP status

Border Gateway Protocol (BGP) connections are essential to effective communication between BGP peers and to optimal network performance. By observing the state of each connection, network administrators can detect network problems or disruptions. For example, a connection that remains in the 'Idle' state could suggest a configuration problem. The 'Established' state, which indicates that BGP peers can exchange routing information, is essential for the network to function correctly.

Metrics Category	Description/Usage	Collection interval	Measured unit
BGP Peer Status	The BGP peer status, as defined by RFC 4271, and summarized after this table.	5 mins and on demand	N/A

The BGP connection states are:

Idle (1): The initial state of a BGP connection.
Connect (2): The system is waiting for the TCP connection to be completed.
Active (3): The system is trying to acquire the peer by listening for, and accepting, a TCP connection.
OpenSent (4): The system is waiting to receive an OPEN message from the peer.
OpenConfirm (5): The system is waiting for a KEEPALIVE or NOTIFICATION message from the peer.
Established (6): The BGP connection is fully established and the peers can exchange UPDATE messages.
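
The numeric values above can be decoded in a monitoring script. The following Python sketch maps the values to names and lists the peers that aren't in the Established state. The `BgpPeerState` enum, the `peers_not_established` helper, and the peer addresses are hypothetical names chosen for this example.

```python
from enum import IntEnum


class BgpPeerState(IntEnum):
    # Numeric values as listed above.
    IDLE = 1
    CONNECT = 2
    ACTIVE = 3
    OPEN_SENT = 4
    OPEN_CONFIRM = 5
    ESTABLISHED = 6


def peers_not_established(peer_states):
    """Map each peer whose state isn't Established to its state name."""
    return {
        peer: BgpPeerState(value).name
        for peer, value in peer_states.items()
        if value != BgpPeerState.ESTABLISHED
    }


# Hypothetical peers (documentation address range) and sample values.
observed = {"192.0.2.1": 6, "192.0.2.2": 1, "192.0.2.3": 3}
print(peers_not_established(observed))
```

A peer that keeps appearing in this result across several collection intervals is a candidate for closer inspection.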

Component operational state

The operational state of a hardware or software component shows its current functioning state.

Metrics Category	Description/Usage	Collection interval	Measured unit
Component Operation Status	The operational status of entities that can be part of the device's inventory, such as line cards, transceivers, fans, and power supplies. The possible values are described after this table.	5 mins and on demand	N/A

The possible operational states are:

Active (0): The component is enabled and active (up).
Inactive (1): The component is enabled but inactive (down).
Disabled (2): The component is administratively disabled.
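
To get a quick overview of a device's inventory, the state values can be tallied. This is a minimal Python sketch. The component names and the `summarize_components` helper are hypothetical, and grouping unexpected values under "Unrecognized" is a choice made for this example.

```python
from collections import Counter

# Numeric values as listed above.
COMPONENT_STATES = {0: "Active", 1: "Inactive", 2: "Disabled"}


def summarize_components(component_states):
    """Count components per operational state.

    Values outside the documented set are grouped under 'Unrecognized'.
    """
    return Counter(
        COMPONENT_STATES.get(value, "Unrecognized")
        for value in component_states.values()
    )


# Hypothetical inventory and sample values.
inventory = {"Fan1": 0, "Fan2": 1, "PowerSupply1": 0, "Linecard3": 2}
summary = summarize_components(inventory)
print(dict(summary))
```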

Interface operational state

The operational state of an interface in a network device shows its current functioning state.

Metrics Category	Description/Usage	Collection interval	Measured unit
Interface Operational State	The operational state of the interface. The possible values are described after this table.	5 mins	N/A

The possible operational states are:

Up (0): The interface is operational and able to transmit and receive data.
Down (1): The interface isn't operational, and is unable to transmit or receive data.
Lower_layer_down (2): The interface is down due to a failure in a lower layer of the network stack.
Testing (3): The interface is undergoing testing and isn't yet operational.
Unknown (4): The status of the interface is unknown, possibly due to a failure in the device's monitoring system.
Dormant (5): The interface is operational but is currently in a dormant state, meaning it isn't transmitting or receiving data.
Not_present (6): The interface isn't present in the device, possibly because it has been physically removed or it hasn't been installed yet.
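
The following Python sketch decodes the interface state values and checks whether an interface is able to pass traffic. Based on the descriptions above, it treats only Up as passing traffic. The enum and helper names are hypothetical.

```python
from enum import IntEnum


class InterfaceOperState(IntEnum):
    # Numeric values as listed above.
    UP = 0
    DOWN = 1
    LOWER_LAYER_DOWN = 2
    TESTING = 3
    UNKNOWN = 4
    DORMANT = 5
    NOT_PRESENT = 6


def describe(value):
    """Return the state name, or a placeholder for values outside the list."""
    try:
        return InterfaceOperState(value).name
    except ValueError:
        return f"UNRECOGNIZED({value})"


def can_pass_traffic(value):
    """Only the Up state is described as transmitting and receiving data."""
    return value == InterfaceOperState.UP


print(describe(2), can_pass_traffic(2))
print(describe(0), can_pass_traffic(0))
```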

Interface state counters

Interface state counters track the number of frames or packets matching certain conditions. All of them are collected at 5-minute intervals.

Metrics Category	Description/Usage
Ethernet Interface In CRC Errors	The total count of received frames with lengths between 64 and 1,518 octets that have either an FCS Error or an Alignment Error. These errors indicate issues in data transmission that need to be addressed for reliable communication.
Ethernet Interface In Fragment Frames	The count of fragment frames received on the interface. A fragment frame is shorter than the minimum Ethernet frame length (64 octets) and has either an incorrect checksum value, known as an FCS Error, or an improper bit count, known as an Alignment Error.
Ethernet Interface In Jabber Frames	The count of jabber frames received via the interface. Jabber frames are frames that exceed the standard size and also possess an erroneous Cyclic Redundancy Check (CRC).
Ethernet Interface In MAC Control Frames	Control frames at the MAC layer received on the interface.
Ethernet Interface In MAC Pause Frames	PAUSE frames at the MAC layer received via the interface.
Ethernet Interface In Maxsize Exceeded	The total count of well-structured frames that were dropped on the interface due to surpassing the maximum frame size.
Ethernet Interface In Oversize Frames	The total count of well-structured frames received that exceeded 1,518 octets in length (not counting framing bits, but inclusive of FCS octets).
Ethernet Interface Out MAC Control Frames	Control frames at the MAC layer sent on the interface.
Ethernet Interface Out MAC Pause Frames	PAUSE frames at the MAC layer sent via the interface.
Interface In Broadcasts packets	The number of packets delivered by this sublayer to a higher layer or sublayer that were addressed to a broadcast address at this sublayer.
Interface In Discards	The number of inbound packets that were discarded even though no errors were detected to prevent their being deliverable to a higher-layer protocol.
Interface In Errors	For packet-oriented interfaces, the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
Interface In FCS Errors	Number of received packets that had errors in the frame check sequence (FCS).
Interface In Multicast Packets	The number of packets delivered by this sublayer to a higher layer or sublayer that were addressed to a multicast address at this sublayer. For a MAC-layer protocol, these addresses include both Group and Functional addresses.
Interface In Octets	The total number of octets received on the interface, including framing characters.
Interface In Packets	The total number of packets received on the interface, including all unicast, multicast, broadcast, and bad packets.
Interface In Unicasts Packets	The number of packets delivered by this sublayer to a higher layer or sublayer that weren't addressed to a multicast or broadcast address at this sublayer.
Interface Out Broadcast Packets	The total number of packets that were addressed to a broadcast address at this sublayer, including packets that were discarded or not sent.
Interface Out Discards	The number of outbound packets that were discarded even though no errors were detected to prevent their being transmitted.
Interface Out Errors	For packet-oriented interfaces, the number of outbound packets that couldn't be transmitted because of errors.
Interface Out Multicast Packets	The total number of outgoing packets that were addressed to a multicast address at this sublayer, including packets that were discarded or not sent. For a MAC-layer protocol, these addresses include both Group and Functional addresses.
Interface Out Octets	The total number of octets transmitted out of the interface, including framing characters.
Interface Out Packets	The total number of packets transmitted out of the interface, including all unicast, multicast, broadcast, and bad packets.
Interface Out Unicasts Packets	The total number of outgoing packets that weren't addressed to a multicast or broadcast address at this sublayer, including packets that were discarded or not sent.
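
The frame-size definitions in the table (CRC errors, fragments, jabbers, and oversize frames) can be summarized as a small decision function. This Python sketch is only an illustration of those definitions, not how a device implements its counters. It assumes a single `fcs_ok` flag that is false for either an FCS Error or an Alignment Error, and it uses the 64 and 1,518 octet limits quoted above; a device configured for larger frames might apply different limits. The "undersize" label is an extra case that isn't among the metrics listed.

```python
MIN_FRAME_OCTETS = 64    # lower bound quoted in the CRC errors description
MAX_FRAME_OCTETS = 1518  # upper bound quoted in the oversize description


def classify_frame(length_octets, fcs_ok):
    """Classify a received frame by length and checksum validity."""
    if length_octets < MIN_FRAME_OCTETS:
        # Too short: a fragment if the checksum is bad.
        # "undersize" is not one of the metrics in the table above.
        return "undersize" if fcs_ok else "fragment"
    if length_octets > MAX_FRAME_OCTETS:
        # Too long: oversize if well formed, jabber if the checksum is bad.
        return "oversize" if fcs_ok else "jabber"
    # Normal length: either a good frame or a CRC error.
    return "ok" if fcs_ok else "crc_error"


for length, fcs_ok in [(60, False), (1000, False), (1600, False), (1600, True)]:
    print(length, fcs_ok, classify_frame(length, fcs_ok))
```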

Interface state rate

Interface state rates measure how quickly packets are received, transmitted, or discarded on an interface, which makes them useful indicators of network performance and dependability. A persistently high discard rate could signal congestion or an unstable network connection, potentially resulting in diminished performance.

All of the state rate measurements are collected at 5-minute intervals.

Metrics Category	Description/Usage
Interface In Discards Rate	The rate at which incoming packets are being discarded on a network interface. Packets can be discarded for various reasons such as network congestion, faulty hardware, or configuration issues.
Interface In Packets Rate	The rate of packets received on the interface, including all unicasts, multicasts, broadcasts, and bad packets.
Interface Out Discards Rate	The rate of outbound packets that were discarded even though no errors were detected to prevent them being transmitted.
Interface Out Packets Rate	The rate of packets transmitted out of the interface, including all unicasts, multicasts, broadcasts, and bad packets.
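
One way to reason about a rate is as the change in a counter divided by the time between two samples. The Python sketch below shows that arithmetic for the 5-minute (300-second) collection interval. It assumes the underlying counters are cumulative, which this article doesn't state, and it doesn't claim to reproduce how the platform computes the rate metrics. The `per_second_rate` helper is a hypothetical name.

```python
def per_second_rate(previous, current, interval_seconds=300.0):
    """Average per-second rate between two samples of a cumulative counter.

    Returns None when the counter went backwards (for example, after a
    reset), because no meaningful rate can be derived from that pair.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if current < previous:
        return None
    return (current - previous) / interval_seconds


# Made-up counter values taken 5 minutes apart.
print(per_second_rate(1000, 4000))
print(per_second_rate(4000, 100))
```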

LACP state rate

The LACP state rate is the rate at which Link Aggregation Control Protocol (LACP) control packets, known as LACPDUs, are transmitted or received by an interface that supports LACP. Monitoring it is important because of its potential effects on network performance and dependability.

All of the state rate measurements are collected at 5-minute intervals.

Metrics Category	Description/Usage
Lacp Errors	The number of Link Aggregation Control Protocol Data Units (LACPDUs) that were received but are illegal, because they're incorrectly formed or carry an illegal Protocol Subtype value.
Lacp In Packets	Number of LACPDUs received.
Lacp Out Packets	Number of LACPDUs transmitted.
Lacp RX errors	Number of LACPDU receive errors.
Lacp TX errors	Number of LACPDU transmit errors.
Lacp unknown Errors	Number of LACPDU unknown errors.
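
A raw error count is easier to judge relative to the amount of LACP traffic. The following Python sketch computes the ratio of receive errors to received LACPDUs. It assumes both values cover the same time window; whether errored LACPDUs are also included in Lacp In Packets isn't stated here, so treat the result as a rough indicator. The helper name is hypothetical.

```python
def lacp_rx_error_ratio(in_packets, rx_errors):
    """Ratio of LACPDU receive errors to LACPDUs received.

    Returns 0.0 when no LACPDUs were received, to avoid dividing by zero.
    """
    if in_packets < 0 or rx_errors < 0:
        raise ValueError("counts must be non-negative")
    if in_packets == 0:
        return 0.0
    return rx_errors / in_packets


# Made-up values for one collection interval.
print(lacp_rx_error_ratio(200, 5))
```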

LLDP state counters

LLDP state counters show how many LLDP frames have been sent and received by a network device. LLDP stands for Link Layer Discovery Protocol, which is a standard protocol that allows devices to advertise their identity, capabilities, and neighbors on a local area network. LLDP state counters can help network administrators monitor the health and performance of the network, troubleshoot connectivity issues, and discover the topology and configuration of the devices.

All of the measurements are collected at 5-minute intervals.

Metrics Category	Description/Usage
Lldp Frame in	The number of Link Layer Discovery Protocol (LLDP) frames that the device received.
Lldp Frame out	The number of LLDP frames that the device sent.
Lldp TLV unknown	The number of LLDP frames received that contain unknown Type-Length-Value (TLV) entries. TLVs are used in network protocols to specify optional information. An "unknown" TLV suggests the device received data that it doesn't recognize or can't interpret, which could indicate compatibility issues within the network.
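
As one example of using these counters for troubleshooting, the Python sketch below flags interfaces whose received-frame counter didn't change between two samples, which might mean a neighbor has stopped advertising. It assumes the counter is cumulative and tracked per interface, neither of which is stated above. The interface names and the `silent_interfaces` helper are hypothetical.

```python
def silent_interfaces(previous, current):
    """Return interfaces whose received-LLDP-frame count is unchanged.

    Interfaces that appear in only one of the two samples are ignored.
    """
    return sorted(
        name
        for name, count in current.items()
        if name in previous and count == previous[name]
    )


# Made-up samples from two consecutive collection intervals.
earlier = {"Ethernet1": 120, "Ethernet2": 75, "Ethernet3": 10}
later = {"Ethernet1": 130, "Ethernet2": 75, "Ethernet4": 3}
print(silent_interfaces(earlier, later))
```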

Network fabric device resource utilization

Resource utilization metrics provide a holistic view of the performance and health of a network fabric device, and show how efficiently its resources are being used. They measure CPU workload, cooling efficiency, memory availability, power performance, and heat levels. These metrics are essential for optimizing system performance, managing resources effectively, and preventing hardware damage due to excessive heat.

Metrics Category	Description/Usage	Collection Interval	Measured Unit
CPU Utilization Avg	The CPU Utilization Average Metric is the mean percentage of computing resources used by a processor over a specific time interval.	1 min	Percentage
CPU Utilization Instant	The immediate percentage of computing resources being used by a processor at a specific time. It provides real-time insight into the CPU's workload and performance.	1 min	Percentage
CPU Utilization Max	The highest percentage of computing resources used by a processor over a specific time interval.	1 min	Percentage
CPU Utilization Min	The lowest percentage of computing resources used by a processor over a given time interval.	1 min	Percentage
Fan Speed	The immediate rate at which the cooling fan is spinning. It's a crucial metric for maintaining optimal operating temperature and ensuring the longevity of the device's components.	1 min	RPM (Not available)
Memory Available	The available memory physically installed, or logically allocated to the component.	1 min	Bytes
Power Supply Input Current	The amount of electrical current, measured in amps, that the power supply unit (PSU) draws from the source. It's a crucial factor in determining the PSU's efficiency and the overall power consumption of the device.	1 min	Amps (Not available)
Power supply Input Voltage	The electrical potential, measured in volts, that the source supplies to the power supply unit (PSU). It's a key parameter in ensuring the PSU can adequately convert AC power to DC power for the device's needs.	1 min	Volts (Not available)
Power Supply Max Power Capacity	Maximum power capacity of the power supply.	1 min	Watts (Not available)
Power Supply Output current	The output current supplied by the power supply.	1 min	Amps (Not available)
Power Supply Output Voltage	The output voltage supplied by the power supply.	1 min	Volts (Not available)
Power Supply Output Power	The amount of electrical power, measured in watts, that the power supply unit (PSU) delivers to the device's components. It's a critical factor in ensuring the device has sufficient power for optimal performance.	1 min
Temperature Instantaneous	The real-time temperature of the device's components.	1 min
Temperature Max	The highest safe operating temperature for the device's components. Exceeding this limit can lead to overheating, which might cause performance issues, component damage, or even lead to device failure. It's crucial to monitor and manage the device's temperature to ensure its longevity and optimal performance.	1 min
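
To show how the two temperature metrics relate, the Python sketch below computes the headroom between the instantaneous reading and the maximum, and flags readings that come within a safety margin. It assumes both values use the same unit, which the table doesn't specify. The 10-degree default margin and the helper names are hypothetical choices for this example.

```python
def temperature_headroom(instantaneous, maximum):
    """Difference between the maximum safe temperature and the current reading."""
    return maximum - instantaneous


def is_overheating_risk(instantaneous, maximum, margin=10.0):
    """True when the reading is within `margin` of the maximum, or above it."""
    return temperature_headroom(instantaneous, maximum) < margin


# Made-up readings; both values are assumed to share one unit.
print(temperature_headroom(62.0, 95.0), is_overheating_risk(62.0, 95.0))
print(temperature_headroom(90.0, 95.0), is_overheating_risk(90.0, 95.0))
```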
