Question 1

What are the platform requirements for the nodes to be used for monitoring by NPM?

Accepted Answer

Listed below are the platform requirements for NPM's various capabilities:

NPM's Performance Monitor and Service Connectivity Monitor capabilities support both Windows server and Windows desktops/client operating systems. Windows server OS versions supported are 2008 SP1 or later. Windows desktops/client versions supported are Windows 10, Windows 8.1, Windows 8, and Windows 7.
NPM's ExpressRoute Monitor capability supports only Windows server (2008 SP1 or later) operating system.

Question 2

Can I use machines as monitoring nodes in NPM?

Accepted Answer

The capability to monitor networks using Linux-based nodes is now generally available. Access the agent here.

Question 3

What are the size requirements of the nodes to be used for monitoring by NPM?

Accepted Answer

For running the NPM solution on node VMs to monitor networks, the nodes should have at least 500-MB memory and one core. You don't need to use separate nodes for running NPM. The solution can run on nodes that have other workloads running on it. The solution has the capability to stop the monitoring process if it uses more than 5% CPU.

Question 4

To use NPM, should I connect my nodes as Direct agent or through System Center Operations Manager?

Accepted Answer

Both the Performance Monitor and the Service Connectivity Monitor capabilities support nodes connected as Direct Agents and connected through Operations Manager.

For ExpressRoute Monitor capability, the Azure nodes should be connected as Direct Agents only. Azure nodes, which are connected through Operations Manager are not supported. For on-premises nodes, the nodes connected as Direct Agents and through Operations Manager are supported for monitoring an ExpressRoute circuit.

Question 5

Which protocol among TCP and ICMP should be chosen for monitoring?

Accepted Answer

If you're monitoring your network using Windows server-based nodes, we recommend you use TCP as the monitoring protocol since it provides better accuracy.

ICMP is recommended for Windows desktops/client operating system-based nodes. This platform doesn't allow TCP data to be sent over raw sockets, which NPM uses to discover network topology.

You can get more details on the relative advantages of each protocol here.

Question 6

How can I configure a node to support monitoring using TCP protocol?

Accepted Answer

For the node to support monitoring using TCP protocol:

Ensure that the node platform is Windows Server (2008 SP1 or later).
Run EnableRules.ps1 PowerShell script on the node. See instructions for more details.

Question 7

How can I change the TCP port being used by NPM for monitoring?

Accepted Answer

You can change the TCP port used by NPM for monitoring, by running the EnableRules.ps1 script. You need enter the port number you intend to use as a parameter. For example, to enable TCP on port 8060, run EnableRules.ps1 8060. Ensure that you use the same TCP port on all the nodes being used for monitoring.

The script configures only Windows Firewall locally. If you have network firewall or Network Security Group (NSG) rules, make sure that they allow the traffic destined for the TCP port used by NPM.

Question 8

How many agents should I use?

Accepted Answer

You should use at least one agent for each subnet that you want to monitor.

Question 9

What is the maximum number of agents I can use or I see error ".... you've reached your Configuration limit"?

Accepted Answer

NPM limits the number of IPs to 5000 IPs per workspace. If a node has both IPv4 and IPv6 addresses, this will count as 2 IPs for that node. Hence, this limit of 5000 IPs would decide the upper limit on the number of agents. You can delete the inactive agents from Nodes tab in NPM >> Configure. NPM also maintains history of all the IPs that were ever assigned to the VM hosting the agent and each is counted as separate IP contributing to that upper limit of 5000 IPs. To free up IPs for your workspace, you can use the Nodes page to delete the IPs that are not in use.

Question 10

How are loss and latency calculated

Accepted Answer

Source agents send either TCP SYN requests (if TCP is chosen as the protocol for monitoring) or ICMP ECHO requests (if ICMP is chosen as the protocol for monitoring) to destination IP at regular intervals to ensure that all the paths between the source-destination IP combination are covered. The percentage of packets received and packet round-trip time is measured to calculate the loss and latency of each path. This data is aggregated over the polling interval and over all the paths to get the aggregated values of loss and latency for the IP combination for the particular polling interval.

Question 11

With what frequency does the source agent send packets to the destination for monitoring?

Accepted Answer

For Performance Monitor and ExpressRoute Monitor capabilities, the source sends packets every 5 seconds and records the network measurements. This data is aggregated over a 3-minute polling interval to calculate the average and peak values of loss and latency. For Service Connectivity Monitor capability, the frequency of sending the packets for network measurement is determined by the frequency entered by the user for the specific test while configuring the test.

Question 12

How many packets are sent for monitoring?

Accepted Answer

The number of packets sent by the source agent to destination in a polling is adaptive and is decided by our proprietary algorithm, which can be different for different network topologies. More the number of network paths between the source-destination IP combination, more is the number of packets that are sent. The system ensures that all paths between the source-destination IP combination are covered.

Question 13

How does NPM discover network topology between source and destination?

Accepted Answer

NPM uses a proprietary algorithm based on Traceroute to discover all the paths and hops between source and destination.

Question 14

Does NPM provide routing and switching level info

Accepted Answer

Though NPM can detect all the possible routes between the source agent and the destination, it does not provide visibility into which route was taken by the packets sent by your specific workloads. The solution can help you identify the paths and underlying network hops, which are adding more latency than you expected.

Question 15

Why are some of the paths unhealthy?

Accepted Answer

Different network paths can exist between the source and destination IPs and each path can have a different value of loss and latency. NPM marks those paths as unhealthy (denoted with red color) for which the values of loss and/or latency is greater than the respective threshold set in the monitoring configuration.

Question 16

What does a hop in red color signify in the network topology map?

Accepted Answer

If a hop is red, it signifies that it is part of at-least one unhealthy path. NPM only marks the paths as unhealthy, it does not segregate the health status of each path. To identify the troublesome hops, you can view the hop-by-hop latency and segregate the ones adding more than expected latency.

Question 17

How does fault localization in Performance Monitor work?

Accepted Answer

NPM uses a probabilistic mechanism to assign fault-probabilities to each network path, network segment, and the constituent network hops based on the number of unhealthy paths they are a part of. As the network segments and hops become part of more number of unhealthy paths, the fault-probability associated with them increases. This algorithm works best when you have many nodes with NPM agent connected to each other as this increases the data points for calculating the fault-probabilities.

Question 18

What are the default Log Analytics queries for alerts

Accepted Answer

Performance monitor query

NetworkMonitoring
 | where (SubType == "SubNetwork" or SubType == "NetworkPath") 
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy") and RuleName == "<>"

Service connectivity monitor query

NetworkMonitoring
 | where (SubType == "EndpointHealth" or SubType == "EndpointPath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or ServiceResponseHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy") and TestName == "<>"

ExpressRoute monitor queries: Circuits query

NetworkMonitoring
 | where (SubType == "ERCircuitTotalUtilization") and (UtilizationHealthState == "Unhealthy") and CircuitResourceId == "<>"

Private peering

NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERVNetConnectionUtilization" or SubType == "ExpressRoutePath")   
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitName == "<>" and VirtualNetwork == "<>"

Microsoft peering

NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERMSPeeringUtilization" or SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitName == ""<>" and PeeringType == "MicrosoftPeering"

Common query

NetworkMonitoring
 | where (SubType == "ExpressRoutePeering" or SubType == "ERVNetConnectionUtilization" or SubType == "ERMSPeeringUtilization" or SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy")

Question 19

Can NPM monitor routers and servers as individual devices?

Accepted Answer

NPM only identifies the IP and host name of underlying network hops (switches, routers, servers, etc.) between the source and destination IPs. It also identifies the latency between these identified hops. It does not individually monitor these underlying hops.

Question 20

Can NPM be used to monitor network connectivity between Azure and AWS?

Accepted Answer

Yes. Refer to the article Monitor Azure, AWS, and on-premises networks using NPM for details.

Question 21

Is the ExpressRoute bandwidth usage incoming or outgoing?

Accepted Answer

Bandwidth usage is the total of incoming and outgoing bandwidth. It is expressed in Bits/sec.

Question 22

Can we get incoming and outgoing bandwidth information for the ExpressRoute?

Accepted Answer

Incoming and outgoing values for both Primary and Secondary bandwidth can be captured.

For MS peering level information, use the below mentioned query in Log Search

NetworkMonitoring
 | where SubType == "ERMSPeeringUtilization"
 | project CircuitName,PeeringName,BitsInPerSecond,BitsOutPerSecond

For private peering level information, use the below mentioned query in Log Search

NetworkMonitoring
 | where SubType == "ERVNetConnectionUtilization"
 | project CircuitName,PeeringName,BitsInPerSecond,BitsOutPerSecond

For circuit level information, use the below mentioned query in Log Search

NetworkMonitoring
 | where SubType == "ERCircuitTotalUtilization"
 | project CircuitName, BitsInPerSecond, BitsOutPerSecond

Question 23

Which regions are supported for NPM's Performance Monitor?

Accepted Answer

NPM can monitor connectivity between networks in any part of the world, from a workspace that is hosted in one of the supported regions

Question 24

Which regions are supported for NPM's Service Connectivity Monitor?

Accepted Answer

NPM can monitor connectivity to services in any part of the world, from a workspace that is hosted in one of the supported regions

Question 25

Which regions are supported for NPM's ExpressRoute Monitor?

Accepted Answer

NPM can monitor your ExpressRoute circuits located in any Azure region. To onboard to NPM, you will require a Log Analytics workspace that must be hosted in one of the supported regions

Question 26

Why are some of the hops marked as unidentified in the network topology view?

Accepted Answer

NPM uses a modified version of traceroute to discover the topology from the source agent to the destination. An unidentified hop represents that the network hop did not respond to the source agent's traceroute request. If three consecutive network hops do not respond to the agent's traceroute, the solution marks the unresponsive hops as unidentified and does not try to discover more hops.

A hop may not respond to a traceroute in one or more of the below scenarios:

The routers have been configured to not reveal their identity.
The network devices are not allowing ICMP_TTL_EXCEEDED traffic.
A firewall is blocking the ICMP_TTL_EXCEEDED response from the network device.

When either of the endpoints lies in Azure, traceroute shows up unidentified hops as Azure Infrastructure does not reveal identity to traceroute.

Question 27

I get alerts for unhealthy tests but I do not see the high values in NPM's loss and latency graph. How do I check what is unhealthy?

Accepted Answer

NPM raises an alert if end to end latency between source and destination crosses the threshold for any path between them. Some networks have multiple paths connecting the same source and destination. NPM raises an alert is any path is unhealthy. The loss and latency seen in the graphs is the average value for all the paths, hence it may not show the exact value of a single path. To understand where the threshold has been breached, look for the "SubType" column in the alert. If the issue is caused by a path the SubType value will be NetworkPath (for Performance Monitor tests), EndpointPath (for Service Connectivity Monitor tests) and ExpressRoutePath (for ExpressRotue Monitor tests).

Sample Query to find is path is unhealthy:

NetworkMonitoring
 | where ( SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitResourceID =="<your ER circuit ID>" and ConnectionResourceId == "<your ER connection resource id>"
 | project SubType, LossHealthState, LatencyHealthState, MedianLatency

Sample Query to find is path is unhealthy:

NetworkMonitoring
 | where ( SubType == "ExpressRoutePath")
 | where (LossHealthState == "Unhealthy" or LatencyHealthState == "Unhealthy" or UtilizationHealthState == "Unhealthy") and CircuitResourceID =="" and ConnectionResourceId == ""
 | project SubType, LossHealthState, LatencyHealthState, MedianLatency

Question 28

Why does my test show unhealthy but the topology does not

Accepted Answer

NPM monitors end-to-end loss, latency, and topology at different intervals. Loss and latency are measured once every 5 seconds and aggregated every three minutes (for Performance Monitor and Express Route Monitor) while topology is calculated using traceroute once every 10 minutes. For example, between 3:44 and 4:04, topology may be updated three times (3:44, 3:54, 4:04), but loss and latency are updated about seven times (3:44, 3:47, 3:50, 3:53, 3:56, 3:59, 4:02). The topology generated at 3:54 will be rendered for the loss and latency that gets calculated at 3:56, 3:59 and 4:02. Suppose you get an alert that your ER circuit was unhealthy at 3:59. You log on to NPM and try to set the topology time to 3:59. NPM will render the topology generated at 3:54. To understand the last known topology of your network, compare the fields TimeProcessed (time at which loss and latency was calculated) and TracerouteCompletedTime(time at which topology was calculated).

Question 29

What is the difference between the fields E2EMedianLatency and AvgHopLatencyList in the NetworkMonitoring table

Accepted Answer

E2EMedianLatency is the latency updated every three minutes after aggregating the results of tcp ping tests, whereas AvgHopLatencyList is updated every 10 mins based on traceroute. To understand the exact time at which E2EMedianLatency was calculated, use the field TimeProcessed. To understand the exact time at which traceroute completed and updated AvgHopLatencyList, use the field TracerouteCompletedTime

Question 30

Why does hop-by-hop latency numbers differ from HopLatencyValues

Accepted Answer

HopLatencyValues are source to endpoint. For Example: Hops - A,B,C. AvgHopLatency - 10,15,20. This means source to A latency = 10, source to B latency = 15 and source to C latency is 20. UI will calculate A-B hop latency as 5 in the topology

Question 31

The solution shows 100% loss but there is connectivity between the source and destination

Accepted Answer

This can happen if either the host firewall or the intermediate firewall (network firewall or Azure NSG) is blocking the communication between the source agent and the destination over the port being used for monitoring by NPM (by default the port is 8084, unless the customer has changed this).

To verify that the host firewall is not blocking the communication on the required port, view the health status of the source and destination nodes from the following view: Network Performance Monitor -> Configuration -> Nodes. If they are unhealthy, view the instructions and take corrective action. If the nodes are healthy, move to the step b. below.
To verify that an intermediate network firewall or Azure NSG is not blocking the communication on the required port, use the third-party PsPing utility using the below instructions:
- psping utility is available for download here
- Run following command from the source node.
  - psping -n 15 :portNumber By default NPM uses 8084 port. In case you have explicitly changed this by using the EnableRules.ps1 script, enter the custom port number you are using). This is a ping from Azure machine to on-premises
Check if the pings are successful. If not, then it indicates that an intermediate network firewall or Azure NSG is blocking the traffic on this port.
Now, run the command from destination node to source node IP.

Question 32

There is loss from node A to B, but not from node B to A. Why?

Accepted Answer

As the network paths between A to B can be different from the network paths between B to A, different values for loss and latency can be observed.

Question 33

Why are all my ExpressRoute circuits and peering connections not being discovered?

Accepted Answer

NPM now discovers ExpressRoute circuits and peering connections in all subscriptions to which the user has access. Choose all the subscriptions where your Express Route resources are linked and enable monitoring for each discovered resource. NPM looks for connection objects when discovering a private peering, so please check if a VNET is associated with your peering. NPM does not detect circuits and peering that are in a different tenant from the Log Analytics workspace.

Question 34

The ER Monitor capability has a diagnostic message "Traffic is not passing through ANY circuit". What does that mean?

Accepted Answer

There can be a scenario where there is a healthy connection between the on-premises and Azure nodes but the traffic is not going over the ExpressRoute circuit configured to be monitored by NPM.

This can happen if:

The ER circuit is down.
The route filters are configured in such a manner that they give priority to other routes (such as a VPN connection or another ExpressRoute circuit) over the intended ExpressRoute circuit.
The on-premises and Azure nodes chosen for monitoring the ExpressRoute circuit in the monitoring configuration, do not have connectivity to each other over the intended ExpressRoute circuit. Ensure that you have chosen correct nodes that have connectivity to each other over the ExpressRoute circuit you intend to monitor.

Question 35

Why does ExpressRoute Monitor report my circuit/peering as unhealthy when it is available and passing data.

Accepted Answer

ExpressRoute Monitor compares the network performance values (loss, latency and bandwidth utilization) reported by the agents/service with the thresholds set during Configuration. For a circuit, if the bandwidth utilization reported is greater than the threshold set in Configuration, the circuit is marked as unhealthy. For peerings, if the loss, latency or bandwidth utilization reported is greater than the threshold set in the Configuration, the peering is marked as unhealthy. NPM does not utilize metrics or any other form of data to decide health state.

Question 36

Why does ExpressRoute Monitor'bandwidth utilization report a value different from metrics bits in/out

Accepted Answer

For ExpressRoute Monitor, bandwidth utilization is the average of incoming and outgoing bandwidth over the last 20 mins It is expressed in Bits/sec. For Express Route metrics, bit in/out are per minute data points.Internally the dataset used for both is the same, but the aggregation varies between NPM and ER metrics. For granular, minute by minute monitoring and fast alerts, we recommend setting alerts directly on ER metrics

Question 37

While configuring monitoring of my ExpressRoute circuit, the Azure nodes are not being detected.

Accepted Answer

This can happen if the Azure nodes are connected through Operations Manager. The ExpressRoute Monitor capability supports only those Azure nodes that are connected as Direct Agents.

Question 38

I cannot Discover by ExpressRoute circuits in the OMS portal

Accepted Answer

Though NPM can be used both from the Azure portal as well as the OMS portal, the circuit discovery in the ExpressRoute Monitor capability works only through the Azure portal. Once the circuits are discovered through the Azure portal, you can then use the capability in either of the two portals.

Question 39

In the Service Connectivity Monitor capability, the service response time, network loss, as well as latency are shown as NA

Accepted Answer

This can happen if one or more is true:

The service is down.
The node used for checking network connectivity to the service is down.
The target entered in the test configuration is incorrect.
The node doesn't have any network connectivity.

Question 40

In the Service Connectivity Monitor capability, a valid service response time is shown but network loss as well as latency are shown as NA

Accepted Answer

This can happen if one or more is true:

If the node used for checking network connectivity to the service is a Windows client machine, either the target service is blocking ICMP requests or a network firewall is blocking ICMP requests that originate from the node.
The Perform network measurements check box is blank in the test configuration.

Question 41

In the Service Connectivity Monitor capability, the service response time is NA but network loss as well as latency are valid

Accepted Answer

This can happen if the target service is not a web application but the test is configured as a Web test. Edit the test configuration, and choose the test type as Network instead of Web.

Question 42

Is there a performance impact on the node being used for monitoring?

Accepted Answer

NPM process is configured to stop if it utilizes more than 5% of the host CPU resources. This is to ensure that you can keep using the nodes for their usual workloads without impacting performance.

Question 43

Does NPM edit firewall rules for monitoring?

Accepted Answer

NPM only creates a local Windows Firewall rule on the nodes on which the EnableRules.ps1 PowerShell script is run to allow the agents to create TCP connections with each other on the specified port. The solution does not modify any network firewall or Network Security Group (NSG) rules.

Question 44

How can I check the health of the nodes being used for monitoring?

Accepted Answer

You can view the health status of the nodes being used for monitoring from the following view: Network Performance Monitor -> Configuration -> Nodes. If a node is unhealthy, you can view the error details and take the suggested action.

Question 45

Can NPM report latency numbers in microseconds?

Accepted Answer

NPM rounds the latency numbers in the UI and in milliseconds. The same data is stored at a higher granularity (sometimes up to four decimal places).

Question 46

Does NPM support multi-homed nodes?

Accepted Answer

No. Each NPM node requires a dedicated log analytics workspace.

Question 47

What additional requirements does the NPM have for Linux?

Accepted Answer

The OMS agent for Linux also requires the use of GLIBC 2.14 or above.

Share via

Network Performance Monitor solution FAQ

Set up and configure agents