scom not reporting high cpu

John Curtiss 66 Reputation points
2021-04-12T19:00:03.66+00:00

happens all the time, has happened for several versions of scom and several versions of windows. an agent server goes from 15% cpu to 100% cpu in about 3 seconds, which is super unhealthy. but scom never notices because the agent server is too busy to let the agent tell scom about it, i guess? so the monitor never changes state, so no alerts are generated, no recoveries are started. has anybody else seen this, and how do you handle it?

here's a box that was at 100% cpu for over two days straight. no alert. not until somebody manually went in and restarted a runaway service this morning did scom start seeing CPU readings again.

86979-servercpu.jpg

System Center Operations Manager

4 answers

  1. Crystal-MSFT 53,991 Reputation points Microsoft External Staff
    2021-04-13T02:11:40.15+00:00

    @JohnpCurtiss, I did some research and found a blog post from Kevin Holman that seems to describe this situation. It appears that the monitor runs every 15 minutes and evaluates after 3 samples. The samples are not evaluated as consecutive samples; they are AVERAGED.

    Before the monitor changes state, all the thresholds must be met. This means that even if a server is stuck at 100% CPU utilization, it will not generate an alert most of the time. More details are in the following link:
    https://kevinholman.com/2017/05/13/how-does-cpu-monitoring-work-in-the-windows-server-2016-management-pack/
    Note: This is a non-Microsoft link, provided for reference only.
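To make the averaging behaviour concrete, here is a minimal sketch of the evaluation as described above. It is not the management pack's actual script: the function name, the 95% threshold, and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean


def cpu_monitor_unhealthy(samples, num_samples=3, threshold=95.0):
    """Return True when the mean of the last `num_samples` readings exceeds `threshold`.

    `threshold=95.0` is an assumed value, not taken from the management pack.
    """
    if len(samples) < num_samples:
        # Not enough samples collected yet, so the state cannot change.
        return False
    return mean(samples[-num_samples:]) > threshold


# A server idling at 15% that has just jumped to 100%:
readings = [15.0, 15.0, 100.0]
print(cpu_monitor_unhealthy(readings))  # False: the average is still well below 95

# Two more samples at 100% push the older readings out of the window:
readings += [100.0, 100.0]
print(cpu_monitor_unhealthy(readings))  # True: the last three samples are all 100
```

With a 15-minute interval, this kind of windowed average only flips state several intervals after the spike, and only if the samples keep arriving.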

    Hope this helps.



  2. System Center guy 691 Reputation points
    2021-04-14T03:15:04.317+00:00

    Issue

    1. an agent server goes from 15% cpu to 100% cpu in about 3 seconds and no SCOM alert

    > By default, SCOM uses the "Total CPU Utilization Percentage" monitor to detect high CPU utilization, but this monitor only generates an alert when both the CPU queue length and the CPU utilization are higher than their thresholds. So high CPU utilization alone does not trigger the alert.
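The two-condition check described above can be sketched roughly as follows. The function name and both default thresholds are assumptions for illustration only; the real values depend on the management pack version and any overrides in place.

```python
def total_cpu_alert(cpu_percent, queue_length,
                    cpu_threshold=95.0, queue_threshold=15.0):
    """Return True only when BOTH readings exceed their thresholds.

    Both default thresholds are assumed values, not taken from SCOM.
    """
    return cpu_percent > cpu_threshold and queue_length > queue_threshold


# CPU pegged at 100% but a short processor queue: no alert.
print(total_cpu_alert(100.0, 4.0))                       # False

# CPU pegged and a long queue: alert.
print(total_cpu_alert(100.0, 20.0))                      # True

# Overriding the queue threshold to zero leaves CPU as the deciding factor.
print(total_cpu_alert(100.0, 4.0, queue_threshold=0.0))  # True
```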

    Roger


  3. John Curtiss 66 Reputation points
    2021-04-14T03:26:08.707+00:00

    That's not it. My queue length has been set to zero via override for a very long time. My interval is also ten minutes, and samples is set to two. I get cpu alerts all the time when a server sits at 96% for ten minutes. This is a separate problem.


  4. CyrAz 5,181 Reputation points
    2021-04-14T08:18:51.31+00:00

    Don't take my word for it, but if the CPU is so high that the SCOM agent can't even collect perf metrics, it would make sense that it can't send an alert about those metrics either.
    However, there may have been alerts about failed WMI queries, failed scripts, failed perf counter collection, etc., either in SCOM itself or in the Operations Manager event log.
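If the agent really does stop collecting while the CPU is pegged, one way to catch it is to look for gaps in the collected samples rather than for high values. The sketch below shows that idea only. It does not use any SCOM API: the function name, the tolerance factor, and the timestamps are hypothetical, and the 10-minute interval is borrowed from the override mentioned earlier in the thread.

```python
from datetime import datetime, timedelta


def find_collection_gaps(timestamps, interval=timedelta(minutes=10), tolerance=2):
    """Return (previous, next) timestamp pairs that are more than
    `tolerance * interval` apart, i.e. periods with missing samples."""
    limit = interval * tolerance
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical sample times: three normal readings, then silence for two days.
start = datetime(2021, 4, 10, 8, 0)
samples = [
    start,
    start + timedelta(minutes=10),
    start + timedelta(minutes=20),
    start + timedelta(days=2, minutes=20),
]

for gap_start, gap_end in find_collection_gaps(samples):
    print(f"No CPU samples between {gap_start} and {gap_end}")
```

A check like this would have to run somewhere other than the overloaded agent, for example against data already stored on the management side.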

