SCOM Ping Monitor

Saiyad Rahim 351 Reputation points
2021-03-04T21:00:40.117+00:00

Hi All,

Someone in our environment rebooted a Critical Production Server and no alert was received from SCOM.

Management is out for blood: they want to know why, and they want it fixed.
The issue I found is that the server is a VM, and it went down and came back up within 6 seconds.

SCOM is too slow to detect this failure, as it only polls every 60 seconds and will not alert on a failure until the 4th minute.
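
To make the timing concrete, here is a rough Python sketch of why a 6-second outage slips past a 60-second poll. The threshold of 3 consecutive missed polls is an assumption inferred from "alert on the 4th minute", not a documented SCOM setting, and the function name is made up for illustration.

```python
POLL_INTERVAL_S = 60   # from the post: SCOM polls every 60 seconds
OUTAGE_S = 6           # from the post: the VM was down for about 6 seconds
MISSES_TO_ALERT = 3    # assumption: consecutive missed polls needed before an alert


def polls_hit(outage_start: int, outage_len: int, interval: int, horizon: int) -> int:
    """Count polls (at t = 0, interval, 2*interval, ...) that fall inside the outage."""
    outage_end = outage_start + outage_len
    return sum(1 for t in range(0, horizon, interval) if outage_start <= t < outage_end)


# Try every whole-second outage start time within one polling cycle.
hits_per_start = [
    polls_hit(start, OUTAGE_S, POLL_INTERVAL_S, 2 * POLL_INTERVAL_S + 1)
    for start in range(POLL_INTERVAL_S)
]

worst_case_misses = max(hits_per_start)                   # most polls one outage can fail
starts_noticed = sum(1 for h in hits_per_start if h > 0)  # start times that fail any poll

print(f"worst case: {worst_case_misses} missed poll(s), {MISSES_TO_ALERT} needed to alert")
print(f"{starts_noticed} of {POLL_INTERVAL_S} start times fail even a single poll")
```

Under these assumptions a short reboot fails at most one poll, so a check that waits for several consecutive misses never fires.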

Has anyone been able to find a suitable fix for this, apart from setting up Event ID monitors for Shutdown, Startup, etc.?

I have also tried the OpsLogix Ping Monitor, but while it is effective at alerting, I cannot customise the alert console descriptions, which is a significant drawback for alerts going to my Level 1 support.

I am thinking of a PowerShell monitor that runs "independently" of the SCOM Agent. If the SCOM Agent service stops for any reason, for example because the server is being shut down, a script running under the agent would be killed with it and I might not receive any alert.

Does anyone here have any good ideas, or a script like this, that can help save my bacon?

Operations Manager

4 answers

  1. System Center guy 686 Reputation points
    2021-03-10T03:42:58.667+00:00

    SCOM uses a heartbeat to determine whether the agent is up. If the heartbeat interval is set too short, there is a high chance of false alerts caused by network or communication issues. If you want to catch a server going down and coming back up within a short window such as 6 seconds, you should create an event alert for server reboots instead.

    You may create an event alert rule that monitors the following events:

    Event ID  Description
    41        The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
    1074      Logged when an app (e.g. Windows Update) causes the system to restart, or when a user initiates a restart or shutdown.
    6006      Logged on a clean shutdown, with the message "The Event log service was stopped".
    6008      Logged after a dirty shutdown, with the message "The previous system shutdown at time on date was unexpected".
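
    As a rough illustration of the filtering such a rule performs, the Python sketch below picks the reboot-related IDs out of a stream of event IDs. The function names and the sample sequence are made up for the example; in SCOM the matching is done by the event rule itself, not by a script like this.

```python
from typing import Iterable, List, Optional, Tuple

# Event IDs and meanings taken from the list above.
REBOOT_EVENT_IDS = {
    41: "unclean reboot (hang, crash or power loss)",
    1074: "restart or shutdown initiated by a user or application",
    6006: "clean shutdown (Event log service stopped)",
    6008: "previous shutdown was unexpected",
}


def classify_event(event_id: int) -> Optional[str]:
    """Return a short label for reboot-related events, or None for anything else."""
    return REBOOT_EVENT_IDS.get(event_id)


def reboot_related(event_ids: Iterable[int]) -> List[Tuple[int, str]]:
    """Keep only the events an alert rule for reboots would care about."""
    return [(eid, REBOOT_EVENT_IDS[eid]) for eid in event_ids if eid in REBOOT_EVENT_IDS]


sample = [7036, 1074, 6006, 10016]  # made-up sequence of System log event IDs
for event_id, label in reboot_related(sample):
    print(event_id, label)
```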

    Roger


  2. System Center guy 686 Reputation points
    2021-03-10T08:26:00.66+00:00

    You may consider using the free OpsLogix Ping Management Pack to ping the target host and check whether it is up or down. You can configure how many seconds to wait before the next ICMP ping and how many ping replies can be missed before an alert is raised.
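
    The "missed replies before alerting" idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not how the OpsLogix pack is implemented; `alert_states` and `misses_to_alert` are names invented for the example.

```python
from typing import Iterable, List


def alert_states(replies: Iterable[bool], misses_to_alert: int) -> List[bool]:
    """For each probe result, report whether the monitor would be alerting.

    An alert is raised once `misses_to_alert` consecutive probes go unanswered
    and clears again on the first successful reply.
    """
    states = []
    consecutive_misses = 0
    for got_reply in replies:
        consecutive_misses = 0 if got_reply else consecutive_misses + 1
        states.append(consecutive_misses >= misses_to_alert)
    return states


# One isolated miss is tolerated; three misses in a row raise the alert.
replies = [True, False, True, False, False, False, True]
print(alert_states(replies, misses_to_alert=3))
# -> [False, False, False, False, False, True, False]
```

    A shorter interval with a small miss threshold detects brief outages faster, at the cost of more false alerts from transient packet loss.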

    https://www.opslogix.com/ping-management-pack

    Roger


  3. Saiyad Rahim 351 Reputation points
    2021-03-10T21:09:23.257+00:00

    How though?

    There is no override option in the GUI, so it will need to be done in the XML, right?

    But how do I craft the XML so that it uses "Text Description A" where the server name is like %filesvr% and "Text Description B" where the server name is like %SQL%? Is this the logic to use, or something else?
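
    Expressed outside of SCOM, the selection logic being asked about looks roughly like the Python sketch below. It only illustrates the pattern-to-description idea; it is not management pack XML, and the default text and function name are invented for the example.

```python
from fnmatch import fnmatchcase

# Ordered rules: the first matching pattern wins. "*filesvr*" plays the role
# of %filesvr% in the question above.
DESCRIPTION_RULES = [
    ("*filesvr*", "Text Description A"),
    ("*sql*", "Text Description B"),
]
DEFAULT_DESCRIPTION = "Server is not responding to ping."  # hypothetical fallback


def description_for(server_name: str) -> str:
    """Pick an alert description based on a case-insensitive name pattern."""
    name = server_name.lower()
    for pattern, description in DESCRIPTION_RULES:
        if fnmatchcase(name, pattern):
            return description
    return DEFAULT_DESCRIPTION


print(description_for("PRD-FILESVR01"))
print(description_for("sql-02"))
print(description_for("web01"))
```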


  4. Dwayne 1 Reputation point
    2021-11-22T03:27:15.8+00:00

    @Saiyad Rahim well, I'm all against running rules/monitors every few seconds, and as an exception I might agree to 1 min. It's probably less of an issue in smaller environments than in large ones.

    So you want pings against servers with meaningful messages for each, but don't want to do a lot of work...

    That's either a lot of rules, the effort of setting up a template (or installing someone else's) and deploying it for each IP, or cheating...

    What do I mean by cheating? Well, you can dynamically name a rule's alert message; many don't know this. So you could do it all in two rules, with no XML editing or extra tools/management packs:

    1. Set up a rule to run a PowerShell script (easiest if the community PowerShell management pack is installed). That script will, for instance, pull all instances of a group (SCOM calls, CSV, etc.), or use parameters to specify what's being pinged, then ping each target and drop the results to an event log (the easiest way). Make sure the key elements are parameterized for extraction, or that the line is exactly what you want as an alert message.
    2. Set up an event log monitor to watch that event log and use dynamic naming based on the parameters.

    e.g. "Cannot Ping Server %1 from %2"
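
    The ping step could look roughly like this. It is a Python sketch of the logic rather than the PowerShell script itself: the event ID 9001, the server names and the function names are hypothetical, and writing the records to the Windows event log is left out.

```python
import platform
import subprocess
from typing import Callable, Dict, Iterable, List

PING_FAILED_EVENT_ID = 9001  # hypothetical event ID, chosen only for illustration


def ping_once(host: str, timeout_s: float = 2.0) -> bool:
    """Send one ICMP echo via the OS ping command; True if it exits with code 0.

    The exit code is a coarse signal, so check how ping behaves on your platform.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def failed_ping_events(
    targets: Iterable[str], source: str, ping: Callable[[str], bool] = ping_once
) -> List[Dict]:
    """Build one event record per unreachable target.

    params[0] is the target and params[1] the source, so they line up with
    %1 and %2 in "Cannot Ping Server %1 from %2".
    """
    return [
        {"event_id": PING_FAILED_EVENT_ID, "params": [target, source]}
        for target in targets
        if not ping(target)
    ]


# Demo with a fake ping so the example runs without touching the network.
def fake_ping(host: str) -> bool:
    return host != "sql01"


events = failed_ping_events(["filesvr01", "sql01"], source="scom-ms01", ping=fake_ping)
print(events)
```

    Passing the ping function in as a parameter keeps the record-building logic testable without sending real ICMP traffic.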

    How does it work? Well, if you drop the parameter data from the event into the alert, you can reference it in the alert name as a positional parameter.

    e.g. for the alert above, if they were Param[1] and Param[2], the alert message would be:

    Test Pings failing
    Target Device : $Data/Params/Param[1]$
    Source Device : $Data/Params/Param[2]$

    or use the whole event description, with the alert name set to %1:

    $Data[Default='']/EventDescription$
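
    To show what the positional substitution amounts to, here is a small Python sketch that fills %1/%2 and $Data/Params/Param[n]$ placeholders from an event's parameter list. It mimics the behaviour described above and is not SCOM's own implementation; `render` and the server names are invented for the example.

```python
import re
from typing import Sequence

# Matches either $Data/Params/Param[n]$ or %n and captures the position n.
_PLACEHOLDER = re.compile(r"\$Data/Params/Param\[(\d+)\]\$|%(\d+)")


def render(template: str, params: Sequence[str]) -> str:
    """Replace 1-based positional placeholders with the event's parameters.

    Placeholders that point past the end of `params` are left untouched.
    """
    def substitute(match: "re.Match[str]") -> str:
        position = int(match.group(1) or match.group(2))
        if 1 <= position <= len(params):
            return params[position - 1]
        return match.group(0)

    return _PLACEHOLDER.sub(substitute, template)


params = ["sql01", "scom-ms01"]  # hypothetical target and source
print(render("Cannot Ping Server %1 from %2", params))
# -> Cannot Ping Server sql01 from scom-ms01
print(render("Target Device : $Data/Params/Param[1]$", params))
# -> Target Device : sql01
```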

    Another cheat, to make this an HA rule (i.e. not reliant on management server 'A'), is to create a management pack for this to live in and add a resource pool to it (you will need to edit the XML, as otherwise you can't target a pool). Set the pool to manual, add just the SCOM servers you want in there, then target the rules at the pool.

    If this all seems a bit dodgy, in a way it is.
