SCOM Ping Monitor

Question

Saiyad Rahim 411

Hi All,

Someone in our environment rebooted a Critical Production Server and no alert was received from SCOM.

Management is out for blood as to why and to have it fixed.
The issue I found is that the server is a VM, and it went down and came back up within 6 seconds.

SCOM is too slow to detect this failure because it only polls every 60 seconds and does not alert on a failure until the fourth minute.

Has anyone been able to find a suitable fix for this, apart from setting up Event ID monitors for shutdown, startup, etc.?

I have also tried the OpsLogix Ping Monitor. It is effective at alerting, but I cannot customise the alert descriptions shown in the console, which is a significant drawback for alerts going to my Level 1 support.

I am thinking of a PowerShell monitor that runs independently of the SCOM agent. If the SCOM agent service stops for any reason, for example because the server is being shut down, any script hosted by the agent is killed with it and I might not receive any alert.

Does anyone out here have any good ideas, or a script like this, that can help save my bacon?

ThoRumAT 1 Reputation point

2021-03-12T12:00:48.4+00:00

I'd rather use a startup/shutdown PowerShell script. You can set this up via AD or a local Group Policy setting.
Within the script you could send an email to 1st level support, noting that the system is about to shut down or has just started.

The system waits until the script has finished and won't shut down before then.

Additionally, you can create an override on the agent heartbeat settings for the critical production server and lower the heartbeat interval to as little as 5 seconds.
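
To put numbers on the heartbeat idea, here is a minimal sketch of the detection window. It is a simplified model, not SCOM's actual health logic: it assumes an alert is raised only after a fixed number of consecutive heartbeats are missed, and the value of 3 missed heartbeats and the function names are assumptions made for illustration.

```python
def min_guaranteed_outage_s(heartbeat_interval_s, missed_allowed):
    """Shortest outage that is certain to span `missed_allowed` heartbeats."""
    return heartbeat_interval_s * missed_allowed


def can_slip_through(outage_s, heartbeat_interval_s, missed_allowed):
    """True if an outage this short may end before the threshold is reached."""
    return outage_s < min_guaranteed_outage_s(heartbeat_interval_s, missed_allowed)


if __name__ == "__main__":
    # 60 s is the polling interval from the question; 5 s is the lowered value.
    for interval in (60, 5):
        window = min_guaranteed_outage_s(interval, 3)  # 3 is an assumed threshold
        print(interval, window, can_slip_through(6, interval, 3))
```

Under these assumptions a 6-second reboot can still go unnoticed at a 5-second interval, which is why the startup/shutdown script is the more dependable half of this suggestion.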

Answer 1

SCOM uses the agent heartbeat to determine whether an agent is up. If the heartbeat interval is set too short, there is a high chance of false alerts caused by network or communication issues. If you want to catch a server that goes down and comes back within a short period such as 6 seconds, you should create an event alert rule for server reboots instead.

You can create an event alert rule that monitors the following events:

Event ID   Description
41         The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
1074       Logged when an app (e.g. Windows Update) causes the system to restart, or when a user initiates a restart or shutdown.
6006       Logged on a clean shutdown. It gives the message "The Event log service was stopped".
6008       Logged on a dirty shutdown. It gives the message "The previous system shutdown at time on date was unexpected".

Roger
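
The event table above can be turned into a small lookup that builds a readable alert line. This is only a sketch of the mapping idea in Python, not an SCOM rule definition; the "planned"/"unexpected" labels and the function names are assumptions added for illustration, while the event IDs and their meanings come from the table.

```python
# Event IDs and meanings are taken from the table in the answer above.
# The "planned" / "unexpected" labels are an illustrative assumption.
REBOOT_EVENTS = {
    41: ("unexpected", "System rebooted without cleanly shutting down first"),
    1074: ("planned", "Restart or shutdown initiated by an app or a user"),
    6006: ("planned", "Clean shutdown (Event log service was stopped)"),
    6008: ("unexpected", "Previous system shutdown was unexpected"),
}


def classify_event(event_id):
    """Return (kind, summary) for a reboot-related event ID, else None."""
    return REBOOT_EVENTS.get(event_id)


def alert_text(server, event_id):
    """Build a one-line alert message, or None for unrelated events."""
    hit = classify_event(event_id)
    if hit is None:
        return None
    kind, summary = hit
    return f"{server}: {kind} shutdown/restart (event {event_id}): {summary}"
```

For example, `alert_text("PRD01", 6008)` yields a line flagged as unexpected, while an unrelated event ID returns `None` and raises nothing.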

Saiyad Rahim 411 Reputation points

2021-03-10T05:14:45.887+00:00

Hi Roger,

Is there any way to get alerted by SCOM when the server actually goes down (say within 3-4 seconds), rather than once the server is back up and running?

Answer 2

System Center guy 691

You may consider using the free OpsLogix Ping Management Pack to ping the target host and check whether it is up or down. You can configure how many seconds to wait before sending the next ICMP ping and how many ping replies can be missed before an alert is raised.

https://www.opslogix.com/ping-management-pack

Roger
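
The "interval plus missed replies" idea can be sketched as a consecutive-miss counter. This is a generic illustration, not the OpsLogix implementation; it assumes the alert fires once `missed_threshold` replies in a row are lost, and the function name is hypothetical.

```python
def first_alert_index(replies, missed_threshold):
    """Return the index of the ping at which an alert would fire, or None.

    replies: sequence of booleans, True meaning a ping reply was received.
    missed_threshold: consecutive missed replies needed to raise the alert.
    """
    missed = 0
    for index, got_reply in enumerate(replies):
        if got_reply:
            missed = 0  # any reply resets the streak
        else:
            missed += 1
            if missed >= missed_threshold:
                return index
    return None
```

With a 2-second ping interval, a 6-second outage would show up as roughly three missed replies in a row, so a threshold of 3 sits right at the edge of catching it and a lower threshold trades that for more false alerts.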

CyrAz 5,181 Reputation points

2021-03-10T08:37:30.38+00:00

Agreed, especially because it uses a managed module, so the workflow will be noticeably lighter than using a scripted monitor, for example.

Saiyad Rahim 411 Reputation points

2021-03-10T19:01:05.537+00:00

Thanks Guys.

I do have that installed, and I agree it is the closest I have got to what I need.

However, in OpsLogix I can't find a way to target specific groups instead of adding hosts manually. Is there also a way to override the alert description in the SCOM console with a custom message?

Customising the alert description in the console is important so that support teams can read and follow the instructions for each type of server when it goes down, and escalate to the correct team.

CyrAz 5,181 Reputation points

2021-03-10T19:23:03.847+00:00

No, you can't target groups with this MP (and more generally, you can't target groups when you create monitors or rules in SCOM).
You can, however, override the alert names and descriptions: https://kevinholman.com/2020/08/02/how-to-override-the-alert-name-and-alert-description-of-a-sealed-monitor/

Saiyad Rahim 411 Reputation points

2021-03-10T20:03:37.957+00:00

Hi Cyril,

I have seen this article but have yet to give it a try.
What I am not sure of is how to set different descriptions, for example one for a File Server alert and another for SQL Server alerts.

From reading the article, it seems the same text description will be used for any server type that generates a Ping Lost alert... close, but no cigar.
Hence, if I could target this monitor at individual groups, I could override the text for each group.

It just seems like OpsLogix needs to take those extra few steps to complete this monitor... or offer a Pro version with all the features an MP requires.

CyrAz 5,181 Reputation points

2021-03-10T21:02:02.44+00:00

You could simply create one override for the File Server and a different one for the SQL Server...
Now, if you're really interested in a "custom" paid-for version, you can try sending an email to OpsLogix; they may be interested in authoring it for you...

Answer 3

Saiyad Rahim 411

How though?

There is no override option in the GUI, so it will need to be done in the XML, right?

But how do I craft the XML so that it uses "Text Description A" where the server name matches %filesvr% and "Text Description B" where the server name matches %SQL%? Is this the logic to use, or something else?
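
The selection logic being asked about can be written down as a simple pattern lookup. This is a sketch of the matching idea only and is not SCOM override XML; the patterns mirror the %filesvr% and %SQL% placeholders in the question, and the default text and function name are hypothetical.

```python
from fnmatch import fnmatchcase

# First matching pattern wins; patterns are compared in lower case.
DESCRIPTION_RULES = [
    ("*filesvr*", "Text Description A"),
    ("*sql*", "Text Description B"),
]
DEFAULT_DESCRIPTION = "Ping lost"  # hypothetical fallback text


def description_for(server_name):
    """Pick the alert description for a server based on its name."""
    name = server_name.lower()
    for pattern, text in DESCRIPTION_RULES:
        if fnmatchcase(name, pattern):
            return text
    return DEFAULT_DESCRIPTION
```

In SCOM itself the equivalent would be one override per server or server type, as suggested earlier in the thread, rather than name matching inside a single description.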

Answer 4

@Saiyad Rahim Well, I'm all against running rules/monitors every few seconds; as an exception I might agree to 1 minute. It's probably less of an issue in smaller environments than in large ones.

So you want pings against servers with meaningful messages for each, but don't want to do a lot of work...

That's either a lot of rules, the effort of setting up a template (or installing someone else's) and deploying it for each IP, or cheating...

What do I mean by cheating? Well, you can dynamically name a rule's alert message; many don't know this. So you could do it all in two rules, with no XML editing or extra tools/management packs:

Set up a rule to run a PowerShell script (easiest if the community PowerShell management pack is installed). That script will, for instance, pull all instances of a group (SCOM calls, CSV, etc.) or take parameters that specify what is being pinged, ping each one, and write the results to an event log (the easiest way). Make sure the key elements are parameterized for extraction, or that the line is exactly what you want as an alert message.
Set up an event log monitor to watch that event log and use dynamic naming based on the parameters.

e.g. "Cannot Ping Server %1 from %2"
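
The first rule's script could look roughly like the sketch below. It is written in Python purely to show the flow (the answer suggests PowerShell), it returns event records instead of writing to the Windows event log, and the event ID 9001, the record layout, and the function names are all hypothetical.

```python
import platform
import subprocess


def ping_once(host, timeout_s=1):
    """Send one ICMP echo via the system ping command; True on exit code 0.

    Caveat: Windows ping can exit with 0 on some "unreachable" replies,
    so a production script may need to inspect the output as well.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:  # Linux-style flags
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def failed_ping_events(targets, source, ping=ping_once):
    """Return one event record per target that did not answer.

    params[0] is the target and params[1] the source, so they line up
    with %1 and %2 in the alert name.
    """
    events = []
    for target in targets:
        if not ping(target):
            events.append({
                "event_id": 9001,  # hypothetical event ID
                "params": [target, source],
                "message": f"Cannot Ping Server {target} from {source}",
            })
    return events
```

The `ping` argument is injectable so the logic can be exercised without touching the network, e.g. `failed_ping_events(["sql01"], "ms01", ping=lambda host: False)`.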

How does it work? If you pass the parameter data from the event into the alert, you can reference it in the alert name as a positional parameter.

e.g. for the alert above, if they were Param[1] and Param[2], the alert message would be:

Test Pings failing
Target Device : $Data/Params/Param[1]$
Source Device : $Data/Params/Param[2]$

Or pass the whole event description and set the alert name to %1:

$Data[Default='']/EventDescription$
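
To see how the placeholders above resolve, here is a minimal sketch of the substitution. It only mimics the idea with regular expressions and is not how SCOM expands alert parameters internally; both function names are hypothetical.

```python
import re


def expand_positional(template, params):
    """Replace %1, %2, ... with the matching 1-based entry of params."""
    def repl(match):
        index = int(match.group(1))
        if 1 <= index <= len(params):
            return params[index - 1]
        return match.group(0)  # leave unknown placeholders untouched
    return re.sub(r"%(\d+)", repl, template)


def expand_data_params(template, params):
    """Replace $Data/Params/Param[n]$ with the matching 1-based entry."""
    def repl(match):
        index = int(match.group(1))
        if 1 <= index <= len(params):
            return params[index - 1]
        return match.group(0)
    return re.sub(r"\$Data/Params/Param\[(\d+)\]\$", repl, template)
```

With `params = ["sql01", "ms01"]`, the alert name template from this answer expands to "Cannot Ping Server sql01 from ms01", and the "Target Device" line picks up the same first parameter.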

Another cheat, to make this a highly available rule (i.e. not reliant on management server 'A'): create a management pack for it, add a resource pool to that pack (you will need to edit the XML, as otherwise you can't target a pool), set the pool to manual, add just the SCOM servers you want in there, then target the rules at the pool.

If this all seems a bit dodgy, in a way it is.
