Alerts Management
Potential issue in the alert management process
Operator teams often run into problems when managing alerts in OpsMgr2007 alert views.
When an alert is raised, an operator acknowledges the alert, manages the issue, and closes the alert if not already done.
The following chapters explain why this can be an issue, and how to manage it.
Alert concept in OpsMgr2007
OpsMgr2007 manages two types of alert:
Ø Alerts generated by a Rule
Ø Alerts generated by a Monitor
Alerts generated by Rule
This type of alert is generated by an OpsMgr2007 rule, which does not affect the health of the target object (the alert source).
In the alert details window, a link to the rule appears under Alert Rule.
An alert generated by a rule can be configured to consolidate if the problem is raised again. In this case, the Repeat Count field is updated.
The Alert Context tab contains the last event that was used to generate the alert.
The auto-resolve process closes an alert generated by a rule only if the alert is still in the New resolution state.
Rules are generally used to collect information, such as events or performance counters. This information is used for troubleshooting, analysis, capacity planning, and reporting.
Rules are also sometimes used for proactive monitoring; in this case, the rule is configured to generate an alert.
Converted management packs can also contain rules that generate reactive alerts, because MOM2005 had no monitor concept.
This type of alert is resolved automatically only by the OpsMgr2007 auto-resolve process, either when the alert is still in the “New” resolution state or when the alert source is healthy.
Summary
Ø The health of the target object is not affected
Ø The alert context contains the last event
Ø Auto-resolved if the resolution state is still New, or if the alert source is healthy
Ø Not auto-resolved when the issue is solved
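The consolidation behavior described above can be illustrated with a small model. This is only a sketch and does not use the OpsMgr SDK: the `RuleAlert` class, the `raise_rule_alert` function, the rule and server names, and the assumption that the repeat count starts at 0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RuleAlert:
    """Hypothetical, simplified stand-in for an alert raised by a rule."""
    rule_name: str
    source: str
    last_event: str              # what the alert context tab would show
    repeat_count: int = 0        # assumed to start at 0
    resolution_state: str = "New"


def raise_rule_alert(open_alerts, rule_name, source, event):
    """Consolidate onto an open alert for the same rule and source, else create one."""
    key = (rule_name, source)
    alert = open_alerts.get(key)
    if alert is not None and alert.resolution_state != "Closed":
        alert.repeat_count += 1  # same problem raised again
        alert.last_event = event # the context keeps only the last event
        return alert
    alert = RuleAlert(rule_name, source, event)
    open_alerts[key] = alert
    return alert


open_alerts = {}
raise_rule_alert(open_alerts, "Disk error rule", "srv01", "event #1")
alert = raise_rule_alert(open_alerts, "Disk error rule", "srv01", "event #2")
print(alert.repeat_count, alert.last_event)
```

In this model the second occurrence does not create a second alert; it only increments the repeat count and replaces the stored event.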
Alert generated by Monitor
This type of alert is generated by an OpsMgr2007 Monitor which affects the health of the target object (Alert Source).
In the alert details window, a link to the monitor appears under Alert Monitor.
The Repeat Count field is never updated.
The alert is used as a notification when the monitor changes the health state of an object from Healthy to Critical (or Warning).
IMPORTANT
If the alert is closed manually, the monitor health state of the related object is not reset to Healthy, and if the problem still occurs, the monitor will never generate a new alert. Therefore, an alert generated by a monitor, unlike one generated by a rule, should not be closed manually; it should be managed through the health of the target object. When the health returns to Healthy, the alert closes automatically.
An alert generated by a monitor is also closed by the OpsMgr2007 auto-resolve process if the resolution state is still New; however, the health state of the alert source is not updated.
The Alert Context tab contains the event that changed the monitor health state and generated the alert.
Summary
Ø Never manually close an alert generated by a Monitor
Ø Manage the problem by using Health Explorer
Ø The alert is closed automatically once the problem is solved and the monitor has received the configured healthy event
Ø Alerts closed automatically by an OpsMgr connector do not reset the monitor health state
Ø The alert is also closed by the OpsMgr2007 auto-resolve process if the resolution state is still New; however, the health state of the alert source is not updated
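The reason a manually closed monitor alert never comes back can be shown with a toy model. The sketch below assumes, as the text describes, that a monitor raises an alert only when its health state changes. The `Monitor` class and its methods are hypothetical and are not part of the OpsMgr SDK.

```python
class Monitor:
    """Hypothetical model: a monitor raises an alert only when its state changes."""

    def __init__(self):
        self.health_state = "Healthy"
        self.alerts = []  # resolution state of every alert raised so far

    def on_event(self, unhealthy):
        new_state = "Critical" if unhealthy else "Healthy"
        if new_state == self.health_state:
            return  # no state change, so no new alert
        self.health_state = new_state
        if unhealthy:
            self.alerts.append("New")
        else:
            # the healthy event closes the alerts raised earlier
            self.alerts = ["Closed" for _ in self.alerts]

    def close_alert_manually(self):
        # closes the latest alert but leaves health_state untouched
        self.alerts[-1] = "Closed"

    def reset(self):
        self.health_state = "Healthy"


monitor = Monitor()
monitor.on_event(unhealthy=True)   # Healthy -> Critical, alert raised
monitor.close_alert_manually()     # operator closes the alert
monitor.on_event(unhealthy=True)   # problem persists, but no new alert
alerts_without_reset = len(monitor.alerts)

monitor.reset()                    # monitor health state is reset
monitor.on_event(unhealthy=True)   # state changes again, new alert raised
alerts_after_reset = len(monitor.alerts)
print(alerts_without_reset, alerts_after_reset)
```

Until `reset()` is called, the model stays in the Critical state and the persisting problem stays invisible in the alert view.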
How to manage this behavior
As it is not possible to prevent an operator from closing an alert that was generated by a monitor, and therefore not possible to ensure that the monitor health state has also been reset to Healthy, I have developed two tools to manage this behavior:
Ø ResetMonitorFromAllClosedAlerts.exe
Ø ResetMonitorfromAlertId.exe
ResetMonitorFromAllClosedAlerts
This tool scans all closed monitor alerts and checks the state of the related monitor. If the monitor is not healthy, its state is reset. If the issue is still occurring the next time the monitor runs after the reset, a new alert is raised.
To make sure that the scanned alert corresponds to the last state change of the monitor, the tool compares the time the alert was added with the time of the monitor’s last health state change. The difference must be less than 90 seconds, which is a reasonable indicator that the alert and the health state change are related.
(Alert.TimeAdded - (DateTime)monitoringObject.GetMonitoringStates(monitors)[0].LastTimeModified).TotalSeconds < 90
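The same comparison can be sketched outside the SDK. The function name below is hypothetical, and taking the absolute difference is an assumption made here so that the order of the two timestamps does not matter; the original expression only shows a plain subtraction.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(seconds=90)  # threshold stated in the text


def alert_matches_last_state_change(alert_time_added, monitor_last_modified):
    """True when the alert and the last health state change are under 90 s apart."""
    # abs() is an assumption: it makes the check independent of timestamp order
    return abs(alert_time_added - monitor_last_modified) < MAX_GAP


added = datetime(2009, 7, 13, 16, 14, 15)
print(alert_matches_last_state_change(added, datetime(2009, 7, 13, 16, 14, 15)))
print(alert_matches_last_state_change(added, datetime(2009, 7, 13, 16, 10, 0)))
```

The first call compares identical timestamps and is accepted; the second compares timestamps more than four minutes apart and is rejected.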
This tool can be launched from the command line on the RMS server.
Without any option, the tool does not reset any monitor, but lists all monitors that should be reset.
==== Reset Monitor!!!
Monitoring ObjectPath: OM2007R2.dom02.com
Alert Name: SPEC - Monitor Object from Syslog Event (critical/information)
Alert ResolvedBy: DOM02\Administrator
Alert TimeAdded: 13.07.2009 16:14:15

Monitor DisplayName: SPEC - Monitor Object from Syslog Event (critical/information)
Monitor HealthState: Error
Monitor LastTimeModified: 13.07.2009 16:14:15
With the -r option, all detected monitors are reset.
ResetMonitorFromAllClosedAlerts.exe -r
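The difference between the report-only run and the -r run can be sketched as follows. This is not the tool’s source code: the function name, the dictionary layout, and the use of "Healthy" as the reset state are assumptions, and the 90-second timestamp check is left out for brevity.

```python
def scan_closed_alerts(closed_monitor_alerts, reset=False):
    """Return the monitors that should be reset; reset them only when reset=True."""
    candidates = []
    for alert in closed_monitor_alerts:
        monitor = alert["monitor"]
        if monitor["health_state"] != "Healthy":
            candidates.append(monitor["name"])
            if reset:  # corresponds to running the tool with -r
                monitor["health_state"] = "Healthy"
    return candidates


# Hypothetical sample data: two closed alerts and their related monitors
alerts = [
    {"monitor": {"name": "Syslog monitor", "health_state": "Error"}},
    {"monitor": {"name": "Disk monitor", "health_state": "Healthy"}},
]
print(scan_closed_alerts(alerts))              # report only
print(scan_closed_alerts(alerts, reset=True))  # report and reset
print(scan_closed_alerts(alerts))              # nothing left to reset
```

Running the report-only mode first gives a chance to review which monitors would be touched before anything is changed.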
ResetMonitorfromAlertId
Using the same principle, this tool takes an alert ID as its argument. If the alert was raised by a monitor and has since been closed, the tool checks the state of the related monitor and resets it if the monitor is not healthy.
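The decision sequence for a single alert ID can be sketched like this. The function name, the dictionary fields, and the returned messages are hypothetical; the real tool works against the OpsMgr SDK, which is not modeled here.

```python
def reset_monitor_from_alert_id(alerts_by_id, alert_id):
    """Apply the checks described above to one alert and report what was done."""
    alert = alerts_by_id.get(alert_id)
    if alert is None:
        return "alert not found"
    if not alert["is_monitor_alert"]:
        return "skipped: alert raised by a rule"
    if alert["resolution_state"] != "Closed":
        return "skipped: alert not closed"
    monitor = alert["monitor"]
    if monitor["health_state"] == "Healthy":
        return "skipped: monitor already healthy"
    monitor["health_state"] = "Healthy"  # stands in for the monitor reset
    return "monitor reset"


# Hypothetical sample data keyed by alert GUID
alerts_by_id = {
    "5fd07143-a3ac-4eb2-8897-b73b6a80fa6e": {
        "is_monitor_alert": True,
        "resolution_state": "Closed",
        "monitor": {"health_state": "Error"},
    }
}
print(reset_monitor_from_alert_id(alerts_by_id, "5fd07143-a3ac-4eb2-8897-b73b6a80fa6e"))
```

Calling the function a second time with the same ID would report that the monitor is already healthy, so repeated notifications for the same alert are harmless in this model.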
This tool can be launched automatically by creating a notification channel.
The details of this implementation are explained in the sections below.
How to implement the “ResetMonitorfromAlertId” tool
OpsMgr2007 SP1 - Create a notification to launch “ResetMonitorfromAlertId” tool when alert is closed.
Create a new Command notification Channel
Create a new subscriber
Create a new subscription
OpsMgr2007 R2 - Create a notification to launch “ResetMonitorfromAlertId” tool when alert is closed.
Create a new Command notification Channel
Create a new subscriber
Create a new subscription
Monitoring
Events created
Log Name: Operations Manager
Source: OpsMgr2007 ResetMonitorFromAlertId
Date: 13.07.2009 16:12:46
Event ID: 1000
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OM2007R2.dom02.com
Description: Start ResetMonitorFromAlertId
Event Xml:
<Event xmlns="https://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="OpsMgr2007 ResetMonitorFromAlertId" />
    <EventID Qualifiers="0">1000</EventID>
    <Level>4</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2009-07-13T14:12:46.000Z" />
    <EventRecordID>444968</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>OM2007R2.dom02.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Start ResetMonitorFromAlertId</Data>
  </EventData>
</Event>

Log Name: Operations Manager
Source: OpsMgr2007 ResetMonitorFromAlertId
Date: 13.07.2009 16:12:49
Event ID: 1000
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OM2007R2.dom02.com
Description: Manage Alert with GUID: 5fd07143-a3ac-4eb2-8897-b73b6a80fa6e
Event Xml:
<Event xmlns="https://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="OpsMgr2007 ResetMonitorFromAlertId" />
    <EventID Qualifiers="0">1000</EventID>
    <Level>4</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2009-07-13T14:12:49.000Z" />
    <EventRecordID>444970</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>OM2007R2.dom02.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Manage Alert with GUID: 5fd07143-a3ac-4eb2-8897-b73b6a80fa6e</Data>
  </EventData>
</Event>

Log Name: Operations Manager
Source: OpsMgr2007 ResetMonitorFromAlertId
Date: 13.07.2009 16:12:55
Event ID: 1010
Task Category: None
Level: Information
Keywords: Classic
User: N/A
Computer: OM2007R2.dom02.com
Description: Monitor resets by ResetMonitorfromAlertId MonitorDisplayName: SPEC - Monitor Object from Syslog Event (critical/information) AlertName: SPEC - Monitor Object from Syslog Event (critical/information) MonitoringObjectPath: OM2007R2.dom02.com
Event Xml:
<Event xmlns="https://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="OpsMgr2007 ResetMonitorFromAlertId" />
    <EventID Qualifiers="0">1010</EventID>
    <Level>4</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2009-07-13T14:12:55.000Z" />
    <EventRecordID>444971</EventRecordID>
    <Channel>Operations Manager</Channel>
    <Computer>OM2007R2.dom02.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Monitor resets by ResetMonitorfromAlertId MonitorDisplayName: SPEC - Monitor Object from Syslog Event (critical/information) AlertName: SPEC - Monitor Object from Syslog Event (critical/information) MonitoringObjectPath: OM2007R2.dom02.com</Data>
  </EventData>
</Event>
A rule can be created to collect the following event from RMS.
Target: Root Management Server
Event Log: Operations Manager
Source: OpsMgr2007 ResetMonitorFromAlertId
Event ID: 1010
Maximum number of asynchronous responses configuration on RMS Server
As described in the following blog article, OpsMgr2007 SP1 has a hardcoded limit of 5 asynchronous responses.
So if more than 5 alerts are closed at the same time, the following event should appear in the Operations Manager event log on the RMS server.
Alerts “Script or Executable was Dropped”
“The process could not be created because the maximum number of asynchronous responses (5) has been reached, and it will be dropped. Command executed: ………”
The following event should be monitored:
Event Log Windows (event collected by OpsMgr2007):
Event Log: Operations Manager
Event ID: 21410
Source: Health Service Modules
This limit has been removed in OpsMgr2007 R2, but for performance reasons a limit can still be set as follows.
This limit can be modified by changing the following registry key:
“HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Modules\Global\Command Executer”
· Create the keys: Global\Command Executer
· Create a DWORD value called “AsyncProcessLimit” and set it between 1 and 100.
Outside of this range, the value defaults back to 5.
This modification can affect RMS performance, so it is important not to increase this value too much, and to check performance after modifying it.
The value can be set to 20, and Event ID 21410 can then be monitored to see whether this is enough or whether the value should be increased.
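As an illustration, the registry change could be prepared as a .reg file. The sketch below only builds the file text and does not touch the registry. The function name is hypothetical, the key path is the one given above with HKLM written out as HKEY_LOCAL_MACHINE, and the generated text should be reviewed before it is imported on an RMS.

```python
KEY_PATH = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager"
    r"\3.0\Modules\Global\Command Executer"
)


def async_limit_reg_file(limit):
    """Build .reg file text that sets AsyncProcessLimit to `limit` (1 to 100)."""
    if not 1 <= limit <= 100:
        # the text states that values outside this range fall back to 5
        raise ValueError("AsyncProcessLimit must be between 1 and 100")
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{KEY_PATH}]\n"
        f'"AsyncProcessLimit"=dword:{limit:08x}\n'  # DWORD as 8 hex digits
    )


print(async_limit_reg_file(20))
```

Rejecting out-of-range values up front avoids writing a setting that would silently revert to the default of 5.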