Understanding Alert Correlation
Topic Last Modified: 2012-11-26
The Correlation Engine is at the core of the Microsoft Exchange Server 2010 Monitoring Management Pack. The Correlation Engine was developed to significantly reduce the number of alerts that are raised by the Management Pack.
In the Exchange 2007 Management Pack, alerts were always raised when the state of a monitor changed from green to red. This type of alerting is turned off in the Exchange Server 2010 Management Pack. Instead, the Correlation Engine handles alerting. It processes the data from the Management Pack monitors and then determines whether to raise an alert. The Correlation Engine helps the administrator who is monitoring the Exchange environment to focus only on alerts that may require an action.
Architecture
The Correlation Engine is a stand-alone Windows service that uses the Operations Manager SDK interface to first retrieve the health model (or instance space) and then process state change events. By maintaining the health model in memory, and processing state change events, the Correlation Engine is able to determine when to raise an alert based on the state of the system.
This diagram shows that several monitors change state in response to a problem, and the corresponding state change events are forwarded by the agent to the Root Management Server (RMS). Once received by the RMS, these events are processed by the Correlation Engine, which may raise an alert via the RMS Software Development Kit (SDK) interface. This alert then becomes visible on the Operations Manager Console.
Alert Classification
Exchange Server 2010 Monitoring Management Pack alerts are classified into one of three categories. Use the following guidelines to understand these alert classifications.
Key Health Indicator (KHI): KHIs are issues that affect the health of the service. Most alerts fall into this category (for example, "A mailbox database is dismounted").
Non-Service Impacting (NSI): NSI monitors detect problems that may affect some users, but not every user of the system. A good example of an NSI situation is two users with the same proxy address – mail to this address will be returned as non-deliverable, but the overall transport system is not otherwise impaired.
Forensic: Forensic monitors record information that may be relevant while troubleshooting an issue, but that isn’t necessarily indicative of an imminent or existing system failure. "CPU activity >90% for 5 minutes" is an example of a forensic issue – there may be a process inappropriately consuming CPU cycles, or the server may have been rebooted and is catching up on normal system activity. These monitors are visible in the Alert Context field of the alert properties and in Health Explorer. Alerts are not raised for Forensic monitors.
Note
State is not updated when a single forensic monitor alert is raised. However, state may be updated based on the aggregation of current forensic monitor alerts for each component.
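The three classifications lead to three different treatments, which the following minimal Python sketch summarizes. It is an illustration only: the `AlertClass` enum and the `handling_for` function are hypothetical names, not part of the Management Pack or the Operations Manager SDK.

```python
from enum import Enum


class AlertClass(Enum):
    """Hypothetical labels for the three classifications described above."""
    KHI = "Key Health Indicator"
    NSI = "Non-Service Impacting"
    FORENSIC = "Forensic"


def handling_for(alert_class: AlertClass) -> str:
    """Return a short description of how a red monitor of this class is treated."""
    if alert_class is AlertClass.NSI:
        return "raise an alert directly"
    if alert_class is AlertClass.KHI:
        return "correlate into a chain, then alert on the root cause"
    # Forensic monitors never raise alerts on their own.
    return "no alert; shown in Alert Context and Health Explorer"
```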
Alert Severity
Exchange Server 2010 Monitoring Management Pack alerts are also classified by the severity of the alerts, as follows:
Error alerts: Error alerts indicate a serious problem that requires immediate attention.
Warning alerts: Warning alerts indicate a condition that might cause future problems.
Informational alerts: Informational alerts are not raised by the Exchange 2010 Management Pack.
Correlation Factors
The actions taken by the Correlation Engine are based on several factors, including the following:
Monitor state change events: Monitors collect diagnostic information from the Exchange environment from sources such as event log messages, performance counter thresholds, and PowerShell task output events. Monitors register state change events when they detect that a problem has occurred or cleared (that is, changed from red to green or green to red). Monitors also register state changes when an Exchange server can’t be contacted or when an Exchange server becomes available. Finally, monitors register state changes when an Exchange server is placed in maintenance mode or removed from maintenance mode. In the Exchange 2007 Management Pack, alerts were raised when the state of a monitor changed from green to red. In the Exchange 2010 Management Pack, alerts aren’t automatically raised by monitor state changes. The Correlation Engine determines whether to raise an alert. The Exchange 2010 Management Pack includes an alert rule for every monitor. This allows monitoring personnel to use the Operations console to access the properties of every monitor in the Management Pack. They can enter company-specific notes for a given monitor in the Company Knowledge field even when the monitor doesn’t generate alerts on its own.
Health Model: The class hierarchy imported into Operations Manager by the Exchange 2010 Management Pack includes class relationships that define component dependencies throughout the system. Defining these dependencies helps the Exchange 2010 Management Pack to understand the health of the Exchange organization. For example, if the Exchange 2010 Management Pack identifies Active Directory as offline, it will also report that Exchange messaging isn’t fully functional.
Timing: The Correlation Engine works in 90-second intervals. When state change events for multiple monitors occur at the same time, the Correlation Engine waits to determine whether anything else potentially related to the failure is detected, which lets the Correlation Engine make the most effective determination of the root cause.
Correlation Algorithm
Overview of the Correlation Engine process
1. The Correlation Engine connects to the Operations Manager SDK service to download the health model hierarchy and instance state. This occurs only on service startup, or as needed if errors require it.
2. The Correlation Engine queries Operations Manager for the latest state change events that are related to entities in the Exchange Management Pack.
3. If new non-service impacting state changes are detected, the Correlation Engine raises alerts for them.
4. The Correlation Engine isolates the data for all key health indicator monitors that are in the red state. The Correlation Engine arranges that data into logical groupings that show each process in relation to the ones it depends on and the ones that depend on it. These groupings are commonly referred to as “key health indicator chains”. Each chain indicates where a dependency has failed and is affecting one or more dependent processes.
5. The Correlation Engine raises an alert for each key health indicator chain. Each alert that the Correlation Engine raises identifies the root cause of the problem.
6. The Correlation Engine waits 90 seconds, and then starts over at Step 2.
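The process above can be illustrated with a minimal Python sketch of a single correlation pass. This is not the engine's actual implementation: the `Monitor` class, the `correlate` function, and the sample monitor names are all hypothetical, the health model is reduced to a simple "depends on" list, and the sketch is stateless, so it doesn't distinguish new state changes from ones seen in an earlier pass.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    name: str
    kind: str                                        # "KHI", "NSI", or "Forensic"
    red: bool = False
    depends_on: list = field(default_factory=list)   # names this monitor depends on


def correlate(monitors):
    """Run one pass and return (kind, root_cause, chain) tuples, one per alert."""
    by_name = {m.name: m for m in monitors}
    red_khi = sorted(m.name for m in monitors if m.red and m.kind == "KHI")

    # Red NSI monitors are alerted on directly; Forensic monitors never alert.
    alerts = [("NSI", m.name, [m.name])
              for m in monitors if m.red and m.kind == "NSI"]

    # A root cause is a red KHI monitor with no red KHI dependency of its own.
    roots = [n for n in red_khi
             if not any(d in red_khi for d in by_name[n].depends_on)]

    for root in roots:
        chain, frontier = [root], [root]
        while frontier:
            current = frontier.pop()
            # Pull in every red KHI monitor that depends on the current one.
            for name in red_khi:
                if current in by_name[name].depends_on and name not in chain:
                    chain.append(name)
                    frontier.append(name)
        alerts.append(("KHI", root, chain))
    return alerts


# Hypothetical instance space: Active Directory is down and takes two
# dependent services with it, while an unrelated NSI problem is also present.
sample = [
    Monitor("ActiveDirectory", "KHI", red=True),
    Monitor("Transport", "KHI", red=True, depends_on=["ActiveDirectory"]),
    Monitor("Mailbox", "KHI", red=True, depends_on=["Transport"]),
    Monitor("DuplicateProxyAddress", "NSI", red=True),
    Monitor("HighCpu", "Forensic", red=True),
]
alerts = correlate(sample)
```

In this example, the pass yields two alerts: one for the NSI monitor and one for the whole key health indicator chain, raised against `ActiveDirectory` as the root cause. A running service would repeat such a pass every 90 seconds.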
Additional information about the Correlation Engine process
If the chain of key health indicators includes both error and warning monitors, the alert is raised as an error, regardless of the class of the root cause monitor. For example, if a top-level process defines an error monitor to catch failure cases, and that monitor is correlated to a warning monitor in a dependency, the alert is raised against the dependency but is marked as an error instead of a warning.
Not every class relationship is used for alert correlation. See the Appendix: Class Hierarchy later in this guide for the specific relationships used by the Correlation Engine.
The key health indicator chain, including any forensic monitors, is included in the Alert Context field that appears in the properties of the final alert. This allows the administrator to review the monitors that correlate with the given alert. Alerts that are raised from dependency monitors must be reviewed to determine the specific failure referenced by the alert.
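The severity rule and the Alert Context behavior described above can be sketched in Python as follows. The `chain_alert` function and the dictionary keys are hypothetical, chosen only for illustration; the sketch assumes the chain is ordered from the root cause to the top-level dependent.

```python
def chain_alert(chain):
    """Build an alert summary for one key health indicator chain.

    `chain` is a list of (monitor_name, severity) pairs, ordered from the
    root cause to the top-level dependent. Severity is "Error" or "Warning".
    """
    root_name, _ = chain[0]
    # Any error monitor in the chain promotes the whole alert to an error.
    severity = "Error" if any(s == "Error" for _, s in chain) else "Warning"
    return {
        "raised_against": root_name,
        "severity": severity,
        "alert_context": [name for name, _ in chain],
    }


# A warning monitor in a dependency correlated with an error monitor above it.
example = chain_alert([("Dependency", "Warning"), ("TopLevelProcess", "Error")])
```

Here the alert is raised against `Dependency`, the root cause, but carries the `Error` severity contributed by `TopLevelProcess`.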
What is and isn’t affected by alert correlation
It’s important to understand what the Correlation Engine affects and what it doesn’t affect.
The following functionality is different in the Exchange 2010 Management Pack due to the addition of the Correlation Engine:
Monitors don’t alert automatically when state change events occur. This lets the Correlation Engine determine the best alert to raise.
The Exchange 2010 Management Pack doesn't raise alerts that correspond to the health of your Exchange environment when the Correlation Engine is stopped. If the Correlation Engine is stopped, a general alert is raised to notify you that the Correlation Engine isn’t running.
The following functionality isn’t changed by the addition of the Correlation Engine:
Overrides still work as expected. You can change certain values or disable monitors just as you do today.
Monitors and objects in maintenance mode are skipped by the Correlation Engine. No special consideration is required because the monitors don’t raise state change events.
Other management packs aren’t affected by the presence of the Correlation Engine.
Operational notes
The Correlation Engine must maintain the instance space of the management group in memory to determine related monitors and alerts. Thus, the more Exchange servers and databases you have, the more memory the Correlation Engine will require.
The Correlation Engine requires approximately 5 megabytes of memory per monitored Exchange server. There are factors that can cause this number to go up or down, but this is a good baseline for understanding the resource impact on the server that’s hosting the service.
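As a rough planning aid, the baseline above can be turned into a simple estimate. The following Python sketch is an approximation only; the function name is hypothetical, and actual memory use varies with factors such as the number of databases.

```python
MB_PER_SERVER = 5  # baseline from this guide; actual usage can be higher or lower


def estimated_memory_mb(monitored_servers: int) -> int:
    """Return the baseline memory estimate, in megabytes, for the Correlation Engine."""
    if monitored_servers < 0:
        raise ValueError("monitored_servers must not be negative")
    return monitored_servers * MB_PER_SERVER
```

For example, `estimated_memory_mb(200)` gives a baseline of 1000 megabytes for an organization with 200 monitored Exchange servers.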
Automatic Reset of Event Monitors in the Exchange 2010 Management Pack
In the Exchange 2010 Management Pack, most event monitors are automatically reset by the Correlation Engine. Automatic reset was added to those monitors so that issues aren't missed the next time they occur. The following table lists the event monitors that aren’t reset automatically.
Monitor Name |
---|
An error occurred while the journaling agent was loading configuration information. |
A failure is causing a message to remain in a delivery queue. |
Your Autodiscover service configuration isn't secure. To fix this problem, disable anonymous access on the Autodiscover virtual directory. |
Exchange couldn’t create the log file directory. Log files won't be generated until the reason for the failure is corrected. The source component and cause of the error are specified in the event description. |
Exchange couldn’t create a new log file. Log files won't be generated until the reason for the failure is corrected. The source component and cause of the error are specified in the event description. |
Read-only files have been found in the Pickup directory. |
The Microsoft Exchange Transport service has detected a critical storage error and has taken an automated recovery action by moving the database. |
File Distribution Service: Failed to read the security descriptor from Active Directory for the offline address book. |
ExBPA warning. |
ExBPA error. |
Unable to move mailbox. |
DsProxy DLL is required but can’t be loaded. |
Performance counters for NSPI Proxy couldn’t be initialized. |
The index is corrupted on the local database copy. Please reseed the catalog by using the Update-MailboxDatabaseCopy cmdlet with the -CatalogOnly parameter. |
Unable to load the performance counters for the Microsoft Exchange Mail Submission service. The related performance object is named MSExchangeMail Submission. |
The local topology server doesn’t belong to any Active Directory site. |
The Microsoft Mail Submission Service encountered an exception when trying to load network topology information. |
Exchange Topology discovery couldn't find the local Exchange server in Active Directory. |
A failure is causing a message to remain in the Submission queue. |
A database copy encountered a serious lost flush error that may have affected all copies of the database. |
An active database copy encountered a serious lost flush error that may have affected all copies of the database. |
A local database copy encountered a serious lost flush error that may have affected all copies of the database. |
The database engine has consumed 99% of the "b-trees" resource (87048 used out of a maximum of 87696) for the database. |
A database copy's incremental reseed files failed to be removed. |
Failed to remove continuous replication files for a database copy. |
The single page restore process started correcting an error in a database copy. |
The single page restore process successfully corrected an error in a database copy. |
Failed to remove a log file for database. Either the file is in use or the service has insufficient permissions. |
The correlation interval value specified is less than the minimum allowed value. |
The specified correlation time window value is less than the minimum allowed value. |