SCOM Event Remediation and Enrichment

05/05/2010

Caution

Test the script(s), processes and/or data file(s) thoroughly in a test environment, and customize them to meet the requirements of your organization before attempting to use them in a production capacity. (See the legal notice here)

Note: The workflow sample mentioned in this article can be downloaded from the Opalis project on CodePlex: https://opalis.codeplex.com

Overview

The SCOM Event Remediation and Enrichment workflow is a very simple sample that shows how one could orchestrate a repair process for an event detected in Microsoft System Center Operations Manager. It is very similar to the “SCOM Event Remediation” sample workflow, except that in addition to attempting to remediate an incident (in this case, “low disk space”), it annotates the alert with supplemental information. Data is recorded both before and after remediation to provide line-of-sight into the repair process. The use case the workflow is built around is straightforward:

Watch for an alert in Operations Manager that indicates a low disk space condition on a storage device.
Verify that the disk space is actually low. If so, update the alert with the current storage level.
Compress TMP files and remove DMP files in an attempt to recover disk space.
After trying to recover some space, update the alert once again with the new storage levels.
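
The verification and annotation steps above can be sketched in a few lines of Python. This is only an illustration, not the Opalis workflow itself: the names DiskReading, read_disk, is_low and annotation are hypothetical, and the 10% threshold is an assumed value that would need tuning for a given environment.

```python
import shutil
from typing import NamedTuple


class DiskReading(NamedTuple):
    """A point-in-time storage measurement (hypothetical helper type)."""
    total_bytes: int
    free_bytes: int

    @property
    def free_percent(self) -> float:
        # Guard against a zero-sized volume to avoid division by zero.
        if not self.total_bytes:
            return 0.0
        return 100.0 * self.free_bytes / self.total_bytes


def read_disk(path: str) -> DiskReading:
    """Measure the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return DiskReading(total_bytes=usage.total, free_bytes=usage.free)


def is_low(reading: DiskReading, threshold_percent: float = 10.0) -> bool:
    """Confirm the low-disk condition. The 10% default is an assumption."""
    return reading.free_percent < threshold_percent


def annotation(stage: str, reading: DiskReading) -> str:
    """Build the text that would be written to the alert for a given stage."""
    free_gb = reading.free_bytes / 1024 ** 3
    return f"[{stage}] free space: {free_gb:.2f} GB ({reading.free_percent:.1f}%)"
```

Calling annotation("pre-remediation", read_disk("C:\\")) before the cleanup and again with "post-remediation" afterwards would produce the two alert updates the use case describes.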

The sample highlights a few key features associated with Orchestration of such a process:

The workflow is a classic example of a “Run Book Automation” in that it takes operations procedures that would normally drive the behaviors of human beings and replaces this work with automation and integration.
The workflow shows how a remediation process can interact with Operations Manager to provide line-of-sight remediation. It updates Operations Manager so that people looking at the Operations Manager console can recognize that Opalis has initiated a remediation process and allow that process to complete before taking additional action.
Verification of the alert is a key first step in the remediation process since it guarantees that the remediation is acting on a valid condition before it initiates.
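
The ordering these points describe (verify first, then annotate, remediate, and annotate again) could be expressed as a small skeleton. This is a sketch under stated assumptions: run_remediation and its four callables are hypothetical names, and the real workflow performs these steps with Opalis activities rather than Python functions.

```python
from typing import Callable, List


def run_remediation(
    verify: Callable[[], bool],
    measure: Callable[[], str],
    remediate: Callable[[], None],
    annotate: Callable[[str], None],
) -> bool:
    """Run the repair only if the alert condition is confirmed.

    Returns True if remediation ran, False if verification failed.
    All four callables are hypothetical stand-ins for workflow activities.
    """
    if not verify():
        # The condition could not be confirmed, so nothing is changed.
        return False
    # Tell console users that an automated repair is in progress.
    annotate(f"Remediation started. Before: {measure()}")
    remediate()
    # Record the outcome so an operator can decide whether to close the alert.
    annotate(f"Remediation finished. After: {measure()}")
    return True


if __name__ == "__main__":
    notes: List[str] = []
    levels = iter(["4% free", "12% free"])  # made-up readings for the demo
    run_remediation(lambda: True, lambda: next(levels), lambda: None, notes.append)
    for note in notes:
        print(note)
```

Passing the steps in as callables keeps the ordering logic separate from how the alert is read or updated, which is an illustrative design choice rather than something the sample prescribes.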

Workflow Walk-Through

The workflow itself is very simple and, with a moderate amount of tweaking, should work in most environments. Some key things to note in the workflow itself:

Monitor SCOM for an alert that indicates a low disk condition on a storage device. The alert filter would no doubt require tuning to meet the specific needs of a given SCOM implementation.
Verify the disk space is actually low. Verification is a key part of a remediation process since it validates that the steps that follow are in fact acting on a condition that needs repair.
Perform the remediation actions (compress TMP files, delete DMP files, etc.).
Check the storage levels again. This workflow could be made to self-resolve if the condition has been repaired. As it has been orchestrated, however, the alert is only updated with information post-remediation. It is not unusual for a process to require human interaction for it to be considered resolved.
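
The remediation actions in the walk-through could look roughly like the following Python sketch. It rests on several assumptions: gzip stands in for whatever compression the sample actually uses, only the top level of one directory is scanned, and the function names compress_tmp_files and delete_dmp_files are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def compress_tmp_files(directory: Path) -> int:
    """Gzip each *.tmp file in `directory` (non-recursive) and remove the original.

    Returns the number of files compressed. Pattern matching is
    case-sensitive on most non-Windows platforms.
    """
    count = 0
    # sorted() materializes the listing before any file is modified.
    for tmp in sorted(directory.glob("*.tmp")):
        if not tmp.is_file():
            continue
        target = tmp.with_name(tmp.name + ".gz")
        with open(tmp, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        tmp.unlink()  # drop the uncompressed copy only after the archive is written
        count += 1
    return count


def delete_dmp_files(directory: Path) -> int:
    """Delete each *.dmp file in `directory` (non-recursive); return the count."""
    count = 0
    for dmp in sorted(directory.glob("*.dmp")):
        if dmp.is_file():
            dmp.unlink()
            count += 1
    return count
```

Measuring free space immediately before and after these two calls gives the pre- and post-remediation figures that the workflow writes back to the alert.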
