The Anatomy of a Good SCOM Alert Management Process – Part 1: Why is alert management necessary?
Disclaimer: Due to changes in the MSFT corporate blogging policy, I’m moving all of my content to the following location. Please reference all future content from that location. Thanks.
I’ve had the luxury of doing SCOM work for several years now, across many different client types and infrastructure, and one of the constants that I see in most every environment that I’ve worked in is a lack of planning surrounding how SCOM will be used in their environment. There are a number of reasons for why this is, ranging from a lack of understanding about how the tool works, to internal political issues, administrative problems such as lazy or secretive administrators, to many other things as well. The solutions to these problems are not always easy and require a bit more than just tossing in a piece of technology so that we can check a box and say “we have monitoring.’' The reality though, is that this is precisely how many SCOM environments are designed.
We at Microsoft have often been very good at solving technical problems, and the SCOM community as a whole has a number of fantastic blogs ranging from frequent bloggers such as Kevin Holman, Stefan Stranger, and Marnix Wolf to folks such as myself who are far less intelligent and do far less blogging. In all, if you need to solve a technical problem in SCOM, it’s not hard to do it as someone has already done it. Unfortunately, I think this is where we can often fall short, as we leave it up to our customers to use our products, and they can have some very interested ways of using them. SCOM is no different. The biggest problem with SCOM that I see is that organizations never address the people or processes surrounding the technology that they purchased.
First, let’s start with the obvious. SCOM is very good at telling users that something is wrong. It’s not hard to spin up, and after tossing in a few management packs, you will quickly start seeing alerts ranging from simple noise to real problems. SCOM engineers quickly realize that there are lots of really cool Microsoft and Non-Microsoft MPs out there along with some really good ideas (and bad ones). It does not take long for a customer to deploy a bunch of agents, import some management packs, and next thing they know, their alerts screen is full of red and yellow alerts indicating that something may be wrong.
What I find from here though, is that the alert management process pretty much stops at this point. Yes, there are organizations that truly do try to have an end to end alert lifecycle, but in just about every organization that I’ve visited, they are stuck at this point even while thinking they aren’t. These orgs have a SCOM administrator, who often wears multiple hats, and maybe perhaps a tier 1 staff watching the active alerts to some extent. Usually, tier 2 and 3 are completely disengaged, never touching the SCOM console or perhaps going so far as actively resisting monitoring attempts. In an attempt to bring monitoring issues to light, orgs decide to send emails or generate tickets on alert generation. Generating tickets usually frustrates the help desk, as SCOM can quite literally generate thousands of alerts a day, which essentially also makes SCOM our own private spam server. Administrators create rules and put SCOM alerts in folders, and in the end, nothing gets changed while the same alert generates tens, hundreds, or even thousands of emails that go unanswered, and the problem is never actually solved.
The problem is never solved because of a fundamental lack of understanding of what needs to be done and why. There are a few reasons why alert management is necessary:
There are technology issues with the product which require it. State changes are not groomed from the database while the state of the object associated is unhealthy. A failure to manage alerts and fix issues associated with them leaves data in the database beyond its grooming requirement. Likewise, state and performance data can be stored in the DataWarehouse for a long period of time. Failure to manage alerts can lead to a very large DW, often times containing lots of data that the customer could care less about and eventually leading to performance issues if it is not being managed.
All environments are different. This should go without saying, but it means that it is impossible for SCOM to meet the exact needs of your organization OUT OF THE BOX. Thresholds for alerts in one organization may be too high, while in others too low. In some orgs, the monitor or rule is not applicable, and in some cases, what they really want to monitor is turned off by default. As such, the SCOM administrators primary job is to tune alerts.
Tuning, while seeming like a simple job, requires teamwork. While it would be nice if your SCOM administrator is a technology guru, the reality is that this engineer likely knows bits and pieces about AD, Platforms, Clustering, Skype, DNS, IIS, SharePoint, Azure, Exchange, PKI, Cisco, SAN, and whatever else you happen to have in your environment. He or she will likely not know these products in detail and as such relies on tier 1 and 2 to investigate issues as well as tier 3’s input as problems are uncovered. That problem is further complicated by processes which need to change as actions by other IT administrators/engineers can lead to additional alerts for reasons as simple as not putting an object into maintenance mode before rebooting it.
As such, an alert management lifecycle is necessary to handle the end to end life of an alert, whether that be creation to resolution in the event of real problems or the tuning of alerts to reduce noise.