Custom APM Rules for Granular Alerting
Michael wrote a post in August about working with Alerts. One of the key takeaways from that post is that, for each application and for each application component, we have four Alerting rules that can be turned on or off from the APM Template. Quoting that post:
Alerting Rules
There is a rule for each type of event we alert on: Performance, Connectivity, Security and Application Failure. We raise an individual alert when those types of events are detected in the monitored application. These alerts do not affect the health state of the monitored application since a single performance or exception event doesn’t mean your application is unhealthy.
The post above was written during Beta, and the UI has improved since then, but the rules are still there and work in the same way that was described.
In the RC build the options in the Template UI look like this:
While the Authoring Guide describes what the options are used for, here I want to show you a bit more of what settings they drive “under the hood”.
The two checkboxes on the top turn ON and OFF:
- Alerting for “Performance” events (“Turn on performance event alerts”)
- Alerting for “Exception” events (“Turn on exception event alerts”) – when this checkbox is enabled, it allows you to configure three more options (a breakdown of the type of exceptions):
  - Security alerts
  - Connectivity alerts
  - Application failure alerts
Together, these checkboxes essentially enable and disable the previously-mentioned Alerting rules. You can find them in the Authoring pane of the Operations Console, under “Rules”:
The names should be self-explanatory enough to tell which rule maps to which option.
While this mechanism is flexible enough for the most common usage, I want to show you how the whole thing works end to end, and how, with a little XML editing, you can configure even more granular alerts with APM than the UI allows.
If you look at those rules (in the Operations Console, or by unsealing the MP and reading its XML), they are all very similar: each has a Data Source that looks for incoming APM events, and a Write Action that turns them into Alerts.
The Data Source has a configuration as follows:
As you can see in the screenshot above, the same data source is used for all four rules, and the “AspectType” is used to tell apart Performance, Connectivity, Security and Application failure events.
This is great for most situations. Our default settings were chosen on the assumption that Operations folks would be more interested in Performance, Connectivity and Security events (those they might be able to act on) than in “Application failure” events, since those are often caused by a bug in the code that typically only a developer can fix.
Even if this model is great, I found that, in some situations, people might want even more fine-grained alerting rules. In particular, I think the connectivity and security aspects are quite well-defined in our APM default configuration, and they are typically not noisy unless something is really wrong. The same is not necessarily true for Performance events and Application failures. For example, you might want to get:
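To make the "one data source, four rules" idea concrete, here is a minimal Python sketch. It is only a model of the logic, not the Management Pack XML: the `AlertRule` class, the `aspect_type` key and the aspect strings are hypothetical names chosen for illustration, and the enabled/disabled flags simply mirror the defaults described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical model of one built-in APM alerting rule."""
    name: str
    aspect_type: str  # the one setting that tells the four rules apart
    enabled: bool     # what the template checkboxes switch on and off

    def matches(self, event: dict) -> bool:
        # Every rule reads the same event stream and keeps only its own aspect.
        return self.enabled and event.get("aspect_type") == self.aspect_type


# Assumed defaults: application failure alerting is the one left off.
RULES = [
    AlertRule("Performance alert", "Performance", True),
    AlertRule("Connectivity alert", "Connectivity", True),
    AlertRule("Security alert", "Security", True),
    AlertRule("Application failure alert", "ApplicationFailure", False),
]


def alerts_for(event: dict, rules=RULES) -> list:
    """Return the names of the rules that would raise an alert for this event."""
    return [rule.name for rule in rules if rule.matches(event)]
```

Turning a checkbox on or off in the template corresponds to flipping `enabled` for the matching rule in this model.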
1. Performance event alerts only for a specific web page or method (this can also be achieved by defining a transaction, but depending on the situation one approach or the other might be preferred; I’ll explain transactions in a future post)
2. Performance event alerts for all cases except a particular page/method which is “well known” to be slow but can’t be fixed/optimized (this is something that cannot be achieved even with a transaction)
3. Application failure event alerts only for certain types of exceptions and not for others
4. Application failure event alerts for all exceptions except those from a particular page which is known to throw an unhandled exception but doesn’t cause a bad user experience or can’t be fixed
5. Application failure event alerts for all exceptions except a particular exception type which can’t be fixed by the developer
   - one specific situation where #5 is desirable: requesting a page that is not present in an ASP.NET application throws a “System.Web.HttpException” with an HTTP 404 (not found) error code. This is by design in ASP.NET: if I call an .aspx page, the ASP.NET engine tries to retrieve it and throws an HTTP error when the page is missing. This can cause a lot of noise when a crawler or vulnerability assessment tool hits the site searching for “well-known” but not-present pages (this is actually something that we observed on the production deployment monitoring parts of the microsoft.com website)
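Each scenario in the list above boils down to "take the events of one aspect, then include or exclude some of them". As a rough sketch of that shape in Python (not MP XML), assuming hypothetical event keys `aspect_type`, `page` and `exception_class`, and a made-up page name:

```python
def make_filter(*, aspect_type, include=None, exclude=None):
    """Build a predicate deciding whether an event should raise an alert.

    include/exclude are optional callables taking the event dict.
    """
    def accept(event: dict) -> bool:
        if event.get("aspect_type") != aspect_type:
            return False
        if include is not None and not include(event):
            return False
        if exclude is not None and exclude(event):
            return False
        return True
    return accept


# Performance alerts for everything except one page known to be slow
# (the page path is invented for the example).
perf_except_slow_page = make_filter(
    aspect_type="Performance",
    exclude=lambda e: e.get("page") == "/reports/yearly.aspx",
)

# Application failure alerts only for selected exception types.
only_sql_failures = make_filter(
    aspect_type="ApplicationFailure",
    include=lambda e: e.get("exception_class") == "System.Data.SqlClient.SqlException",
)
```

The "only for" scenarios use `include`, the "all but" scenarios use `exclude`; nothing else changes between them.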
For all these situations (and more) there is a fairly simple solution: writing new APM Alerting rules with an added Expression Filter. Basically we’ll have a workflow which looks like the following:
One such sample rule is pasted below. It looks very similar to (and in fact, it is derived from) the “default” APM alerting rules described earlier – only the Condition Detection highlighted has been added. This one rule represents example #5 from the list above – essentially, it should filter out those “page does not exist” 404 errors, but still alert on every other exception.
This rule still produces alerts for other exceptions that “look and feel” pretty much like those from the built-in rules, but will not raise an alert for those HTTP 404 “file does not exist” errors. Be aware, though, that the example above will not work on localized versions of the .NET Framework/Windows, because I am searching for an English string (“does not exist”) in the error message. This is not really meant as a production-quality MP covering all cases, just as a quick example of how you can build your own workflows by adding filtering criteria. My goal is mostly to help you understand how the APM pieces fit together in Operations Manager 2012 so that, with that knowledge, you can get creative and adapt it to your needs.
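The behaviour of that string-based filter, including its localization weakness, can be modelled in a few lines of Python. This is a sketch of the logic only: the `aspect_type` and `message` keys are hypothetical, and the non-English message is invented for the example rather than taken from a real localized framework.

```python
def alert_unless_missing_page(event: dict) -> bool:
    """Alert on application failures unless the message says 'does not exist'."""
    if event.get("aspect_type") != "ApplicationFailure":
        return False
    message = event.get("message") or ""
    return "does not exist" not in message


english_404 = {
    "aspect_type": "ApplicationFailure",
    "message": "The file '/old.aspx' does not exist.",
}
# Invented non-English text standing in for a localized error message.
localized_404 = {
    "aspect_type": "ApplicationFailure",
    "message": "File '/old.aspx' non trovato.",
}

print(alert_unless_missing_page(english_404))    # filtered out, no alert
print(alert_unless_missing_page(localized_404))  # slips through and alerts
```

The second call shows the caveat: the same 404 still raises an alert once the message text is no longer in English.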
Anyway, an alternative would be digging out the actual HTTP Error code, which is buried down in the DataItem as well. To do so, we can rewrite our Expression Filter as follows:
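To illustrate the idea of this status-code variant outside of Management Pack XML, here is a minimal Python sketch. The nested layout (`exception` → `data` → `http_code`) is purely an assumption standing in for "buried down in the DataItem"; the real property path has to be taken from the rule in the attached MP.

```python
def http_status(event: dict):
    """Dig the HTTP status code out of a nested, hypothetical event layout.

    Returns None when the code is absent or not a number.
    """
    try:
        return int(event["exception"]["data"]["http_code"])
    except (KeyError, TypeError, ValueError):
        return None


def alert_unless_404(event: dict) -> bool:
    """Alert on application failures unless the HTTP status code is 404."""
    if event.get("aspect_type") != "ApplicationFailure":
        return False
    return http_status(event) != 404
```

Because the comparison is on the numeric code, the language of the error message no longer matters. Events that carry no status code at all still raise an alert, which is the conservative choice for a filter meant to suppress only one known case.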
Since the code above is not easily readable due to the blog layout (although copying and pasting it should work just fine), a Management Pack with both variations of this rule is attached at the end of this post. It also contains two more (fairly similar) examples, for a total of three rules based on “Application Failure” events and one on “Performance” events. All the rules are disabled by default, so that duplicate alerts don’t start appearing in your environment as soon as you import the MP. If you use these rules, you might want to first clear the checkboxes in the template for the “built-in” rules. You can then decide to turn these new rules on by default, or selectively through overrides, as you would do with any other rule.
Please also note that some of the criteria contained in the condition detection filters cannot be edited from the Operations Console, and token replacement in alert messages will also most likely break if edited through the GUI. These things are best edited in XML.
In addition, these criteria might need to be revisited fairly often as part of your tuning: known problems could become worth considering again, new issues might appear that you want to start filtering out, and so on.
These rules with their filters will only affect Alerting; all APM events will still be collected in the database and be visible thru the AppDiagnostics console – as the actual event insertion is driven by a different rule (using the same data source module, but a different write action).
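The split between collection and alerting can be pictured as two consumers of the same event stream. In this small Python model, `process` and `alert_filter` are hypothetical names; the function only shows that filtering the alerting side leaves the stored events untouched.

```python
def process(events, alert_filter):
    """Model two rules sharing one data source: collect all, alert on some."""
    stored, alerted = [], []
    for event in events:
        stored.append(event)       # collection rule: every event is kept
        if alert_filter(event):    # alerting rule: only events passing the filter
            alerted.append(event)
    return stored, alerted
```

Whatever the filter rejects is still in `stored`, which is why filtered-out events remain visible in AppDiagnostics.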
Once events are stored and visible in AppDiagnostics, we also provide ways to automatically delete them from the database, or mark as “by design” those events that aren’t considered useful or interesting, or appear to add noise. This is the Problem Management feature in AppDiagnostics (Rules Management Wizard) which, while it doesn’t prevent events from being stored or alerts from being raised in the first place, helps keep your database “clean”; I like to consider it a sort of “intelligent” grooming. There would be a lot to say about the Problem Management feature, and I’ll try to come back to it and its rules in a future post.
Happy .Net Monitoring!
Disclaimer
This posting is provided "AS IS" with no warranties, and confers no rights. Use of included utilities are subject to the terms specified at https://www.microsoft.com/info/copyright.htm.