Top 10 Topics for MSCOM Ops…Monitoring and Reporting (oh yeah!)
This came in as the second most popular topic that you are interested in. Huh, imagine that. You actually have to report out to somebody on what the heck the environment that you manage is doing. So do we. Daily, weekly, monthly, quarterly, yearly. Oh yeah, we feel the same pain that you do.
We took this topic and shot it to our Tools Team. Yep, we do have the luxury of having a team of crack developers, most of whom used to be working, fully functional Systems Engineers. The real-world experience that these folks have gives them a unique perspective on how to create Operations-specific tools that begin to address the thorny problem space around Monitoring and Reporting. Since these folks are charged with (among other things) automating the collection and reporting of the monitoring that we do, we thought they should get first crack at the answer. Here is their response:
As you might imagine, we collect a ton of data. To put it in real terms, on an average day we collect over 60,000 event log events and 185,000,000 performance counters, perform over 11,500,000 availability tests, parse 1.7 TB of IIS logs, collect asset and configuration information on 2200 servers and gather database statistics on over 2000 databases. While that’s an impressive amount of data, hopefully you’re asking one very important question: “What the heck do you do with all that stuff??” I’ll answer that two ways.
On one hand, we do a lot with the data. In addition to using the constant stream of detail data as it comes in for real-time monitoring, we aggregate the data from all of these data sources into a large data warehouse nightly. From this, we provide daily availability reports (both internal and external availability), asset management and performance trend reports, and application event-level reporting. By taking nightly snapshots of the relationships of servers to clusters to sites, etc., we can take this data and answer questions like “How many servers were associated with x application on y date and what were they, how were they performing, what events and errors were they experiencing and what was the overall availability?” Likewise, we can do this over time, all the while maintaining the date-specific contexts.
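To make the nightly-snapshot idea a bit more concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is only an illustration of the general technique (date-stamped relationship rows joined to date-stamped test results), not the actual warehouse described here. The table names, column names, server names and sample rows are all hypothetical.

```python
import sqlite3


def build_demo_warehouse():
    """Create a tiny in-memory 'warehouse' with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per server per night: which cluster, site and
        -- application the server belonged to on that date (hypothetical schema).
        CREATE TABLE server_snapshot (
            snapshot_date TEXT, server TEXT, cluster TEXT,
            site TEXT, application TEXT
        );
        -- One row per availability test: 1 = passed, 0 = failed.
        CREATE TABLE availability_test (
            test_date TEXT, server TEXT, passed INTEGER
        );
    """)
    conn.executemany(
        "INSERT INTO server_snapshot VALUES (?, ?, ?, ?, ?)",
        [
            ("2005-10-01", "web01", "clusterA", "site1", "search"),
            ("2005-10-01", "web02", "clusterA", "site1", "search"),
            # On the next night web02 has been moved to another application.
            ("2005-10-02", "web01", "clusterA", "site1", "search"),
            ("2005-10-02", "web02", "clusterB", "site1", "downloads"),
        ],
    )
    conn.executemany(
        "INSERT INTO availability_test VALUES (?, ?, ?)",
        [
            ("2005-10-01", "web01", 1), ("2005-10-01", "web01", 1),
            ("2005-10-01", "web02", 1), ("2005-10-01", "web02", 0),
            ("2005-10-02", "web01", 1), ("2005-10-02", "web02", 1),
        ],
    )
    return conn


def servers_for_app_on(conn, application, date):
    """Return (server, cluster, site, availability) for one app on one date.

    Availability is the fraction of that day's tests that passed.
    """
    return conn.execute(
        """
        SELECT s.server, s.cluster, s.site, AVG(t.passed) AS availability
        FROM server_snapshot AS s
        LEFT JOIN availability_test AS t
          ON t.server = s.server AND t.test_date = s.snapshot_date
        WHERE s.application = ? AND s.snapshot_date = ?
        GROUP BY s.server, s.cluster, s.site
        ORDER BY s.server
        """,
        (application, date),
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_warehouse()
    for row in servers_for_app_on(conn, "search", "2005-10-01"):
        print(row)
```

Because every relationship row carries its own snapshot date, asking the same question for a different date returns the servers as they were grouped on that date, which is the "date-specific context" idea in miniature.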
On the other hand, we don’t do nearly as much with the data as I’d like to. We have an incredible opportunity to not only learn from the data, but have the data itself actually “teach” us what is interesting about it. As you might have guessed, there’s much more to that last sentence. In the interest of the length of this post, I’ll reserve that as the subject for another time.
We encourage you to follow up with more in-depth questions.
Comments
- Anonymous
October 11, 2005
Wow!! Very neat stuff! Okay, so you collect tons of data from all over the place... event logs, perfmon counters, logs, availability tests, DB stats, etc.. Then you aggregate this stuff in a data warehouse.
So let me start with a stupid question... How?
1) I guess you rely on a combination of WMI scripts, and SMS inventory information to pull this stuff daily. So... you're pulling all of this data out, and then what? What are you actually doing to "warehouse" this data? Are results from queries written directly to database tables in the "warehouse"? Let me take a step back... for all of us mid-sized environment admins... what does the data warehouse look like? Is this a fancy term for backend SQL server?
2) How do you collect availability information for a given asset (Exchange for instance)... what constitutes an availability check, and how do you determine the responsiveness of your servers... is this a statistic, like average time a message sits in the queue? What is the actionable data here?
- Anonymous
October 12, 2005
Thanks for sharing!
For pulling and storing things like Event Logs and Performance counters do you use MOM/SMS and extend that or do you use custom built tools?
Monitoring and reporting is obviously a huge topic but also something that we all struggle with on a daily basis (different scales though) so please provide more detail if you can.