Statistical Process Control Techniques in Performance Monitoring and Alerting

Having been focused on the upcoming release of Visual Studio 2010 for the past six months or so, I have, unfortunately, been neglecting this blog. Before I get back to the series of blog posts I started on parallel programming, I thought I’d first answer the mail.


Concerning a recent presentation that Charles Loboz, my colleague at Microsoft, gave at CMG09 in Texas (December 2009), Uriel Carrasquilla, a very knowledgeable and resourceful performance analyst at NCCI in Florida, writes,

“Mr. Loboz indicated that the theoretical calculations were based on Windows reporting of CPU busy and CPU queue length. My results indicate that the CPU queue length reported by Microsoft can't be correct. I found that other CMG researchers came up with the same conclusion.

“I used similar ideas for my Linux, AIX and Sun Solaris data as reported by SAR, and Mr. Loboz's ideas work like a charm.

“Question: are you aware of this problem with Microsoft performance reporting? Is anybody working on this issue?”


Uri,

The short answer is,

“The derivation and interpretation of the System\Processor Queue Length performance counter is well-documented in the Windows Server 2003 Performance Guide, published back in the Windows Server 2003 Resource Kit. I believe the Processor Queue Length performance counter continues to be a very useful metric to track, as Charles and his team, who are responsible for capacity planning for many of the Microsoft online properties, do.”

-- Mark


I will post a more expansive answer soon, which will allow me to expound a little on a question that gets asked quite frequently, namely, “How are measures of CPU utilization in Windows derived, and how can they be interpreted?”

First, though, I’d like to mention some of the work Charles Loboz and his team have been doing in the context of capacity planning to support some of the massive applications Microsoft provisions and supports. Consider an application like Hotmail that supports something in the neighborhood of 500 million mailboxes (give or take a couple hundred million) and a customer base that is global in scale. That is an order of magnitude larger than the largest corporate entity responsible for a single e-mail or messaging infrastructure. (My guess is that the largest corporate entity responsible for a single e-mail infrastructure is the US Department of Defense. Although it might be the US Army instead, since the different service branches probably operate separate infrastructures.) Performance monitoring and capacity planning on the scale of Hotmail or Search is certainly unprecedented. Do you think performance and capacity planning are important in an application the size of Hotmail? The answer is, “You bet.” The investment in hardware and power consumption alone justifies the capacity planning effort.

I had an opportunity to see some of the material that Charles was working on back in the summer and gave him some feedback on the measurements and what valid inferences can be drawn from them. I haven’t read the final published version, but I am certainly in sympathy with the approach he has adopted. (BTW, people like Uri who attended the recent CMG conference have access to Charles’ paper, but no one else does at the moment. As soon as Charles posts it somewhere publicly, I will link to it from this blog entry.)

Although the scale Charles has to deal with is something new, the approach isn’t. I remember that I also sent Charles a pointer to Igor Trubin's work, which I believe is very complementary. Igor writes an interesting blog called “System Management by Exception.” In addition, Jeff Buzen and Annie Shum published a very influential paper on this subject called “MASF: Multivariate Adaptive Statistical Filtering” back in 1995. (Igor’s papers on the subject and the original Buzen and Shum paper are all available at www.cmg.org.) My colleague Boris Zibitsker has also made a substantial contribution to what I consider a very useful approach, namely, applying statistical process control (SPC) techniques to mine for gold within the enormous amounts of performance data that IT organizations routinely gather.
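To give a flavor of what this kind of statistical filtering looks like in practice, here is a minimal sketch in Python, written under my own assumptions: the metric, the bucketing by hour of day, and the made-up sample data are purely illustrative, not anything taken from Charles’ or Igor’s actual implementations. It builds a MASF-style reference set for each hour of the day from historical samples, sets control limits at the mean plus or minus three standard deviations, and flags new observations that fall outside those limits:

```python
from collections import defaultdict
from statistics import mean, stdev

# history: list of (hour_of_day, metric_value) samples from prior weeks.
# In MASF-style filtering, each hour-of-day bucket gets its own reference
# set, so the control limits adapt to the workload's normal daily rhythm
# instead of relying on a single global average.
def build_control_limits(history, sigmas=3.0):
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {hour: (mean(v) - sigmas * stdev(v), mean(v) + sigmas * stdev(v))
            for hour, v in buckets.items()}

def exceptions(observations, limits):
    """Yield observations that fall outside their bucket's control limits."""
    for hour, value in observations:
        low, high = limits[hour]
        if not low <= value <= high:
            yield hour, value, low, high

# Hypothetical usage with made-up CPU utilization samples (percent busy):
history = [(h % 24, 40 + (h % 24) + noise)
           for h in range(24 * 28)            # four weeks of hourly samples
           for noise in (-1.0, 0.0, 1.5)]
limits = build_control_limits(history)
today = [(9, 50.0), (10, 93.0)]               # 93% busy at 10:00 should stand out
for hour, value, low, high in exceptions(today, limits):
    print(f"hour {hour}: {value:.1f}% is outside [{low:.1f}%, {high:.1f}%]")
```

The point of the exercise is simply that the reporting threshold is derived from the data itself rather than set by hand, which is what makes the approach workable across thousands of machines.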

For perspective, Carnegie Mellon’s Software Engineering Institute (SEI) is usually credited with the original application of SPC techniques to software engineering. Len Bass at the SEI wrote an excellent book entitled Software Architecture in Practice that embraces a broader perspective on quality in software development, one that I share. Len’s work on software quality metrics is close to my current interests here in Developer Division, especially around the potential value of scenario-driven development processes. (More on scenarios in the next blog post. Len also sat for a brief interview on Channel 9 recently, which is posted here.)

Within the application life cycle, performance, unfortunately, is considered one of the non-functional requirements associated with a system specification, which often means it is relegated to a secondary role during much of the life cycle. In the specification process, getting the business requirements right and translating them correctly into system specifications is the most pressing problem for developers of Line of Business applications. Performance is one of those aspects of software quality that often doesn’t get expressed during the software development life cycle until very late in the process, when design flaws that lead to scalability problems are very expensive to fix.

Len Bass’s suggestion is that the requirements definition of a scenario should include a response time specification that can then be monitored throughout the development life cycle, just like any other requirement. That is the approach we advocate here in the Microsoft Developer Division for the software products that we build. In developing Visual Studio 2010, for example, we made major commitments to performance requirements and regularly conduct automated acceptance testing against those requirements. However, you can also see from the many recent blog posts on VS 2010 performance coming from the Developer Division that we have not exactly gotten this down to a science yet.
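To illustrate the idea (and only to illustrate it; the scenario name, the timing harness, and the response time budget below are hypothetical, not how our internal acceptance tests are actually written), a performance acceptance check can be as simple as timing the scenario repeatedly, taking a high percentile rather than the best case, and failing the run when it exceeds the budget in the spec:

```python
import time

def run_scenario():
    """Stand-in for the scenario under test, e.g. 'open a solution'."""
    time.sleep(0.05)   # hypothetical work; replace with the real scenario driver

def response_time_percentile(scenario, runs=20, percentile=0.90):
    """Run the scenario repeatedly and return the chosen percentile response time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        scenario()
        samples.append(time.perf_counter() - start)
    samples.sort()
    index = min(int(runs * percentile), runs - 1)
    return samples[index]

# Hypothetical response time budget taken from the scenario's requirements spec.
BUDGET_SECONDS = 0.250

if __name__ == "__main__":
    observed = response_time_percentile(run_scenario)
    print(f"90th percentile: {observed * 1000:.1f} ms "
          f"(budget: {BUDGET_SECONDS * 1000:.0f} ms)")
    assert observed <= BUDGET_SECONDS, "scenario exceeded its response time budget"
```

The key design choice is that the check fails the build the same way a functional test would, which is what keeps the response time requirement visible throughout development rather than deferring it to the end.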

The Len Bass and SEI approach is informed by experience building real-time control systems to fly airplanes, for example, where performance goals absolutely have to be met or the system cannot function as designed. The performance requirements for real-time control system applications are fundamentally easy to specify: if the computer system doesn’t recognize a condition and respond to it in time, the plane is going to crash. Bass makes the case that system performance is one of those important Quality Attributes that needs to be addressed at the outset of the development life cycle, beginning with the architectural specification and continuing through the design, development and QA processes, to the delivered software’s operational phase, where it finally becomes the focus of performance analysts and capacity planners like Charles Loboz, Uriel Carrasquilla, and Igor Trubin.

What if you need to specify performance requirements for your LOB application, but don’t know where to start? Consider these two approaches:

· Research in human factors engineering has generated a set of performance requirements for specific types of human-computer interactions in order to promote usability and improve customer satisfaction. Steve Seow, another colleague here at Microsoft, has an excellent, concise book on this topic called “Designing and Engineering Time,” complete with application responsiveness guidelines to help improve customer satisfaction. If you are in a position to design a new application from scratch, from first principles, Steve’s book will be an invaluable guide.

· If the application currently exists in some form or another, measure its current performance. When you deliver the next version of the application, any significant decrease in performance from one release to the next will be perceived as an irritant and received negatively by existing users. In other words, measure the scenario of interest on the current system and use that as a baseline that you won’t regress in a subsequent version.

If you have to start somewhere, measuring current levels of performance around key scenarios and using them as a baseline gives you a place to start, at least. My experience is that the current level of performance sets expectations that the next version of the application must meet if you want your customers to be satisfied. In this context, Steve Seow's book cites psychological research into how much of a response time difference is necessary to be perceived as a difference. (About 20% in either direction makes a difference.) This reminds me of Gregory Bateson’s adage that “information is a difference that makes a difference.”
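Here is a minimal sketch of that baseline comparison, assuming hypothetical scenario names and numbers and using the roughly 20% perceptibility figure mentioned above: record the response times measured on the current release, then flag any scenario in the new build that comes in more than about 20% slower.

```python
# Baseline response times (seconds) measured on the currently shipping version;
# the scenario names and numbers here are purely illustrative.
BASELINE = {"open_project": 1.8, "first_build": 12.4, "start_debugging": 3.1}

# Roughly the point at which users perceive a change in response time,
# per the research Seow cites (about 20% in either direction).
PERCEPTIBLE_CHANGE = 0.20

def regressions(new_measurements, baseline=BASELINE, threshold=PERCEPTIBLE_CHANGE):
    """Return the scenarios whose new response time is perceptibly slower than baseline."""
    slower = {}
    for scenario, new_time in new_measurements.items():
        old_time = baseline.get(scenario)
        if old_time is not None and (new_time - old_time) / old_time > threshold:
            slower[scenario] = (old_time, new_time)
    return slower

new_build = {"open_project": 1.9, "first_build": 15.6, "start_debugging": 3.0}
for scenario, (old, new) in regressions(new_build).items():
    print(f"{scenario}: {old:.1f}s -> {new:.1f}s ({(new - old) / old:.0%} slower)")
```

Run against the illustrative numbers above, only the build scenario would be flagged, since it is the only one that slowed down by more than the perceptibility threshold.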

I do think that over time, humans adapt themselves to the response times they experience, such that, eventually, the response times of the new version become the new baseline. In other words, our positive or negative perception tends to atrophy over time. For example, consider the last time you acquired a new desktop or portable computer that was noticeably faster than its predecessor. How long was it before that rush of enthusiasm for the fast, new machine started to diminish? About 30 days, in my experience.

Twenty-five years ago, I was in a position similar to Charles’s, responsible for performance and capacity planning for maybe 20 IBM mainframe computers at a large telecommunications company, which was considered a whole lot of machines to keep track of back in those days. We used a product called MICS (full disclosure: I was a developer on MICS for a brief period in the mid-80s) to warehouse the performance data we were gathering from these machines and the SAS language for statistical reporting. Subsequently, at Landmark Systems, I designed a “management by exception” feature, based on very simple statistical process control techniques, for our monitoring products that our customers loved. Today, for Charles’ team, which needs to monitor performance on hundreds of thousands of servers, these statistical techniques are the only viable approach.

But, of course, Uri is correct. You do have to choose the right metrics. I believe Charles has. I will discuss the CPU utilization metrics in Windows in my next post.


-- Mark Friedman