Engineering the Windows 7 “Windows Experience Index”
We’re busy going through tons of telemetry from the many people around the world who have downloaded and installed the Windows 7 beta, and we’re excited to see so much enthusiasm for kicking the tires. Since most folks on the beta are well-versed in the hardware they use and very tuned into the choices they make, we’ve received a few questions about the Windows Experience Index (WEI) and how it has been changed and improved in Windows 7 to take into account new hardware available for each of the major classes in the metric. In this post Michael Fortin returns to dive into the engineering details of the WEI.
The WEI was introduced in Windows Vista to provide one means across PCs to measure the relative performance of key hardware components. Like any index or benchmark, it is best used as a relative measure and should not be used to compare one measure to another. Unlike many other measures, the WEI merely measures the relative capability of components. The WEI only runs for a short time and does not measure the interactions of components under a software load, but rather characteristics of your hardware. As such it does not (nor can it) measure how a system will perform under your own usage scenarios. Thus the WEI does not measure performance of a system, but merely the relative hardware capabilities when running Windows 7.
We do want to caution folks in trying to generalize an “absolute” WEI as necessary for a given individual. We each have different tolerances or more importantly expectations for how a PC should perform and the same WEI might mean very different things to different individuals. To personalize this, I do about 90% of my work on a PC with a WEI of 2.0, primarily driven by the relatively low score for the gaming graphics component on my very low cost laptop. I run Outlook (with ~2GB of email), Internet Explorer (with a dozen tabs), Excel (with long lists of people on the development team), PowerPoint, Messenger (with video), and often I am running one of several LOB applications written in .NET. I feel with this type of workload and a PC with Windows 7 and that WEI my own brain and fingers continue to be my “bottleneck”. At the other end of the spectrum is my holiday gift machine which is a 25” all-in-one with a WEI of 5.1 (though still limited by gaming graphics, with subscores of 7.2, 7.2, 6.2, 5.1, 5.9). This machine runs Windows 7 64-bit and I definitely don’t keep it very busy even though I run MediaCenter in a window all the time, have a bunch of desktop gadgets, and run the PC as our print server (I use about 25% of available RAM and the CPU almost never gets above 10%).
The overall Windows Experience Index (WEI) is defined to be the lowest of the five top-level WEI subscores, where each subscore is computed using a set of rules and a suite of system assessment tests. The five areas scored in Windows 7 are the same as they were in Vista and include:
- Processor
- Memory (RAM)
- Graphics (general desktop work)
- Gaming Graphics (typically 3D)
- Primary Hard Disk
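The "lowest subscore wins" rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the actual scoring code; the function name `overall_wei` is made up for the example, and the input values are the subscores quoted earlier for the all-in-one machine.

```python
def overall_wei(subscores):
    """Return the overall WEI: the lowest of the top-level subscores.

    `subscores` is any non-empty sequence of subscore values.
    The function name and signature are illustrative assumptions.
    """
    if not subscores:
        raise ValueError("at least one subscore is required")
    return min(subscores)


# Subscores quoted earlier in this post for the all-in-one machine.
all_in_one = [7.2, 7.2, 6.2, 5.1, 5.9]
print(overall_wei(all_in_one))  # 5.1
```

Because the overall value is a minimum rather than an average, a single weak component (gaming graphics in this case) determines the headline number even when every other subscore is much higher.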
Though the scoring areas are the same, the ranges have changed. In Vista, the WEI scores ranged from 1.0 to 5.9. In Windows 7, the range has been extended upward to 7.9. The scoring rules for devices have also changed from Vista to reflect experience and feedback comparing closely rated devices with differing quality of actual use (i.e., to make the rating more indicative of actual use). We know during the beta some folks have noticed that the score changed (relative to Vista) for one or more components in their system and this tuning, which we will describe here, is responsible for the change.
For a given score range, we hope our customers will be able to utilize some general guidelines to help understand the experiences a particular PC can be expected to deliver well, relatively speaking. These Vista-era general guidelines for systems in the 1.0, 2.0, 3.0, 4.0 and 5.0 ranges still apply to Windows 7. But, as noted above, Windows 7 has added levels 6.0 and 7.0; meaning 7.9 is the maximum score possible. These new levels were designed to capture the rather substantial improvements we are seeing in key technologies as they enter the mainstream, such as solid state disks, multi-core processors, and higher end graphics adapters. Additionally, the amount of memory in a system is a determining factor.
For these new levels, we’re working to add guidelines for each level. As an example for gaming users, we expect systems with gaming graphics scores in the 6.0 to 6.9 range to support DX10 graphics and deliver good frame rates at typical screen resolutions (like 40-50 frames per second at 1280x1024). In the range of 7.0 to 7.9, we would expect higher frame rates at even higher screen resolutions. Obviously, the specifics of each game have much to do with this and the WEI scores are also meant to help game developers decide how best to scale their experience on a given system. Graphics is an area where there is both the widest variety of scores readily available in hardware and also the widest breadth of expectations. The extremes to which CAD, HD video, photography, and gaming users push graphics compared to the average business user or a consumer (doing many of these same things as an avocation rather than vocation) are significant.
Of course, adding new levels doesn’t explain why a Vista system or component that used to score 4.0 or higher is now obtaining a score of 2.9. In most cases, large score drops will be due to the addition of some new disk tests in Windows 7 as that is where we’ve seen both interesting real world learning and substantial changes in the hardware landscape.
With respect to disk scores, as discussed in our recent post on Windows Performance, we’ve been developing a comprehensive performance feedback loop for quite some time. With that loop, we’ve been able to capture thousands of detailed traces covering periods of time where the computer’s current user indicated an application, or Windows, was experiencing severe responsiveness problems. In analyzing these traces we saw a connection to disk I/O and we often found typical 4KB disk reads to take longer than expected, much, much longer in fact (10x to 30x). Instead of taking tens of milliseconds to complete, we’d often find sequences where individual disk reads took many hundreds of milliseconds to finish. When sequences of these accumulate, higher level application responsiveness can suffer dramatically.
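To make the idea of "sequences of slow reads" concrete, here is a minimal Python sketch that scans a list of per-read completion times and reports runs of consecutive slow reads. It is not the analysis we actually run on traces; the function name, the 100 ms threshold, the minimum run length, and the sample latencies are all assumptions invented for illustration.

```python
def slow_read_runs(latencies_ms, threshold_ms=100.0, min_run=3):
    """Find runs of consecutive slow reads in a latency trace.

    Returns a list of (start_index, run_length, total_ms) tuples.
    threshold_ms and min_run are hypothetical values chosen for this
    example, not thresholds used by Windows.
    """
    runs = []
    start = None  # index where the current slow run began, if any
    for i, latency in enumerate(latencies_ms):
        if latency >= threshold_ms:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - start, sum(latencies_ms[start:i])))
            start = None
    # Close out a run that extends to the end of the trace.
    if start is not None and len(latencies_ms) - start >= min_run:
        total = sum(latencies_ms[start:])
        runs.append((start, len(latencies_ms) - start, total))
    return runs


# Made-up latencies in milliseconds, for illustration only (not real data).
trace = [12, 15, 9, 340, 410, 275, 14, 11, 520, 13]
print(slow_read_runs(trace))  # [(3, 3, 1025)]
```

In this invented trace, three back-to-back reads add up to more than a second of waiting, which is the kind of accumulation that shows up as an unresponsive application. The isolated 520 ms read is ignored because it does not form a run.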
With the problem recognized, we synthesized many of the I/O sequences and undertook a large study on many, many disk drives, including solid state drives. While we did find a good number of drives to be excellent, we unfortunately also found many to have significant challenges under this type of load, which based on telemetry is rather common. In particular, we found the first generation of solid state drives to be broadly challenged when confronted with these commonly seen client I/O sequences.
An example problematic sequence consists of a series of sequential and random I/Os intermixed with one or more flushes. During these sequences, many of the random writes complete in unrealistically short periods of time (say 500 microseconds). Very short I/O completion times indicate caching; the actual work of moving the bits to spinning media, or to flash cells, is postponed. After a period of returning success very quickly, a backlog of deferred work is built up. What happens next is different from drive to drive. Some drives continue to consistently respond to reads as expected, no matter the earlier issued and postponed writes/flushes, which yields good performance and no perceived problems for the person using the PC. With some drives, however, reads are often held off for very lengthy periods as the drives apparently attempt to clear their backlog of work, and this results in a perceived “blocking” state or almost a “locked system”. To validate this, on some systems, we replaced poor performing disks with known good disks and observed dramatically improved performance. In a few cases, updating the drive’s firmware was sufficient to very noticeably improve responsiveness.
To reflect this real world learning, in the Windows 7 Beta code, we have capped scores for drives which appear to exhibit the problematic behavior (during the scoring) and are using our feedback system to send back information to us to further evaluate these results. Scores of 1.9, 2.0, 2.9 and 3.0 for the system disk are possible because of our current capping rules. Internally, we feel confident in the beta disk assessment and these caps based on the data we have observed so far. Of course, we expect to learn from data coming from the broader beta population and from feedback and conversations we have with drive manufacturers.
For those who obtain low disk scores but are otherwise satisfied with the performance, we aren’t recommending any action (of course, the WEI is not a tool to recommend hardware changes of any kind). It is entirely possible that the sequence of I/Os being issued for your common workload and applications isn’t encountering the issues we are noting. As we’ve said, the WEI is a metric but only you can apply that metric to your computing needs.
Earlier, I made note of the fact that our new levels, 6 and 7, were added to recognize the improved experiences one might have with newer hardware, particularly SSDs, graphics adapters, and multi-core processors. With respect to SSDs, the focus of the newer tests is on random I/O rates and their avoidance of the long latency issues noted above. As a note, the tests don’t specifically check to see if the underlying storage device is an SSD or not. We run them no matter the device type and any device capable of sustaining very high random I/O rates will score well.
For graphics adapters, both DX9 and DX10 assessments can be run now. In Vista, the tests were specific to DX9. To obtain scores in the 6 or 7 ranges, a graphics adapter must obtain very good performance scores, support DX10 and the driver must be a WDDM 1.1 driver (which you might have noticed being downloaded during the Windows 7 beta). For WDDM 1.0 drivers, only the DX9 assessments will be run, thus capping the overall score at 5.9.
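The driver and DX10 requirement above amounts to a simple cap, which can be sketched in Python as follows. This only encodes the rule as stated in this post; the function name, the notion of a "raw" score produced by the assessments, and the representation of the WDDM version as a tuple are assumptions made for the example.

```python
DX9_ONLY_CAP = 5.9  # ceiling when only the DX9 assessments apply
MIN_SCORE = 1.0     # bottom of the WEI range
MAX_SCORE = 7.9     # top of the Windows 7 WEI range


def cap_graphics_score(raw_score, supports_dx10, wddm_version):
    """Apply the DX10 / WDDM 1.1 requirement for scores of 6.0 and above.

    raw_score is a hypothetical score derived from the assessments;
    wddm_version is a (major, minor) tuple such as (1, 0) or (1, 1).
    """
    score = max(MIN_SCORE, min(raw_score, MAX_SCORE))
    if not supports_dx10 or wddm_version < (1, 1):
        # Without DX10 support and a WDDM 1.1 driver, the 6 and 7
        # ranges are not reachable.
        score = min(score, DX9_ONLY_CAP)
    return score


print(cap_graphics_score(6.8, True, (1, 0)))  # 5.9
print(cap_graphics_score(6.8, True, (1, 1)))  # 6.8
```

The same adapter can therefore land in a different range after a driver update, since the cap depends on the driver model and not only on the hardware.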
For multi-core processors, both single threaded and multi-threaded scenarios are run. With levels 6 and 7, we aim to indicate that these systems will rarely be CPU bound for typical use and are quite suitable for demanding processing tasks and multi-tasking. As examples, we anticipate many quad core processors will be able to score in the high 6 to low 7 ranges, and 8 core systems to be able to approach 7.9. The scoring has taken into account the very latest microprocessors available.
For many key hardware partners, we’ve of course made available additional details on the changes and why they were made. We continue to actively work with them to incorporate appropriate feedback.