The Windows prefetcher
In my last blog I talked about some of the conditions that need to hold for the cold startup formula
ColdStartupTimeMSec = WarmStartupTimeMSec + 4 * NumberOfReads + 20 * NumberMBytes
to be accurate. I mentioned that if you have overlap between the CPU and the disk, it may not be accurate (although it would still be an upper bound). It turns out that there is a feature of the OS that actually makes the cold startup formula work all the time, even in those cases I called out above where it might not work. That feature is the OS prefetcher, and it is in Windows XP and beyond.
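To make the formula concrete, here is a minimal sketch (in Python, with made-up sample numbers) of how you might use it to estimate cold startup; the 4 msec/seek and 20 msec/MB constants are the ones from the formula above, and everything else is just illustration.

# Cold startup estimate from the formula above:
#   ColdStartupTimeMSec = WarmStartupTimeMSec + 4 * NumberOfReads + 20 * NumberMBytes
SEEK_MSEC_PER_READ = 4      # rough cost of moving the disk head for one read
TRANSFER_MSEC_PER_MB = 20   # rough cost of actually transferring a megabyte

def cold_startup_msec(warm_startup_msec, number_of_reads, number_mbytes):
    """Estimate cold startup time (msec) from warm startup, read count, and MB read."""
    return (warm_startup_msec
            + SEEK_MSEC_PER_READ * number_of_reads
            + TRANSFER_MSEC_PER_MB * number_mbytes)

# Hypothetical example: 300 msec warm startup, 500 reads, 8 MB paged in.
print(cold_startup_msec(300, 500, 8))   # 300 + 2000 + 160 = 2460 msec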
If you look at the I/O caused by running a program at cold start, you will find that the disk tends to be accessed randomly. You need a page here, then a page in some other part of the file, as the CPU executes code from all over the place. This causes literally thousands of individual disk accesses, and as we have seen, each disk seek costs us 4 msec, so seek time dominates and adds up to seconds of startup time.
The observation is that this 'lazy loading' of data from the disk is actually very inefficient because it loads the data in small pages and each page costs something like 4 msec to move the disk head. It would be much better if you could read the data in big chunks, even if you end up reading data that you never use.
The prefetcher uses this observation to speed up cold startup. On EVERY program load, the OS keeps track of which data that program needed to have paged in (even if the faults were not 'hard' page faults). This data is persisted in binary form in a *.PF file in c:\windows\prefetch (go look at that directory). When a process is created, the first thing the OS does is consult this *.PF file. It then issues ALL the file I/O specified in this file at once and blocks the program until it completes. These I/O requests are nicely sorted and combined: if several I/Os are adjacent (or even close to one another), a single large combined I/O is issued instead (which avoids a seek), and the I/Os are ordered so that the disk head moves the smallest possible distance between reads. This makes the disk I/O as efficient as it can be, and because disk I/O is the slow part of cold startup time, it leads to dramatic improvements in cold startup time (often the prefetcher cuts cold startup time in half).
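The sorting and combining step is essentially interval merging on file offsets. The actual prefetcher logic is internal to Windows; the sketch below is a hypothetical illustration, assuming each request is a (start, length) pair in bytes and that reads within 64 KB of each other are worth folding into one larger read.

# Hypothetical illustration of sorting and combining prefetch I/O requests.
# Each request is a (start, length) pair of byte offsets within a file.
MERGE_GAP = 64 * 1024   # assumed threshold; the real policy is internal to the OS

def combine_reads(requests):
    """Sort reads by offset and merge ones that are adjacent or close together."""
    merged = []
    for start, length in sorted(requests):
        end = start + length
        if merged and start - merged[-1][1] <= MERGE_GAP:
            # Close to the previous read: extend it rather than pay another seek.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]

# Four scattered page-sized reads collapse into two larger sequential reads.
reads = [(0x2000, 4096), (0x0000, 4096), (0x3000, 4096), (0x200000, 4096)]
print(combine_reads(reads))   # [(0, 16384), (2097152, 4096)]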
Note that the prefetcher does not allow the program to begin until all of this I/O is complete. I am speculating they did this because they did not want random I/Os from the executing program to interfere with the beautifully optimized disk head movement. However, this has an unfortunate consequence. It means that things like splash screens (which get put up early to alert the user that something is indeed happening) do not come up as quickly as you would like (because NO code in the process runs until ALL the I/O completes). For this reason, the prefetcher has a limit on the amount of data it will prefetch (on XP this is 16MB). This keeps the maximum pause caused by the prefetcher to a tolerable amount (by our formula above, 16 MB * 20 msec/MB = 320 msec, plus probably about 100 seeks or so, which is still < 1 sec).
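To see why the 16MB cap keeps that pause tolerable, here is the same arithmetic spelled out; the ~100 seeks is a rough estimate from the paragraph above, not a measured value.

# Worst-case priming pause under the XP 16 MB prefetch limit, using the
# formula's constants. The ~100 seeks figure is a rough estimate.
prefetch_limit_mb = 16
transfer_msec = 20 * prefetch_limit_mb   # 320 msec of data transfer
seek_msec = 4 * 100                      # ~100 seeks at 4 msec each
print(transfer_msec + seek_msec)         # 720 msec, still well under a second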
One side effect of the prefetcher is that for programs under its data size limit, there can be NO overlap between CPU and disk I/O during startup (the program is not allowed to run during the initial priming phase). So this is yet another reason that tends to keep the formula accurate.
Finally I just wanted to close by noting that the formula's usefulness is not so much that it is accurate in all cases, but rather that it is accurate enough and very simple, and thus can be used to reason about what can be done to improve cold startup. The formula states that you can attack cold startup by
- Attacking warm startup (and parallelizing CPU work is one way of doing this),
- Attacking the amount of data read off the disk (minimizing the amount of data needed to start up is the way to do this)
- Attacking the number of seeks (packing data so that it is not scattered about within a DLL does this)
It attaches real values to each of these components so that for a given scenario you can know how each component is contributing and thus which is the most profitable component to attack. This is its real value.
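For example, here is a hypothetical breakdown of one scenario into the three components; the numbers are invented, but they show how the formula tells you which term dominates and therefore where optimization effort pays off.

# Hypothetical scenario: which component of cold startup should we attack first?
# All numbers are invented for illustration.
warm_msec = 250        # warm startup time (CPU work)
num_reads = 800        # number of distinct disk reads (seeks)
mbytes_read = 10       # megabytes paged in from disk

components = {
    "warm startup (CPU)": warm_msec,
    "seeks (4 msec each)": 4 * num_reads,
    "data volume (20 msec/MB)": 20 * mbytes_read,
}
total = sum(components.values())
for name, msec in components.items():
    print(f"{name}: {msec} msec ({100 * msec / total:.0f}%)")
# Seeks dominate here (3200 of 3650 msec), so packing data to avoid
# scattered reads is the most profitable attack for this scenario.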
Comments
Anonymous
April 15, 2007
Made for an interesting reading. Will look out for more such articles
Anonymous
April 16, 2007
The comment has been removed
Anonymous
April 16, 2007
As you point out, the exact formula numbers are tailored to a specific machine. Thus it estimates the cold startup for that particular machine. You should choose numbers that are most representative for your most important cases. It is interesting that when disks get slower they tend to get slower both in seek time and throughput (since both are related to disk rotational speed). Thus while the numbers are wrong, the tradeoff between putting effort into reducing seeks vs. putting effort into reducing data volume remains the same (both get bigger with respect to warm startup, but in most apps warm startup is already quite small). Thus while the magnitude is wrong, it still points you in the 'right direction' as far as optimizing your application. For what it is worth.
Anonymous
April 18, 2007
What happens when you do something like defragmenting that can move files around on disk? Does this invalidate the files in the Prefetch directory, or does the prefetcher have a way of detecting that the thing it read is different than the thing that was there? If it's the former and the person defragments often (I do because it seems to speed up my computer) it seems like you could end up with a lot of cold startups. If it's the latter, then you'd have to pay a heavier price: prefetching + cold startup.
Anonymous
April 18, 2007
This is a very good question, because it points out that I did not explain the whole picture. First, the data in the PF file does not wire in hard disk offsets (think of it as a set of relative offsets from the base of the file). Thus in fact the prefetcher will not do a great job unless the file is in one chunk (that is, it is defragmented). Typically when you do a bulk copy of a file (as would happen in a typical software installation), the file is usually not fragmented (the OS goes to some trouble to avoid fragmentation if it can). However, even if it was initially fragmented, it turns out that about every 3 days the Windows scheduler will run the defragmenter ON JUST THE FILES MENTIONED IN PREFETCH FILES. This ensures that even if the file starts out fragmented, it will not remain so for long. In short, the prefetcher does not interact badly with the defragmenter (in fact it needs it).