NGen Overview
I thought it would be useful to provide a
primer on the NGen tool and pre-jitting your code for performance reasons.
In particular, there are some gotchas you must be aware of when authoring your
product. In this entry, I'm going to cover some background material on
paging (which you can skip if you are an expert
already). Then we'll cover the workings of the
NGen tool, some
servicing implications, and finally some future
directions.
Before we get started, let me keep up a Microsoft tradition
and include the key takeaways right here. If
you get nothing else out of this topic or can't read the whole thing, make sure
you absorb the following:
Paging Primer
Windows
uses a virtual address space on your machine, so for a 32-bit system you get
from 0 to 4GB of addressable memory for each process. Windows code is typically compiled into a Portable
Executable file (PE file), which contains sections of code and data marked with
page attributes like read, write, and execute. When the OS loads such a
file into a process, it maps the memory from your file into physical pages that
can be addressed by the process. So far so good?
On the x86, calls to methods are typically in
the form of "call address", where address is an absolute
value from 0 to 4GB, and tells the CPU the precise location it should transfer
to. This poses a problem for the compiler, because it means that when the
user's file is loaded, it needs to know precisely where all of the methods it
will call inside that file live (not just relative to the start of the file, but
the absolute address in the entire process). There are two things that
kick in here to aid you:
Base Address | This is the address you specify as a developer (either through your compiler, e.g. /baseaddress in VB.Net or C#, or using the rebase tool) where you want your executable to be loaded. The compiler will now assume the file will get loaded there, and can now predict the absolute address of every method in the file. |
Relocs | Just in case your file can't be loaded at that base address, say if someone is already loaded there, the compiler will emit a set of relocs in the file that tell the OS where absolute addresses are located in the image. If the file gets relocated to a new place in the process, the OS will now fix up the addresses -- essentially adjust them to the new home of the code or data. This allows flexibility, but is also expensive; keep reading to find out why. |
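To make the reloc idea concrete, here is a minimal sketch in Python. It is a toy model, not the real PE loader: the image bytes, the list of reloc offsets, and the two base addresses are all made up for illustration. It only shows the arithmetic the OS performs, which is adding the difference between the actual and preferred base to every absolute address the relocs point at.

```python
import struct

def apply_relocs(image, reloc_offsets, preferred_base, actual_base):
    """Return a copy of image with each 32-bit absolute address rebased.

    reloc_offsets lists the byte offsets in the image that hold an
    absolute address (this plays the role of the reloc table).
    """
    delta = actual_base - preferred_base
    patched = bytearray(image)
    for off in reloc_offsets:
        (addr,) = struct.unpack_from("<I", patched, off)
        struct.pack_into("<I", patched, off, (addr + delta) & 0xFFFFFFFF)
    return bytes(patched)

# Hypothetical values: the file wanted 0x10000000 but was loaded at 0x20000000.
PREFERRED = 0x10000000
ACTUAL = 0x20000000

# Toy image: 4 filler bytes, one absolute address, 4 more filler bytes.
image = b"\x90" * 4 + struct.pack("<I", PREFERRED + 0x1234) + b"\x90" * 4

relocated = apply_relocs(image, [4], PREFERRED, ACTUAL)
new_addr = struct.unpack_from("<I", relocated, 4)[0]
print(hex(new_addr))  # 0x20001234
```

Note that when the file loads at its preferred base, the delta is zero and nothing has to be written at all.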
Besides allowing the compiler to stitch
together your program, a base address gives you a predictable location for your
file to get loaded every time it is executed. This is important, because
if you have sections of the file (say all of your executable code) that are read
only, then we'd like to be as efficient as possible on the machine and share those
pages between processes. The OS accomplishes this if your pages are marked
read only and sharable. So if you have the code for strcpy from msvcrt.dll at
location 0x70124800, then the one page of physical memory where that code lives can be
viewed in all
of the processes on the machine that also need it, provided those processes have
loaded msvcrt.dll at the same address.
[Figure: User Process 1 | Kernel Mapped Pages | User Process 2]
See the advantage? Overall system memory
pressure goes down with shared pages because only one physical page is used no matter how many
times you load it. Also, speed of loading code goes up, because chances
are the system already has that file loaded in some other process on the
machine. This is typically referred to as a "warm startup", because the OS
has already loaded many of the pages you need, and doesn't have to
go out to disk to get them. So bottom line, sharing of pages between
processes is a GOOD THING.
I mentioned that having to relocate a file away
from its base address is a BAD THING for your shareable data. This loss of sharing is the reason.
If you cannot load at your preferred base address, then those addresses in those
otherwise sharable pages are now wrong. So the OS has to make a copy of
the page for your process, mark it write, and then fix-up all of the invalid
values. This is bad because it takes both more time to do this (slower load
times) and more space (for the extra unshared pages).
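A rough way to picture the space cost: every page that contains at least one fixed-up address has to become a private copy. The Python sketch below counts those pages for a made-up list of reloc offsets. The 4 KB page size is the usual x86 value; the offsets are purely illustrative.

```python
PAGE_SIZE = 4096  # usual x86 page size

def private_pages(reloc_offsets, page_size=PAGE_SIZE):
    """Return the set of page indices that must be copied and patched
    if the image is relocated away from its preferred base."""
    pages = set()
    for off in reloc_offsets:
        pages.add(off // page_size)
        # A 4-byte address can straddle a page boundary.
        pages.add((off + 3) // page_size)
    return pages

# Hypothetical reloc offsets within an image.
relocs = [0x0010, 0x0FFE, 0x2004, 0x2100]
pages = private_pages(relocs)
extra_bytes = len(pages) * PAGE_SIZE

print(sorted(pages))  # [0, 1, 2]
print(extra_bytes)    # 12288
```

Even a handful of relocs scattered through the file can turn most of its pages private, which is why loading at the preferred base address matters so much.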
I should point out that some pages are, of course, intended
to be per-process. Your global data for example wouldn't make much sense
if you were sharing it with another running instance of your application!
But in general we try very hard to reduce the number of pages in the system
because of the high cost of the extra memory pressure.
Ok, all of this background is interesting, but
what does this have to do with Managed code and the CLR? First, we also
use the PE file format for managed code, so your VB.Net application will be
stored in the same file format as kernel32.dll. This allows managed
executables to appear anywhere you would normally expect. For example if you want to do a CoCreateInstance on your managed code, or do a LoadLibrary directly,
you can do so.
This file format choice means we have to follow the same rules for assigning base addresses. And
guess what? We made the metadata and IL your compiler generates read only +
sharable so we could use the same memory management benefits you get with
unmanaged code.
Now think about what the JIT compiler does for
a minute. It just-in-time compiles your program one method at a time. That
means we allocate, on the fly, some memory and write the necessary native code
for your program out to
that location. When we need to call a method, we know where we put it in
the absolute address range, so we can do the same
"call address" you saw
for unmanaged code. The advantage of the JIT is that it can literally
stitch your program together as you go, and it only compiles the code that
you actually execute. But since this is happening on the fly, all of those pages where this code is allocated
are for that process only. We get none of the sharing advantages
you got with unmanaged code in read only + sharable pages, and it also takes
time to run that compiler. We did some experiments early on in the Runtime
as proof of concept for our managed C++ compiler which included recompiling Word as an
IL image. It worked great! But it was slow. Office is a big
application, and using the JIT for this case didn't put our best foot forward.
[Figure: User Process 1 | Kernel Mapped Pages | User Process 2]
Wouldn't it be great if you could get the same
page sharing advantage as unmanaged code, and not have to run the JIT every time
for a big application like Office? That's the NGen tool, and we'll drill
into that in the next section.
NGen stands for "Native Image Generator".
The tool allows us to run the JIT compiler on all of your IL in an assembly (a
PE file) at one sitting, and cache the results out to disk. Now when you
want to load and run that assembly, we can find it in the cache and load it just
like an unmanaged image. Because the code is read only + sharable, you get
the same benefits of page sharing.
So what precisely is in that image that gets created?
Let's look at the contents:
Header | All PE files contain the standard set of headers, and an NGen image is no different. |
Native Code | Obviously this is the key thing we are trying to get into the image, and does make up the bulk of the image size. The code persisted is 100% native at this point, so that the JIT does not need to get involved to execute it. |
No Metadata or IL | The current NGen produced image does not have a copy of the metadata or the IL in it. This is significant, because it means that you will need to have both the original IL Assembly and the NGen image loaded at once. In general we try to avoid touching the metadata and IL at runtime, but you can't always avoid it. Two examples are late bound programming (e.g. Reflection, which needs name information from the metadata) and JIT'ing of non-NGen'd code (the IL is read to see if it can be inlined). |
Fix-up Tables | The CLR requires more than just code to execute. It must have access to key data structures which describe things like Class and Method layouts. These are only known at runtime. We want to reduce overall writable pages in a process. To accomplish this, NGen stores a table of pointers to this data which will be allocated at run time. This allows NGen to generate one version of the code that will work unmodified for all processes, because there is a predictable location in the image where you can find the pointer to the dynamically allocated data (essentially a slot in this table). However, this technique has the down side of (1) slowing down startup to fill out the table, and (2) generating sub-optimal code which must use a pointer indirection to get the data it needs. Finally, it also means that we cannot simply persist the output of the JIT compiler itself while it runs, because the actual code generated is different in the two scenarios. |
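The fix-up table idea can be modeled in a few lines of Python. This is only a toy illustration of the indirection, not how the CLR actually lays out its tables: the class, the slot names, and the addresses are all hypothetical. The point is that the "code" only ever knows a slot index, which is the same in every process, while the value in the slot is filled in per process at startup.

```python
class NativeImage:
    """Toy model of an NGen image: slot indices are fixed when the image
    is generated, so the shared code never embeds a runtime address."""

    def __init__(self, slot_names):
        self.slot_index = {name: i for i, name in enumerate(slot_names)}

    def new_process_table(self):
        # Each process gets its own writable table, empty at load time.
        return [None] * len(self.slot_index)

def fill_fixups(image, table, runtime_structs):
    """Startup cost: fill every slot with this process's runtime data."""
    for name, idx in image.slot_index.items():
        table[idx] = runtime_structs[name]

def lookup(image, table, name):
    """Per-use cost: one extra indirection through the table."""
    return table[image.slot_index[name]]

# Hypothetical slot names and per-process addresses.
image = NativeImage(["Foo.layout", "Bar.layout"])
proc1 = image.new_process_table()
proc2 = image.new_process_table()
fill_fixups(image, proc1, {"Foo.layout": 0x00A01000, "Bar.layout": 0x00A02000})
fill_fixups(image, proc2, {"Foo.layout": 0x00C44000, "Bar.layout": 0x00C45000})

print(hex(lookup(image, proc1, "Foo.layout")))  # 0xa01000
print(hex(lookup(image, proc2, "Foo.layout")))  # 0xc44000
```

Both costs called out above show up here: fill_fixups is the startup work, and every lookup pays for the extra hop through the table.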
Even with the trade-offs
mentioned here, we've seen some remarkable performance
wins from this technique (and it only gets better with each new release). There are, however, some
things you need to consider before you jump on the NGen bandwagon.
We'll cover those now.
Performance Win? | Measure, measure, measure. You should always verify that this is a win for you. First, you should be writing either a shared library (like the BCL itself) or a client application that would really benefit from this kind of win. You must go try your app with and without to make sure it is worth the effort. It may not always be. For example in a Server scenario, where the application runs a long time, you can amortize the cost of jitting over the run of your server. Combine that with lack of sharing across AppDomains and NGen isn't a win. |
To Cache or Not to Cache | Generating an NGen image takes time. You will be compiling all of your code at once into the final binary. The larger the file, the longer this takes. We do this for the .NET Framework during installation, and you can see the pause. In our case it makes sense: all of your applications will run that much faster because of this. You should decide if your application can handle this kind of wait. You may not want to do this for dynamic web content in a browser for example. Who wants to wait for the compile to finish for it to come up once? And will you ever run the same program as is again? |
Brittleness | The MSDN documentation gives you the command line arguments and usage of the tool (which comes with the distribution). You should read very carefully through the section on brittleness. As an example, the ngen'd image is tightly coupled to the version of the Framework you compiled against. If that version is serviced (we ship a Service Pack for example), then your image will not be loaded, and your application will automatically fall back to jitting. To be clear: your code will still run, but it will not take advantage of the speed improvements you measured. This is something we are spending a lot of time addressing in the next version of the product. |
So you've decided to ngen your image. Now
what? This section contains some steps you should be taking:
Start with MSDN | Make sure you read all of the documentation on MSDN. I will only pull out highlights here. |
Picking a Base Address | Pick a good set of base addresses for your PE files. The NGen'd image will get placed right behind your IL image in the process. You need to allocate enough space between your IL PE files for this image to be loaded. The general guideline is to allocate at least 3x the original size of the IL image (so for example if your IL assembly was 1 MB large, you should allocate a 3 MB total range for that assembly plus its ngen'd image). You should take a look at the size of your NGen'd images and verify you have enough space, not only for what you ship, but for some reasonable amount of growth if you ship a bug fix release of the file. |
When to NGen? | You need to pick when you want to invoke the tool. For the distribution, we invoke ngen as a final step during setup. This is the best approach in most cases, because your application will start fast from the first time it is run. However, this will consume space on the user's machine, so if you think a particular application, or component, that you ship may not be run often (or at all), then you might consider deferring ngen to when the application starts the first time. For example, you could schedule a Windows scheduled task to compile it at night after the first time the code is run. |
Servicing | When you release bug fixes to customers in your managed code, you will need to regenerate the ngen'd images as well. This is pretty simple to do: just run the ngen command again. But you need to make sure it is covered by the setup/patching feature you are shipping. |
Uninstall | Remember to use ngen /delete to remove your unneeded assemblies from the cache when you uninstall your application. Currently the CLR will remove all assemblies tied to a version of the framework on uninstall of the .NET FX, but it doesn't try to figure out when you've uninstalled just your application. |
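The 3x guideline from "Picking a Base Address" is easy to turn into a quick planning script. The Python sketch below is one way you might do it; the assembly names, sizes, and starting address are invented, and the 64 KB alignment is my assumption about what your compiler's base address option expects, so check that against your own toolchain.

```python
ALIGN = 0x10000  # 64 KB; assumed alignment for base addresses

def round_up(value, align=ALIGN):
    """Round value up to the next multiple of align."""
    return (value + align - 1) // align * align

def plan_base_addresses(il_sizes, start=0x10000000, factor=3):
    """Assign each assembly a base address, reserving factor x its IL size
    (rounded up) so the NGen'd image fits right behind the IL image."""
    plan = {}
    base = start
    for name, size in il_sizes.items():
        plan[name] = base
        base += round_up(size * factor)
    return plan

# Hypothetical assemblies and IL image sizes in bytes.
sizes = {"Core.dll": 1 * 1024 * 1024, "UI.dll": 300 * 1024}
plan = plan_base_addresses(sizes)

for name, base in plan.items():
    print(f"{name}: 0x{base:08X}")
# Core.dll: 0x10000000
# UI.dll: 0x10300000
```

Treat the factor as a starting point. Once you have real NGen'd images, compare their actual sizes against the reserved ranges and leave headroom for bug fix releases.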
As mentioned above, there are brittleness
issues with ngen in V1.0 and V1.1 (aka Everett). So you need to plan out
what you will do in the face of those things changing. As an example, we
will release a service pack of the CLR at some point, and your cached ngen
images will no longer load. Your code will still work, but it will run
under the jitter which will be slower (you did measure to verify you needed ngen,
right?).
Right now fixing this is tricky. Expect
us to improve this situation in the future, but for
now, here are some ideas on how you can address this:
Setup/Patching | Make sure your setup and patching programs are doing the right thing. If you ship a fixed version of your IL code, you need to re-run ngen on those files for it to be up to date. |
Poor Man's Service | You can periodically run a scheduled task to check your images and re-ngen them as required. If you already have some kind of nightly enterprise script running on client machines, as an example, this would be a fine time to do maintenance. Note: if your images are already up to date, the NGen tool will simply report that and exit instead of doing a lot of unnecessary work. |
Rocket Science | If you are really motivated, you could go find the list of natively loaded PE files in your process (use the Win32 PSAPI API or walk the PEB) to see if your NGen'd image was actually loaded in the process. If it wasn't, most likely it means you need to fix it up, and your app could do so for the next run. I might prototype this at some point, but suffice to say it isn't a trivial thing to do. |
At this point you've probably looked through the list of
Servicing Hints and thought to yourself:
"Wow that's kinda ugly!" And you're right. NGen for Version 1.0 and
1.1 was primarily designed and engineered for internal use by the CLR itself.
When we install SP's of our stuff, we force a re-ngen of all of the core
components, which keeps that part of your app running fast.
Going forward, NGen is still a
key foundation for our performance story. It gives you the working set wins (better page
sharing, quicker loading) that are required for starting your application
faster. It also allows for more aggressive optimizations in the compiler.
If we tried doing really aggressive optimizations every time you ran the JIT,
you'd actually run slower just waiting for the compiler to finish.
Expect that in the future we will be addressing the
clumsiness and the servicing issues so your life is easier. Here are just a
few things we're thinking about:
ngen /repair | We'll be talking about a feature called "ngen /repair" at the October 2003 PDC in LA next month, which dramatically simplifies fixing up the cached images. |
New API's | There are some cleaner ways we could expose the fact that your application is out of date. It would make it simpler to write your app if it could query this state, or force it to correct automatically. We are considering these designs now. |
Double Loads | As mentioned above, the current CLR loads both the IL image and your NGen image in the process. This double loading is inefficient because it makes the OS loader do more work (slower startup time). Look for us to try to avoid this in the future. |
Indirections | As mentioned above, NGen images still contain a lot of fix-up tables for dynamic data structures. This causes the startup to be slower (while those tables are fixed up) and generates sub-optimal code which must use the indirection of the table to get at the data. Look for us to get more aggressive and avoid a lot of this. |
And finally, in closing, make sure to
re-read those key takeaways.
There are some important links you may be interested in
reading:
Anonymous
September 24, 2003
Very informative article. I never realized that there was so much to NGen. I just thought that it was doing the same thing as the CLR when it JITs, only not at the app's runtime. I look forward to more articles like this from you. This was a very good post. :)
Anonymous
September 26, 2003
Thanks Jason that was a great post :) Would love to see some posts on the Rotor JIT, how it compares to the standard JIT and what issues are involved with the JIT when porting the PAL.
Anonymous
September 26, 2003
Good suggestion, we should definitely be able to provide that kind of content. Thanks!
Anonymous
September 30, 2003
Good stuff! Would love to see articles and posts like this, maybe these could be hosted on http://www.sscli.net (I do think it should have a Rotor RSS aggregator). Along with a few others I am currently doing some research on a Rotor port (hence my interest :)
Anonymous
October 01, 2003
Congratulations on this great article. There is loads of good information here :) You mentioned at one point that earlier you compiled Word in IL but it was slow. Did you also ngen it? If yes, how big was the performance gain? Thanks!
Anonymous
October 07, 2003
Good question Krisztian. At the time we did this experiment in V1.0, NGen didn't actually exist (we were designing and writing the tool then). So we did not do that experiment, and have not since tried recompiling MS Word. We have, of course, tried numerous other applications to gain our data. If we do try it again, I'll post results here for folks to look at.
Anonymous
February 13, 2004
I wish that MSDN pointed to this article when they cite that "pre-compiling a Windows Forms application" can lead to faster startup times. They just mention that it can help without any of the considerations that I found here... (I've since taken it out of my installation procedures, even though it did offer faster start time for my applications - it didn't seem worth the risk to an upgrade path to future versions of the Framework)