To Inline or not to Inline: That is the question
In a previous posting, I mentioned that .NET V3.5 Service Pack 1 had significant improvements in the Just-In-Time (JIT) compiler for the x86 platform, and in particular its ability to inline methods was improved (especially for methods with value type arguments). Well, now that this release is publicly available, my claim can be put to the test. In fact, an industrious blogger named Steven did just that and blogged about it here. What he did was create a series of methods, each a bit bigger than the previous one, and determine whether they got inlined or not. Steven did this by throwing an exception and then programmatically inspecting the stack trace associated with the exception. This makes sense when you are trying to automate the analysis, but for simple one-off cases it is simpler and more powerful to look at the native instructions directly. See this blog for details on how to do that using Visual Studio.
What Steven found was that when he tried to get the following method inlined
public void X18(int a)
{
    if (a < 0 || a == 100)
    {
        Throw(a * 2);
    }
}
it was not inlined. This was not what Steven expected, because this method is only 18 bytes of IL, and the previous version of the runtime would inline methods up to 32 bytes of IL. It seems like the JIT's ability to inline is getting worse, not better. What is going on?
Well, at the heart of this anomaly is a very simple fact: it is not always better to inline. Inlining always reduces the number of instructions executed (at a minimum, the call and return instructions are not executed), but it can (and often does) make the resulting code bigger. Most of us intuitively know that it does not make sense to inline large methods (say, 1K bytes), and that inlining very small methods that make the call site smaller (because a call instruction is 5 bytes) is always a win. But what about the methods in between?
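To make the call overhead concrete, here is a small sketch you can try (the class and method names are invented for this example; `MethodImplOptions.NoInlining` is a real CLR attribute used here to suppress inlining for comparison, and the actual timings depend entirely on your JIT and hardware):

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class InlineDemo
{
    // Tiny method: a call instruction alone is 5 bytes, so inlining
    // this single add makes the call site both smaller and faster.
    public static int AddSmall(int a, int b) { return a + b; }

    // Same body with inlining suppressed, so every call pays the
    // call/return overhead.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int AddNoInline(int a, int b) { return a + b; }

    static void Main()
    {
        const int N = 100000000;
        int sum = 0;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) sum += AddSmall(sum & 1, i & 3);
        Console.WriteLine("inlinable: {0} ms", sw.ElapsedMilliseconds);

        sw.Reset(); sw.Start();
        for (int i = 0; i < N; i++) sum += AddNoInline(sum & 1, i & 3);
        Console.WriteLine("no-inline: {0} ms", sw.ElapsedMilliseconds);

        Console.WriteLine(sum); // keep sum live so the loops are not elided
    }
}
```

On a typical machine the second loop runs measurably slower, which is the overhead inlining removes; run it yourself rather than trusting any particular number.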
Interestingly, making code bigger also makes it slower, because memory is inherently slow. The bigger your code, the more likely it is not in the fastest CPU cache (called L1), in which case the processor stalls 3-10 cycles until it can be fetched from the next cache (called L2), and if it is not there either, from main memory (costing 100+ cycles). For code that executes in tight loops, this effect is not a problem, because all the code will 'fit' in the fastest cache (typically 64K). However, for 'typical' code, which executes a lot of code from a lot of methods, the 'bigger is slower' effect is very pronounced. Bigger code also means more disk I/O to get the code off the disk at startup time, which means that your application starts more slowly.
In fact, the first phase of the JIT inlining improvement was simply to remove the restrictions on JIT inlining. After that phase was complete we could inline A LOT, and in fact the performance of many of our ‘real world’ benchmarks DECREASED. Thus we had irrefutable evidence that inlining could be BAD for performance. We had to be careful; too much inlining was a bad thing.
Ideally, you could calculate the effect of code size on caching and make a principled decision about when inlining is good and when it is bad. Unfortunately, the JIT compiler does not have enough information to take such a principled approach. However, some things were clear:
1. If inlining makes the code smaller than the call it replaces, it is ALWAYS good. Note that we are talking about the NATIVE code size, not the IL code size (which can be quite different).
2. The more a particular call site is executed, the more it benefits from inlining. Thus, code in loops deserves to be inlined more than code that is not in loops.
3. If inlining exposes important optimizations, then inlining is more desirable. In particular, methods with value type arguments benefit more than usual because of optimizations like this, and thus having a bias toward inlining these methods is good.
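As a hypothetical illustration of point 3 (the type and method names here are invented for the example): a small method taking value types, where inlining lets the JIT treat the struct fields as plain locals instead of copying whole structs for the call:

```csharp
struct Point
{
    public int X, Y;
    public Point(int x, int y) { X = x; Y = y; }

    // When a call to Dot is inlined at its call site, the JIT can
    // treat a.X, a.Y, b.X, b.Y as ordinary int values (likely held
    // in registers) and no Point copies need to be made for the call.
    public static int Dot(Point a, Point b)
    {
        return a.X * b.X + a.Y * b.Y;
    }
}
```

Without inlining, each call has to materialize the `Point` arguments; with it, the struct can disappear entirely, which is why value type methods get a bias toward inlining.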
Thus, given an inline candidate, the heuristic the x86 JIT compiler uses is:
1. Estimate the size of the call site if the method were not inlined.
2. Estimate the size of the call site if the method were inlined (this is an estimate based on the IL; we employ a simple state machine (a Markov model), built from lots of real data, to form this estimator logic).
3. Compute a multiplier. By default it is 1.
4. Increase the multiplier if the code is in a loop (the current heuristic bumps it to 5 in a loop).
5. Increase the multiplier if it looks like struct optimizations will kick in.
6. If InlineSize <= NonInlineSize * Multiplier, do the inlining.
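The steps above can be summarized in a sketch like the following. This is illustrative pseudocode in C#, not the actual JIT source; the method name is invented, and the size of the struct bonus in step 5 is an assumption (the text only says the multiplier is increased):

```csharp
static class InlineHeuristic
{
    // Sketch of the x86 JIT inlining decision described in the post.
    public static bool ShouldInline(int estimatedInlineSize,   // step 2
                                    int estimatedCallSize,     // step 1
                                    bool inLoop,
                                    bool structOptsLikely)
    {
        int multiplier = 1;                     // step 3: default multiplier
        if (inLoop) multiplier = 5;             // step 4: loop bonus from the text
        if (structOptsLikely) multiplier *= 2;  // step 5: bonus value is an assumption
        return estimatedInlineSize <= estimatedCallSize * multiplier; // step 6
    }
}
```

For example, a body estimated at 20 bytes replacing a 5-byte call is rejected by default (20 > 5 * 1) but accepted inside a loop (20 <= 5 * 5).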
What this means is that, by default, only methods that do not grow the call site will be inlined; however, if the code is in a loop, it can grow by as much as 5x.
What does this mean for Steven’s test?
It means that simple tests based solely on IL size are not accurate. First, what is important is the native code size, not the IL size; more importantly, a method is much more likely to be inlined if its call site is in a loop. In particular, if you modify Steven's test so that the methods are called from inside a loop, all of his test methods do get inlined.
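For example, here is a hypothetical reworking of the harness that puts the X18 call site in a loop (the method is made static here for brevity). On the 3.5 SP1 x86 JIT, inspecting the disassembly in Visual Studio at the call site should show the body inlined:

```csharp
using System;

class Tests
{
    static void Throw(int v) { throw new Exception(v.ToString()); }

    public static void X18(int a)
    {
        if (a < 0 || a == 100) Throw(a * 2);
    }

    static void Main()
    {
        // The call site is inside a loop, so the JIT allows the
        // inlined body to be up to 5x the size of the call it replaces.
        for (int i = 0; i < 1000000; i++)
            X18(i % 100);   // arguments stay in 0..99, so Throw never fires
        Console.WriteLine("done");
    }
}
```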
To be sure, the heuristics are not perfect. The worst case is a method that is too big to be inlined but is still called A LOT (it is in a loop), and that calls other small methods that COULD be inlined but are not, because they are not in a loop. The problem is that the JIT does not know whether the method is called a lot, and by default it does not inline in that case. We are considering adding an attribute to a method that gives a strong hint that the method is called a lot and thus would bump the multiplier much as a loop does, but this does not exist now.
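A hint of exactly this shape did later ship: .NET 4.5 added MethodImplOptions.AggressiveInlining, which the last comment below uses. A minimal sketch of how such a hint looks (the class name is invented; the method body is taken from that comment):

```csharp
using System.Runtime.CompilerServices;

static class Hints
{
    // AggressiveInlining (available from .NET 4.5) tells the JIT to
    // bypass its size-based heuristics for this method, much like
    // the loop bonus does for call sites inside loops.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool IsControl(char c)
    {
        return (c >= 0 && c <= 31) || (c >= 127 && c <= 159);
    }
}
```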
We definitely are interested in feedback on our inlining heuristic. If Steven or anyone else finds real examples where we are missing important inlining opportunities, we want to know about them so we can figure out whether we can adjust our heuristics (but please keep in mind that they are heuristics. They will never be perfect).
Comments
Anonymous
August 19, 2008
I think more JIT hints DEFINITELY need to be added. Some of the main ways to influence JIT compilation would be: a set of inlining attributes, such as the MSVC __forceinline keyword, which would allow us to have better control and optimise for when we know better than the JIT.
An attribute saying how much we want the JIT to optimise. If I have code that'll run in a tight loop (such as the AI logic in a game), and I need to squeeze every cycle out of it, I can afford an extra second or two during a load screen for the JIT to spend optimising it. The attribute could be an enum of optimisation levels, or a time threshold in milliseconds for which to spend optimising, or something similar. This could also apply to NGEN.
Following on from the point above, I'm not sure if something like this exists yet, but provide a method that would ensure a method is JITted, similar to the RuntimeHelpers method that calls class constructors. This, combined with an optimisation level as an overloaded argument, would be very powerful.
Anonymous
August 19, 2008
-1 for adding attributes/hints. I don't want my code littered with compiler directives! How about using adaptive optimisation which makes improvements as the code runs? I would rather the compiler figure out what's really happening at runtime than force a developer to describe what they think will happen. Jim
Anonymous
August 19, 2008
The comment has been removed
Anonymous
August 21, 2008
How does this heuristic differ from the one before the service pack?
Anonymous
August 21, 2008
Thanks Vance for your fast response and letting us know what's happening under the hood.
>> Increase the multiplier if it looks like struct optimizations will kick in.
Can you tell more about the heuristics of these struct optimizations? I'm very interested in them!
Anonymous
August 21, 2008
If you are interested in learning more about the struct (value type) inlining work, see the JIT team's blog at http://blogs.msdn.com/clrcodegeneration/archive/2007/11/02/how-are-value-types-implemented-in-the-32-bit-clr-what-has-been-done-to-improve-their-performance.aspx
Anonymous
August 21, 2008
The inlining heuristic used before was relatively simple. Most things under 32 bytes of IL got inlined IF they did not hit limitations in the inliner. One significant limitation involved value types, and another was that only one conditional branch was allowed and no loops. There were other limitations that occur more rarely... Those limitations are gone now (and in fact we are relatively aggressive about inlining value types now). I think it is fair to say that we tend to inline less than we did before outside of loops, but significantly more than we used to if we are in a loop. Vance
Anonymous
August 27, 2008
I'm a bit curious what the "typical code" looks like that you test CLR performance with. I'd guess that it's mostly the kind of application where performance isn't that important in the first place. I've yet to see any computationally intensive code (e.g. simulations, numerical linear algebra, text/media processing) where (even forced) inlining actually had a negative effect. Improved code locality and new optimization opportunities almost always trump any negative effect of the increased code size.
I don't understand why some people argue against attributes that could hint to the JIT that a developer really, really wants a particular method inlined or that the JIT should spend extra time optimizing a particular method. If you find attributes ugly, don't use them. If you have concerns about their security or about performance when executed in the web browser of your mobile phone, disable them (with an appropriate global option). The currently employed heuristics are far from perfect. On the other hand, manually measuring whether inlining improves performance is a trivial task. So why not give the users the choice?
Now, if you took performance really seriously you would give users the option to compile their CLR code with a full-blown native-code compiler, ideally one supporting profile-guided optimizations (think Ngen with the optimizer back-end from Visual C++).
Anonymous
August 27, 2008
stephan++. I am under the impression that Phoenix will soon be able to (or already can) do PGO with .NET assemblies. Combine this with its awesome MSIL-to-native compilation, and that's pretty much what stephan describes.
Anonymous
August 27, 2008
The comment has been removed
Anonymous
September 22, 2008
static T whatever = null;
public T Whatever
{
    get
    {
        if (whatever == null)
        {
            // loads of code
            whatever = ...
        }
        return whatever;
    }
}
It might give a nice boost if the JIT were able to inline everything but the if statement's body, because that is what usually gets executed, except the first time. I have no numbers that could indicate how large an impact this would make. But I do know that we use "globals" via ThreadStatic scopes a lot. There are few of them, but those get executed all the time. I bet a lot of LoB apps are like that (application/transaction context etc). I leave it to more gifted people to figure out how this could be detected. (Structurally? Runtime profiling?) Just an idea. Would it even make sense to take a similar approach as TraceMonkey in the CLR? Or is the JIT so fast already that tracing would introduce a disproportionate overhead?
Anonymous
September 22, 2008
Thinking about it, it should be quite simple to manually structure those methods so that the default inlining strategy does just that. Maybe this is just an awareness thing.
if (whatever == null) InitializeWhatever();
return whatever;
Anonymous
September 25, 2008
First of all, the inlining issue is not yet fixed for x64: see my comment in this feedback item: https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=93858
With respect to the heuristics: I think the heuristics make sense, but there should be some way to override them with an attribute. That might not be the most elegant solution, but some of us need that performance NOW and not in 5 years when a profile-guided optimizer for the CLR JIT is available.
Another problem is that the code generation of the JIT is just horrible. A major point of inlining is that it enables subsequent optimizations. But when you look at the generated assembly code, you often see many completely redundant instructions. For example, in floating point code you often see completely unnecessary pairs of stores/loads in tight loops like this:
0000001b fstp qword ptr [ebp-10h]
0000001e fld qword ptr [ebp-10h]
Not only does this destroy the performance, but it also makes the native code larger and thus prevents the inlining heuristics from kicking in.
Anonymous
September 25, 2008
One more thing: loops are not the only way for a piece of code to be called multiple times. For example, if you do some kind of tree traversal, usually the best way to do it is recursion. I think a heuristic should be added so that if a method calls itself recursively, the same boost as with a loop applies. You need this information anyway for doing tail call optimization, right?
Anonymous
October 07, 2008
On the .NET platform, most compiler optimizations are not performed by the VB and C# compilers. Instead, they defer the optimization work until the CLR's Just-In-Time (JIT) compiler reads the IL and converts it to native machine code. Because of this...
Anonymous
March 25, 2009
The comment has been removed
Anonymous
February 25, 2010
Thank you for this article Vance. Could you explain to me the following surprising behavior: this method is NOT inlined
public static float ConvertCoordinateFromDegreeToMm(float coordinateValue, float radius)
{
    return coordinateValue * radius * DEGREE_TO_RADIAN_COEF;
}
whereas this one is!
public static float ConvertCoordinateFromDegreeToMm(float coordinateValue, float radius)
{
    float result = coordinateValue * radius * DEGREE_TO_RADIAN_COEF;
    return result;
}
I'm using VS 2008 Pro, .NET 3.5 SP1 on 32-bit Win XP Pro, and my CPU is a Core 2 Duo P8600. I've tried calling the method from a loop or not, and with different sizes of the calling site; I get the same results. I'm checking inlining within the VS disassembly window, having unchecked "Suppress JIT optimization on module load". Thank you for your answer!
Anonymous
July 16, 2015
I'm surprised the following is not being inlined unless using the 4.5 attribute AggressiveInlining:
static int Main(string[] args)
{
    return DoSomething(10000000);
}
static int DoSomething(int repetitions)
{
    int trues = 0;
    for (char i = (char)0; i < repetitions; i++)
    {
        if (IsControl(i)) trues += 1;
    }
    return trues;
}
public static bool IsControl(char c)
{
    return ((c >= 0 && c <= 31) || (c >= 127 && c <= 159));
}
N.B. This is derived from stackoverflow.com/.../709537 , adding the loop (to boost likeliness of inlining) and actually making use of the return value (to prevent complete elision).