RyuJIT CTP5: Getting closer to shipping, and with better SIMD support
Hi Folks! Yes, we understand it’s been a while since we shipped the last RyuJIT CTP. We have been working hard on improving our SIMD support and getting RyuJIT to ship quality for the next version of the .NET Framework. So without further ado, here’s a quick description of what you can expect from RyuJIT CTP5.
We have spent a lot of time finding and fixing those last few pesky corner-case functional issues in RyuJIT. Fortunately, we have the luxury of having many internal partners with significant managed codebases, making it easy to throw as much managed code as we can find at RyuJIT. While some of the issues we have found are legitimate bugs, others are not so clear cut. For example, we have found that JIT64 accepts some illegal IL that the ECMA spec disallows. Since backward compatibility is a major concern for us, we evaluate these issues on a case-by-case basis to decide whether we should quirk RyuJIT to accept the same illegal IL.
Real-World Throughput Wins
In case you missed the original blog post announcing the first CTP, RyuJIT beats JIT64 handily in terms of throughput while staying very competitive in terms of code quality (CQ). Recently the Bing team tried running RyuJIT on top of 4.5.1 in some of their processing, and they saw a 25% reduction in startup time in their scenario. This is the most significant real-world throughput win we have witnessed with RyuJIT thus far. :)
We didn’t publish any benchmark results with RyuJIT CTP4, so here are some graphs to show that we haven’t regressed CQ in RyuJIT CTP5. However, since CQ hasn’t been the focus for this CTP, we haven’t made any significant improvements either.
These graphs follow the same basic format as previous ones. The higher the bar, the better RyuJIT CTP5 does on that benchmark. The grey area is the standard deviation, so any result that falls inside the grey area should be treated as noise.
What’s New in JIT Support for SIMD types?
RyuJIT CTP5 supports acceleration of the latest version of the Vector APIs available via NuGet here. This version contains a number of changes that were requested by developers.
One of the most popular requests was to publicly expose the fields of the fixed-size vector types (e.g. Vector2.X). Why wasn’t this done originally? The short answer is that it was for performance, but really it was to make it easier for the JIT to handle all the references to these types as intrinsics, and to transform them into the appropriate target instructions. It’s a tricky business, however, to determine where to allocate a local Vector instance for best efficiency:
- If the instance will be primarily used in Vector intrinsics, putting it in an xmm/ymm register is the best option.
- If the instance will primarily be referenced via its fields, then either putting it in memory, or separately allocating its fields to registers, is the best option.
- If the instance is larger than 8 bytes (i.e. not a Vector2), and it is primarily passed as a method argument, then putting it in memory is the best option.
With CTP5 we have made the JIT a bit smarter about identifying these field accesses, analyzing the usage of the vector instance, and selecting among these options, but there is still room for improvement, so you may find that some SIMD code runs more slowly with this new release.
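To make the first two usage patterns above concrete, here is a minimal C# sketch, assuming the version of the System.Numerics vector types that exposes public fields. The class and method names are hypothetical and exist only for illustration. The comments say which allocation choice each pattern would favor according to the list above; they are not a guarantee of what the JIT actually does for this code.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration class; not part of any shipped API.
public static class SimdUsageSketch
{
    // Pattern 1: the local is touched only through Vector3 operators and
    // static methods. Per the list above, this usage favors keeping the
    // value in an xmm/ymm register.
    public static float DotOfSum(Vector3 a, Vector3 b, Vector3 c)
    {
        Vector3 sum = a + b;          // vector add
        return Vector3.Dot(sum, c);   // vector dot product
    }

    // Pattern 2: the value is referenced only through its fields.
    // Per the list above, this usage favors keeping it in memory or
    // allocating the individual fields to registers.
    public static float LengthSquaredViaFields(Vector3 v)
    {
        return v.X * v.X + v.Y * v.Y + v.Z * v.Z;
    }

    public static void Main()
    {
        var a = new Vector3(1f, 2f, 3f);
        var b = new Vector3(4f, 5f, 6f);
        var c = new Vector3(1f, 0f, 0f);

        Console.WriteLine(DotOfSum(a, b, c));          // (5, 7, 9) . (1, 0, 0) = 5
        Console.WriteLine(LengthSquaredViaFields(a));  // 1 + 4 + 9 = 14
    }
}
```

Code that mixes both patterns on the same local is the case where the JIT has to weigh the options against each other, which is where a slowdown is most likely to show up.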
We’ve also improved register allocation for SIMD types, reducing a number of cases where we had unnecessary copies of vector registers.
Since we are talking about SIMD performance, it wouldn’t be fair not to include some SIMD benchmark results. We are using the sample code here as our SIMD benchmarks. (However, note that we are using an updated version of RayTracer which uses our latest Vector APIs. We’ll update the sample shortly.)
Stay tuned – we are continuing to work on performance for SIMD types, including tuning of inlining heuristics for SIMD methods, and improved dead store elimination. We’ll also be diving into the usage data from Bing and other internal partners to see how we can improve the performance of RyuJIT even more on both throughput and CQ.
In case you need them again, you can refer to this blog post for the instructions to turn on RyuJIT, and this blog post for instructions on using SIMD. Note that if you are running on the 4.5.2 version of the .NET Framework, you can use RyuJIT CTP5 on Windows Vista, 7, 8, and 8.1 as well as Windows Server 2008, 2008 R2, 2012, and 2012 R2. However, RyuJIT CTP5 currently doesn't work on Visual Studio "14" CTP4. You don't need it anyway, since RyuJIT is enabled by default on Visual Studio "14" CTP4. :) (The version of RyuJIT in Visual Studio "14" CTP4 is slightly older than this CTP, but not by much.)