Quick info about a great SIMD writeup

Hi, folks. I wanted to put together a more coherent version of a random Twitter conversation I had with Sasha Goldshtein.

First, you should go read Sasha's excellent write-up on SIMD: https://blogs.microsoft.co.il/sasha/2014/04/22/c-vectorization-microsoft-bcl-simd/. He's done an excellent job of talking about how scalar & vector operations compare, and really dug in deep to performance of the generated code.

Okay, now let me explain a few things. Sasha's looking at the code quality coming out of the preview: we know this code quality is less than optimal. Here's what I'm seeing from the same code with a compiler I built about 30 minutes ago (Siva, one of the JIT devs, just got CopyTo recognized as an intrinsic function):

 ; parameters: RCX = a, RDX = b (as before)
    sub    rsp,28h
    xor    eax,eax                          ;rax = i (init to zero)
    mov    r8d,dword ptr [rcx+8]            ;r8d = a.Length
    test   r8d,r8d
    jle    bail_out                         ;r8d <= 0?
    mov    r9d,dword ptr [rdx+8]            ;r9d = b.Length
loop:
    lea    r10d,[rax+3]                     ;r10d = i + 3
    cmp    r10d,r8d                         ;we still have to range check the read from the array of &a[i+3]
    jae    range_check_fail
    mov    r11,rcx
    movupd xmm0,xmmword ptr [r11+rax*4+10h] ;xmm0 = {a[i], a[i+1], a[i+2], a[i+3]}
    cmp    r10d,r9d                         ;and we have to range check b[i+3], too...
    jae    range_check_fail                 ;this one too :-(
    movupd xmm1,xmmword ptr [rdx+rax*4+10h] ;xmm1 = {b[i], b[i+1], b[i+2], b[i+3]}
    addps  xmm0,xmm1                        ;xmm0 += xmm1
    movupd xmmword ptr [r11+rax*4+10h],xmm0 ;{a[i], a[i+1], a[i+2], a[i+3]} = xmm0
    add    eax,4                            ;i += 1 * sizeof(float)
    cmp    r8d,eax                          ;a.Length > i?
    jg     loop
bail_out:
    add    rsp,28h
    ret
range_check_fail:
    call   clr!JIT_RngChkFail (00007ff9`a6d46590)
    int    3

So the most egregious code quality issue is gone (we're going to try and get an update out quickly: I'm saying mid-May, but things could change :-/). That gives me numbers like this:

 Scalar adding numbers: 765msec
 SIMD adding numbers: 210msec

MUCH better. But Sasha makes another interesting point: why did we create two versions of the scalar loop, but left those pesky bounds checks in the SIMD loop? Well, for starters, the two functions aren't identical. The SIMD version will actually throw a bounds check failure exception. To fix that, we actually need to change (some might say "fix" :-)) the code:

 static void AddPointwiseSimd(float[] a, float[] b)
{
 int simdLength = Vector<float>.Length;
 int i = 0;
 for (i = 0; i < a.Length - simdLength; i += simdLength)
 {
 Vector<float> va = new Vector<float>(a, i);
 Vector<float> vb = new Vector<float>(b, i);
 va += vb;
 va.CopyTo(a, i);
 }
 for (; i < a.Length; ++i)
 {
 a[i] += b[i];
 }
}

And now you might be expecting some great revelation. Nope. Sasha's right: we're missing some fairly straight-forward bound checks here. And we're not dual-versioning the loop, like RyuJIT does for normal scalar loops. Both items are on our to-do list. But the original code Sasha wrote would probably never have those optimizations. Hopefully, we'll get our bounds-check elimination work strengthened enough to make this code truly impressive. Until then, you'll have to limp along with only a 3.6x performance boost.