Larry gets taken to task on concurrency
In yesterday's blog post, I was taken to task, first by Bob Frankston and then by Raymond Chen, about comments I made, and I want to talk a bit about their comments.
Here are the relevant sections:
From Bob Frankston (yeah, the Bob Frankston, I'm honored):
I accidentally glanced at the comments and noticed the comment about 64 bit operations not being atomic. I guess I've been around too long and don't see any reason to assume any such operation is atomic. If the CLI doesn't make this promise then you can't assume it for any operation.
With HT processors we're going to see a lot of subtle bugs appear and they'll all be blamed on mysterious viruses or maybe ghosts left in the CPU from processes that died horrible deaths.
Seriously the current processors are the result of years of single threaded programming. Maybe we should make atomicity a well-defined part of the architecture instead of just assuming it because the program seems to work.
I responded:
Your point is actually a huge issue. On our currently supported 32bit processors, 32bit aligned memory operations are guaranteed (by the processor) to be atomic. However, in general, you're totally right - on 64bit processors this is explicitly NOT true - on those machines the rules on write ordering have been significantly relaxed, which can cause all sorts of issues.
[...]
Your comment about the ordering issue is why I suggested using the InterlockedExchangeAdd API - it asserts a memory barrier after the add to ensure that all (logical and physical) processors have a consistent view of the data.
Btw, the biggest likely item to fail in the presence of misordered writes is the Double Checked Lock Pattern - here's a great discussion of why it fails on modern processors: https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
The problem is that there is a lot of existent code that uses it that will start breaking in really subtle ways.
And then Raymond chimed in (slightly edited):
"The problem with this is that 64bit math isn't atomic on 32bit processors."
Not even 32-bit math is atomic on 32-bit processors. Given
static int i = 0;
// Thread A
i += 2;
// Thread B
i += 3;
the final result might be i=2, or it could be i=3, or it could be i=5. If you want atomic arithmetic, you must use an Interlocked function.
Of course, Bob and Raymond are totally right.
And this shows just how hard it is to understand concurrency issues. In fact, both Raymond and I are correct in our assertions (he more than I, because his claim is more conservative than mine). This is because he and I are talking about different aspects of the concurrency issue. My comments (about 64bit operations not being atomic) have to do with the fact that a 32bit processor has no way of accessing a 64bit value atomically. What that means is that when a 32bit processor writes a 64bit value, it does it in two memory operations.
And that means that between the two writes, another thread can come in and modify the value. And that leads to corruptions. This can't happen on current 32bit processors because they guarantee write ordering and write atomicity.
Raymond's comments have to do with memory accesses on machines with more than one CPU core. His scenario can't happen on the currently supported single cored 32bit architectures (it can on some of the 64bit architectures due to relaxed write ordering issues). But if you've got more than one cpu core (either a HT machine or two distinct CPUs), then you can absolutely see it happen.
The reason for this is that most arithmetic (and logical) operations are read-alter-write operations: they read a value from memory, perform some operation on it, and then write the value back. If some other thread can come in during the read-alter-write operation, then any of the results that Raymond called out can occur. For instance, in his example above, thread B reads the value of i (0) before thread A updates it, adds 3 to 0, and writes back 3 - wiping out thread A's update.
Having said that, a variant of Raymond's scenario can fail even on single cored processors. The reason his code as written doesn't fail on a single cored processor is that he's adding a constant to a memory value, and the processor has an instruction for adding a constant directly to a memory address:
add [i], 3
But it doesn't take very many changes to the code to make it break on single cored machines:
static int i = 0;
int j = 2;
int k = 3;
// Thread A
i += j;
// Thread B
i += k;
Then Raymond's scenario will fail even on a single processor machine. The reason is that performing a read-alter-write operation involving two memory variables requires more than one instruction:
mov eax, [j]
add [i], eax
If there's more than one thread modifying j, then BAM!™ - we've got a problem: the value read into eax can be stale by the time the add executes.
So, when I wrote my original comment, I was thinking about the memory write case - relying on the fact that on 32bit processors, a natively aligned 32bit memory read is guaranteed to be atomic (and on 64bit processors, a 64bit aligned memory read is similarly guaranteed to be atomic). On currently supported 32bit processors, however, a 64bit memory read is NOT guaranteed to be atomic.
Please note that I'm calling out reads explicitly - memory writes are guaranteed to be atomic on the current crop of 32bit processors, but that is not the case for all the currently supported 64bit processors - they allow write misordering - see the paper on the double checked lock pattern linked above for more details on why this is a bad thing.
I'm going to spend a bit of time over the next few days talking about concurrency and the issues associated with concurrency. The good news is that with a relatively small set of common sense rules, most of the issues I talked about become irrelevant (unless you're looking to squeeze your scalability to the absolute highest levels).