Stress vs. CHESS

Hey there... it's time for another post. My name is Sebastian Burckhardt, I am a member of the CHESS team, and I want to tell you about some experiments I ran today. The question that interests me here is how does CHESS compare to stress testing ?

First, let me clarify the difference between the two. In general, the purpose of stress testing is to find all sorts of deficiencies under heavy load, including performance problems, excessive memory consumption, or insufficient reliability under failures. For example, we may create stress by using very large files or databases, or by processing a very large number of requests.

In contrast, CHESS specifically targets concurrency errors such as race conditions, synchronization errors or deadlocks. Even though such errors have nothing to do with stress per se, programmer often use stress testing to find them. Why is that so? I believe the answer is that stress can create unusual interleavings. When under heavy load, the relative speed of threads changes because some operations are slowed down while others are not. This helps to uncover synchronization errors. Thus, we can consider improved interleaving coverage a (secondary) effect of stress testing. Clearly, it's not the original purpose of stress. Nor is arbitrary creation of "stress" very efficient, as we will see later.

CHESS on the other hand is all about systematic coverage of interleavings. It's its primary purpose! So, while it will definitely not find problems related to files being too large, it does a terrific job at finding concurrency errors.

We're now going to try to find a bug with both stress testing and with CHESS (here is where the fun starts). First, go ahead and download & install CHESS (or just keep reading). Then, open the Account example (use VS to open Samples/Managed/BankAccount/Bank.sln). It contains the buggy "Account" class:

image    image

The bug is in the Withdraw() method: the Read() should happen inside the critical section. Here is the unit test that is designed to this bug:

     [TestClass]
    public class TestBank
    {
        [TestMethod]
        public void WithdrawAndDepositConcurrently()
        {
            var account = new Account(10);
            var child = new Thread(
               o => { (o as Account).Withdraw(2); }
               );
            child.Start(account);
            account.Deposit(1);
            child.Join();
            
            Assert.AreEqual<int>(9, account.Read());
        }

When running this test, a child thread is created. The original thread and the child thread then both make method calls to the same account object (calling Deposit(1) and Withdraw(2), respectively). As the starting balance was 10, the ending balance should be 9 if all goes well. However, because of the bug, there is a chance that we end up with a balance of 8 (because of the following scenario: the child thread calls Withdraw(2), reads a balance of 10, then gets preempted; the original thread now calls Deposit(1), updating the balance to 9. Finally, the child thread resumes, writing a balance of 8).

What are our chances of hitting the bug with this test?

As you may have guessed, they are not very good. You can try running this test a couple of times... it never failed for me. So, next step: stress. We'll start by simply repeating the test, first 1 thousand times, then 10 thousand times, then 100 thousand times. Note that we're using Release build type here (you may want to use smaller numbers for Debug build type).

 [TestMethod]
public void Repeat1k()
{
    for (int i = 0; i < 1000; ++i)
        WithdrawAndDepositConcurrently();
}
 [TestMethod]
public void Repeat10k()
{
    for (int i = 0; i < 10000; ++i)
        WithdrawAndDepositConcurrently();
}
 [TestMethod]
public void Repeat100k()
{
    for (int i = 0; i < 100000; ++i)
        WithdrawAndDepositConcurrently();
}

We also add a CHESS test. Note that we need no loop here. CHESS has stress "built-in" because it already tries to systematically cover all interleavings, implicitly repeating the test many times. In fact, a loop would be a bad idea, since it would overwhelm CHESS.

 [TestMethod]
[HostType("Chess")]
public void RunOnceWithChess()
{
    WithdrawAndDepositConcurrently();  // no need to write a loop
}

Now, let's run this (choose Tests->Run->All Tests in Solution from the menu) and look at the results, sorted by test duration.

image

As you can see, the original test (WithdrawAndDepositConcurrently) and the shortest stress test (Repeat1k) did not find the bug, but CHESS (RunOnceWithChess) and the two larger stress tests (Repeat10k, Repeat100k) both did.

Note that CHESS was a little faster than stress. Still, stress testing seems to work. So why do we need CHESS?

The answer is: because even under stress, the scheduler tends to heavily favor some interleavings over others. Specifically, there is some basic notion of fairness among threads, which makes it unlikely that a thread that is ready to execute is held back for a very, very long time.

We can demonstrate this by making a small change to the Account class, by adding a call to WasteTimeOnMeaninglessComputation() to the Deposit() method.

 public void Deposit(int amount)
{
    WasteTimeOnMeaninglessComputation();
    lock (this)
    {
        balance = balance + amount;
    }
}

public void WasteTimeOnMeaninglessComputation()
{
    double dummy = 0;
    for (int i = 0; i < 50000; i++)
    {
        dummy += Math.Sin(i);
    }
    System.Console.WriteLine(dummy);
}

Of course, the Account class is just as broken as it ever was. We have just added some arbitrary delay to the thread calling the Deposit method. We can easily imagine how that may happen in a real program (for example, some heavy initialization may take place there, or the user may instrument the program to output lots of trace information to the console, maybe even in an attempt to debug a Heisenbug).

 

The interesting thing is that this little change throws off the stress testing:

image

Even with a 100k repetitions (which now take a lot longer), stress testing did not find the bug. Of course, CHESS is completely unimpressed by our change and finds the bug just as quickly as before.

So, what's the point? Is stress a bad thing? No. Clearly it does work, at least sometimes. Maybe if we try a couple more times, maybe if we run more tests, maybe if we change some numbers, maybe if we run on a different machine we'll find the bug (some teams run stress tests for days). Indeed, we tried the same tests and some small variations on a couple machines here and sometimes the bug was found. But then again, maybe we will never find it (there is no guarantee that running forever will eventually try all schedules, because the operating system scheduler is not a perfectly random thing). It's just not reliable, and definitely not quick. Also, it's not fun to reproduce and fix a bug that manifests only after days of running on some special machine.

I believe this simple experiment is making a strong case for using CHESS to find concurrency bugs in unit tests.