General Races and Data Races

In a multithreaded program, races are rally bad. The reason for this is that a program is compiling correctly, tests are passing and it runs in production for months, and then, all of a sudden, the program returns a result that is everything but the expected one. Two maybe most known race bugs are the Therac-25 radiation machine which ended up killing more than one person and American Blackout of 2003 where the race bug resulted in 50 million people losing their electric power. To show how bad a race bug can be is the fact that, in the case of the radiation machine, the bug was first introduced in the older version, Therac-20, but was newer detected and was simply copied to the new, Therac-25 model.

Determinism

Before we take a look at races and what they are we should remind our selves of some basic concepts, deterministic and non-deterministic execution. A program executes deterministically when on the same input it always outputs the same result regardless of how the instructions are scheduled on the multicore/multiprocessor system. Basically, the program always behaves the same way on the same input. On the other hand, the output of a nondeterministic program on the same input changes from run to run.

It is important to say that non-deterministic programs are not necessarily incorrect, they can be non-deterministic by design. A multithreaded program is always non-deterministic in some sense.

Races

Generally, when there is some shared memory between multiple threads that are not coordinated, races represent situations where the correctness of the output on the same input is affected by the timing of instruction execution and/or ordering of those operations on that same memory location. To put it simply, with a race bug, our program will run correctly only if we are lucky (that is something like winning the lottery level lucky). Without proper synchronization between the threads that share some state, there can be basically two types of a race, general race and data race. The primary distinction between them is in the types of problems they introduce.

Data Race

A data race (sometimes called access anomaly)  is a bug that happens in a program that is non-deterministic by design but which can produce incorrect results due to an invalid state of some variable. Generally, a data race occurs when there are multiple threads without coordination accessing shared memory location and at least one of them does a write operation.

long timestamp;

void updateTimestamp(long newTimestamp) {
     this.timestamp = newTimestamp;
}

long getTimestamp(){
   return this.timestamp
}

Java long and double variables are 64-bit values but there are (often) implemented as two 32-bit values. This means that a write to a long/double variable is actually two write operations to two adjacent memory locations which represent single long/double value (check JLS chapter 17.7). Because of this, in this example there is a data race bug: we have a shared variable timestamp that can be read and written to by multiple threads. Let’s say thread T1 wants to update our timestamp and it updates the first 32-bit part of the value and the OS scheduler suspends it. After that other thread T2 wants to read timestamp by calling getTimestamp() but now gets a corrupted value.

Non-determinism in this example is that we don’t know when and by who timestamp will be updated (and we do not care) but this data race bug will produce corrupted state for our non-deterministic program. Introducing proper synchronization in this example will not remove that non-determinism but just keep the shared memory in a correct state.

To fix this bug we can make timestamp to be a volatile variable (writes to volatile variables are always atomic) or we can use AtomicLong object from java.util.concurrent.atomic package.

volatile long timestamp;

The problem introduced in the above example is known as out-of-thin-air safety (the variable gets some random because of incomplete update). Another common problem with a data race bug is so-called stale-data bug, where the update of some variable by one thread is not visible to other threads. Meaning, that a thread that reads that shared variable sees the stale or out-of-date value of the variable.

With data races, proper synchronization ensures that all operations in a critical section on some shared state are executed atomically by a single thread. This basically means that when thread T starts to operate on a shared object, nothing can change that object in the middle of its execution. And besides that, proper synchronization ensures that updates are visible to other threads.

General Race

When we have, or when we want a deterministic execution, general race bug (also known as a race condition, critical race or determinacy race) causes non-determinism in the program execution. This basically means that because of some random, non-deterministic, thread scheduling done by the OS, our computation may produce the wrong results.

if(foo == 5){
   doOnlyIfFooIsFive();
   foo = 0;
}

In this example, our deterministic execution means that we want to run doOnlyIfFooIsFive() only when the variable foo is 5 (variable  foo is a shared variable, visible to all threads). But if this piece of code is run by multiple threads we have a general race bug.

The bug here is that, let’s say thread T1 evaluates that foo is indeed 5 and right after that the OS scheduler suspends it and gives the CPU time to some thread T2. Then T2 runs and executes the same code but it runs to completion, meaning it had CPU time to perform all operations including setting the shared variable foo to 0. Now the scheduler runs T1 again from the point where it was suspended (evaluating that foo == 5 is true) and here we have a problem. At this point in time, foo is no longer 5, it was set to zero by T2 and now T1 executes doOnlyIfFooIsFive() method that should not be executed. Here, the timing (which we can not control in any way) or rather, unlucky interleaving of particular operations gave us the wrong result of the whole computation.

This example, in particular, also known as a check-then-act operation, is maybe the most common situation for a general race bug to happen.

A fix for this problem is to coordinate checking and acting upon that check with a mutex (Java explicit Lock in this example).

m.lock()
if(foo == 5) {  
   doCriticalSomething();
   foo = 0; 
}
m.unlock()

With general races, proper synchronization ensures that operations are executed in a correct and expected order, that way providing us with the determinism which we want. Important to note is that these operations themselves can be synchronized but failing to execute them in a correct order can still produce a general race bug. In our example above, reading and writing to a variable foo can be atomic (with no data race) but that still does not prevent a general race bug from happening.

Summary

Both, general race and data race bugs can make our multithreaded program fail in unpredictable ways. Although sometimes it can be confusing to make a difference between them, the main distinction is that a general race is introduced where the order of particular operations (regardless if they are synchronized or not) is violated with unlucky interleaving and timing giving us the wrong results. With a data race, the order is not important but when some shared state is modified that the modification is executed in a single atomic operation which prevents that our program runs with the corrupted state. Proper synchronization is needed to eliminate both of these.

If you want even deeper understanding, I highly encourage you to check this article.