How To Make Your SPSC Ring Buffer Do Nothing, More Efficiently
In my original post Writing a Fast and Versatile SPSC Ring Buffer I suggested that there was a better way of waiting for data to be written, and buffer space to be available, than just continually checking the other thread’s progress. Now I am going to elaborate and get into the dirty details.
A marginal improvement is to tell the processor that's doing the waiting to use fewer resources. On Windows you would use `YieldProcessor`, which on Intel CPUs translates to the `pause` instruction. You can find similar instructions on other architectures.
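As a sketch (not code from the ring buffer itself; `CPU_YIELD` and `SpinWaitFor` are illustrative names), a spin-wait using these hints might look like this:

```cpp
#include <atomic>
#include <cstddef>
#if defined(_MSC_VER)
#include <windows.h>     // YieldProcessor
#define CPU_YIELD() YieldProcessor()
#else
#include <immintrin.h>   // _mm_pause on x86
#define CPU_YIELD() _mm_pause()
#endif

// Spin until the other thread's position passes `target`,
// hinting to the CPU that we are in a spin-wait loop.
void SpinWaitFor(const std::atomic<size_t>& pos, size_t target)
{
    while (pos.load(std::memory_order_acquire) < target)
        CPU_YIELD();
}
```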
Yielding can free up cycles for another thread running on the same physical core (if the CPU uses hyper-threading), but it doesn't help if the thread you are waiting on isn't currently running. Dealing with that requires whoever is scheduling the threads – typically the operating system – to be aware of the dependency. In your program, you use synchronization objects like semaphores to make a thread go idle and wake it up again when you have some useful work for it.
For those who are unfamiliar with semaphores, they essentially allow threads to wait for a signal from another thread. They are more expensive than the atomic operations I used in the original ring buffer implementation, but the advantage is that once you are waiting you no longer spend CPU cycles doing so because the thread is suspended.
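For illustration, here is a minimal (and deliberately unoptimized) semaphore built on a mutex and condition variable, with the same `signal()`/`wait()` interface used in the snippets below; a production implementation like the one linked at the end avoids taking a lock on the fast path:

```cpp
#include <condition_variable>
#include <mutex>

// Minimal counting semaphore, for illustration only.
class Semaphore
{
public:
    void signal()
    {
        std::lock_guard<std::mutex> lock(m_Mutex);
        ++m_Count;
        m_Cond.notify_one();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(m_Mutex);
        m_Cond.wait(lock, [this] { return m_Count > 0; });
        --m_Count;
    }

private:
    std::mutex m_Mutex;
    std::condition_variable m_Cond;
    int m_Count = 0;
};
```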
Ideally we only want to use semaphores when the producer or consumer thread is actually blocked. In other words, the buffer should run locklessly when not waiting and switch out of lockless mode to wait.
Setting up the semaphores
In the previous version, each thread's shared state was simply a position. It looks like we are going to need a semaphore per thread as well, and a `bool` to keep track of whether the other thread is waiting for a signal.
```cpp
struct alignas(CACHE_LINE_SIZE) SharedState
{
    std::atomic<size_t> pos;        // this thread's progress
    std::atomic<bool> shouldSignal; // set by the other thread to request a wake-up
    Semaphore semaphore;            // this thread sleeps here; the other one signals it
};
```
The idea is that we are going to check the bool every time we finish writing something or reading something:
```cpp
void RingBuffer::FinishWrite()
{
    // Publish how far we have written; the release store pairs with
    // the consumer's acquire load.
    m_WriterShared.pos.store(m_Writer.base + m_Writer.pos, std::memory_order_release);
    // If the consumer asked to be woken, clear the request and signal it.
    if (m_WriterShared.shouldSignal.exchange(false))
        m_ReaderShared.semaphore.signal();
}
```
The call to `exchange` returns the existing value of `shouldSignal` while also resetting it to `false`.
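For completeness, the mirror-image `FinishRead` might look like the following sketch, assuming the naming conventions above (this is the symmetric case spelled out, not code quoted from the implementation):

```cpp
void RingBuffer::FinishRead()
{
    // Publish how far we have read, so the producer can reuse the space.
    m_ReaderShared.pos.store(m_Reader.base + m_Reader.pos, std::memory_order_release);
    // If the producer asked to be woken when space frees up, signal it.
    if (m_ReaderShared.shouldSignal.exchange(false))
        m_WriterShared.semaphore.signal();
}
```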
Requesting a signal
Instead of checking the written position in an indefinite loop, we now sleep on the consumer thread and ask the producer to wake us up the next time it has written some data:
```cpp
for (;;)
{
    size_t writerPos = m_WriterShared.pos.load(std::memory_order_acquire);
    size_t available = writerPos - m_Reader.base;
    // Signed comparison (available can be negative).
    // `end` and the m_Reader fields come from the surrounding function
    // and class, which are not shown here.
    if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
    {
        m_Reader.end = std::min(available, m_Reader.size);
        break;
    }
    // Nothing to read yet: ask the producer to wake us, then sleep.
    m_WriterShared.shouldSignal.store(true);
    m_ReaderShared.semaphore.wait();
}
```
The producer thread goes through the same steps, but with the reader and writer state swapped. The code is symmetrical except for the fact that the writer starts out with an empty buffer to write to, so its version of `available` adds the size of the buffer.
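Concretely, the producer's loop might look like this sketch, which mirrors the consumer code above and assumes `m_Writer.size` holds the buffer's capacity:

```cpp
for (;;)
{
    size_t readerPos = m_ReaderShared.pos.load(std::memory_order_acquire);
    // Free space is what the consumer has released plus the buffer capacity.
    size_t available = readerPos + m_Writer.size - m_Writer.base;
    if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
    {
        m_Writer.end = std::min(available, m_Writer.size);
        break;
    }
    // Buffer full: ask the consumer to wake us, then sleep.
    m_ReaderShared.shouldSignal.store(true);
    m_WriterShared.semaphore.wait();
}
```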
Pseudocode review
Let’s revisit the steps in `FinishWrite`/`FinishRead`:
1. Store position
2. Check if other thread is waiting
3. Signal other thread if needed
And similarly `GetBufferSpaceToReadFrom`/`GetBufferSpaceToWriteTo`:
1. Read other thread’s position
2. Return if there’s enough data (or empty buffer space)
3. Request that the other thread signals us
4. Wait for signal
5. Loop to 1
This looks sane… or does it?
Seeing it all fall apart
You may have noticed an issue with the algorithm, especially if you are a veteran in lockless programming. It turns out that it’s possible for a thread to miss a position update before going to sleep, while the thread that was supposed to wake it up misses the request to do so. For example, the consumer could check for incoming data just before the producer updates its progress, and the producer could then check whether it should send a signal before the consumer requests one.
Let’s go through that scenario again using bullet points:
- Consumer reads producer’s position
- Producer updates position
- Producer checks if consumer needs a signal
- Consumer asks for a signal and goes to sleep
In lockless code, there are no guarantees as to how the instructions running on two different threads will be interleaved, so you have to assume that they will be interleaved in the worst possible way. In fact, it’s even worse since in this case both threads may see an older version of the opposite thread’s position.
I said earlier that the positions are written and read using release/acquire ordering, which only guarantees that they are consistent with the data they refer to. It does not mean that the values are visible instantaneously. Sadly, it’s perfectly possible for both threads to write out their new positions, check each other’s positions, and fail to see the latest progress. As you can imagine, this leads to deadlocks.
Changing the position updates to use a stricter memory order would mean that only one thread can initially get stuck, but that’s not a great solution since one stuck thread can still cause a deadlock. For example, you might wait for the consumer thread to finish a particular task by having it send a signal back to the producer when it’s done. However, if the consumer thread got stuck while receiving the task data, you are going to wait forever unless you send more data.
Another scenario for a deadlock is when the producer waits due to the buffer being full, and the consumer frees up the buffer quickly enough that it doesn’t notice the producer waiting. This would likely only happen if you filled up the entire buffer with one big element. In this case the consumer never wakes up the producer thread, which never produces more data for the consumer thread, which means both threads are forever stuck waiting.
Picking up the pieces
Fortunately, there is a solution. It involves checking the position again after requesting a signal and before going to sleep:
```cpp
for (;;)
{
    size_t writerPos = m_WriterShared.pos.load(std::memory_order_acquire);
    size_t available = writerPos - m_Reader.base;
    // Signed comparison (available can be negative).
    if (static_cast<ptrdiff_t>(available) >= static_cast<ptrdiff_t>(end))
    {
        m_Reader.end = std::min(available, m_Reader.size);
        break;
    }
    m_WriterShared.shouldSignal.store(true);
    if (writerPos != m_WriterShared.pos.load(std::memory_order_relaxed))
    {
        // Position changed after we requested a signal, so try to cancel.
        if (m_WriterShared.shouldSignal.exchange(false))
            continue; // Request successfully cancelled; re-check from the top.
        // The producer already took the request and will signal (or has
        // signaled), so fall through and consume the signal before retrying.
    }
    m_ReaderShared.semaphore.wait();
}
```
In case we missed an update, we see if we can cancel the request for a signal by changing the bool back to `false`. We then jump back to the beginning of the `for` loop and check if we got the data we needed. It’s possible that the producer already saw the request and preemptively woke us up, so in that case we fall through to the semaphore `wait()` before starting over.
The reason this works is that either the producer’s position is written out before the consumer sets `shouldSignal` to `true`, which means the consumer gets the latest position on the second attempt, or the producer sees that `shouldSignal` was already set to `true` and signals the consumer.
Anything that happens before or after setting/checking `shouldSignal` will be observed to happen in that order on either thread. This is because it uses the default memory order for atomic operations (sequentially consistent), which is stricter than the ordering used for the positions.
Call for comments
Checking the position again seems like a fairly obvious solution to a race condition in a lockless ring buffer, and almost a little too easy. I assume that others have used the same approach with varying degrees of confidence in its robustness.
Part of my motivation for publishing this is to have more people examine it and either tell me why it shouldn’t work or that it does indeed work. I wrote a ring buffer based on the same concept for Unity’s multithreaded rendering, so even if it isn’t supposed to work in theory, at least I can prove that it works in practice.
It’s quite likely that there is literature out there on how to block when a lockless data structure is empty or at capacity, or indeed that Dijkstra wrote a paper about it in the 1970s (Dijkstra invented the semaphore concept). However, it seems like the kind of problem that almost everyone ignores.
Conclusion
I find it intensely satisfying that you can get the benefits of a lockless ring buffer while also using proper synchronization for waiting instead of keeping the CPU busy. It feels better to not burn a lot of cycles without accomplishing anything.
Compared to the original algorithm, calling `FinishWrite()` and `FinishRead()` has become more expensive because they now use an atomic exchange operation – in addition to the atomic store with release ordering, which is just a regular store on Intel CPUs. Expect about an order of magnitude difference. The best usage of this ring buffer is to not call those functions for tiny amounts of data, but to amortize the cost over slightly bigger chunks so you can still push data through fast.
I think that despite the added cost, this version is a significant improvement if you care about stable performance and are not in complete control of the environment you are running in. Unless you always have free processors for both threads to run on, you can get pathologically bad performance hiccups when the thread that’s allowed to run has to wait on the thread that’s not running.
In case you really care about performance, look for situations where you don’t need to guarantee that the data is consumed right away. For example, when you know that you will always write more data later, even if the consumer thread nodded off the last time you sent it some data.
Source code
The full source code is here. It depends on a semaphore class, e.g. Jeff Preshing’s, which is here.