Writing a Fast and Versatile SPSC Ring Buffer
I wanted to share some notes about how to write a circular queue/ring buffer designed for a single producer thread and a single consumer thread (SPSC). What’s more, I will try to describe a ring buffer that can be used for lots of different things and is still very fast. Quite often you want to use the buffer for commands that can have variable sizes. However, most ring buffers that I have seen published only seem to support fixed-size elements.
Ideally you should be able to send anything through the buffer without incurring a performance cost for that extra flexibility. You can have data that varies both in size and in the alignment it requires. With a bit of luck, we can support that and still beat other ring buffers on speed. I am assuming that you’re using a language like C++ that lets you be explicit about things like memory layout and thread synchronization.
Another design goal is that data should be available for reading as soon as the producer wants it to be available. We don’t want to wait until the producer thread has filled out a pre-defined “page size” before we can consume it. There may be reasons why you would want to use pages, but here we are optimizing for low latency. Low response time is often more important than throughput in real-time applications like games and VR.
Usage
Let’s look at a sample use case. It turns out that you may want separate function calls for writing data and actually submitting the data to the consumer thread. Likewise, you may want separate calls for reading data and telling the producer that you finished reading it. For example, your producer thread can write commands like this:
// Send command to process array of items.
queue.Write(Command::ConsumeItems);
queue.Write(itemCount);
queue.WriteArray(items, itemCount);
queue.FinishWrite();
Processing the commands looks something like this:
Command cmd = queue.Read<Command>();
switch (cmd)
{
    case Command::ConsumeItems:
    {
        int itemCount = queue.Read<int>();
        const Item* items = queue.ReadArray<Item>(itemCount);
        ConsumeItems(items, itemCount);
        queue.FinishRead();
        break;
    }
}
The consumer thread is allowed to access the items in the buffer until it explicitly calls FinishRead(). This avoids having to copy the items out of the buffer. The finish functions don’t need to be called at the same frequency, but they do need to be called often enough that the buffer doesn’t fill up with unfinished data.
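To illustrate (the SetValue command here is made up for the example), several writes can be batched and published by a single call:

// Two commands written back to back; nothing is visible to the consumer
// until FinishWrite() publishes them both at once.
queue.Write(Command::SetValue);
queue.Write(42);
queue.Write(Command::SetValue);
queue.Write(7);
queue.FinishWrite();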
Wrapping
For simplicity, I will require that the size of the buffer is a power of two. It’s certainly possible to support non-power-of-two sizes and still make it fast, but it makes the wrapping logic that much more complicated. On the other hand, I will not cut corners on making sure that the buffer supports writing an unlimited amount of data. Sometimes that requires taking extra care with comparisons. It would be a tragedy if a size_t overflow caused the buffer to break down, even if that’s a lot of bytes on 64-bit computers.
When you allow variable-sized elements, you have to decide what to do about elements that partially go past the end of the buffer. One approach is to use a magic ring buffer where the virtual memory past the end of the buffer is mapped to the beginning of the buffer. Another approach is to just leave a gap at the end of the buffer and start over from the beginning. I find that totally reasonable, especially if the buffer is large compared to the element sizes.
Note that if you leave gaps at the end of the buffer, it’s important that you read exactly the same element sizes as you originally wrote. If you write 4 bytes twice, you can’t read those elements as one 8-byte element, because that would behave differently when wrapping.
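Here is a minimal sketch of the gap approach, with a hypothetical helper name and the check against the reader’s position left out:

// Decide where an element of `size` bytes should start, given the writer's
// current offset and the buffer size. If the element would run past the
// physical end of the buffer, leave the tail unused and restart at zero.
inline size_t WrapForWrite(size_t writePos, size_t size, size_t bufferSize)
{
    return (writePos + size > bufferSize) ? 0 : writePos;
}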
Buffer API
In the code example above, I introduced a Write() function to write a single element and a similar Read() function. There was another pair of functions for arrays, WriteArray() and ReadArray(). I obviously promised to be able to write various kinds of data, so these are all template functions that accept any copy-constructible type.
The previous functions can be implemented using a pair of lower-level functions that prepare a number of bytes in the buffer for writing (or reading). They also need to be aware of the alignment, since certain types need to be aligned to N-byte boundaries.
Consider something like this:
class RingBuffer
{
public:
    // Allocate buffer space for writing.
    void* PrepareWrite(size_t size, size_t alignment);

    // Publish written data.
    void FinishWrite();

    // Write an element to the buffer.
    template <typename T>
    void Write(const T& value)
    {
        void* dest = PrepareWrite(sizeof(T), alignof(T));
        new(dest) T(value);
    }

    // Write an array of elements to the buffer.
    template <typename T>
    void WriteArray(const T* values, size_t count)
    {
        void* dest = PrepareWrite(sizeof(T) * count, alignof(T));
        for (size_t i = 0; i < count; i++)
            new(static_cast<T*>(dest) + i) T(values[i]);
    }
The new is the C++ placement new operator, which doesn’t allocate memory but initializes an unused piece of memory with a copy of an object. alignof is a C++11 keyword similar to sizeof, except it returns the required alignment of a type.
Here’s what the API for reading would look like:
    // Get read pointer. Size and alignment should match written data.
    void* PrepareRead(size_t size, size_t alignment);

    // Finish and make buffer space available to writer.
    void FinishRead();

    // Read an element from the buffer.
    template <typename T>
    const T& Read()
    {
        void* src = PrepareRead(sizeof(T), alignof(T));
        return *static_cast<T*>(src);
    }

    // Read an array of elements from the buffer.
    template <typename T>
    const T* ReadArray(size_t count)
    {
        void* src = PrepareRead(sizeof(T) * count, alignof(T));
        return static_cast<T*>(src);
    }
Local and shared state
The first step to writing a fast ring buffer is not being clever with atomic operations, but thinking about what state is required by the producer and consumer. Each thread needs some information that is local to itself and some that is shared.
For optimal performance, you want to avoid reading data that may be simultaneously written by another thread. In the API above, you ideally want PrepareWrite() and PrepareRead() to only access local state, since that will be much faster when both threads are active. Local state can be kept fully in the L1 cache, or even in registers, while changes to shared state have to be expensively synchronized across different processors.
Separating local and shared state is trickier than you might expect because of the risk of false sharing, where values that are close to each other in memory affect each other’s read and write performance. In situations like this, where the memory layout can make a big difference, you want to force each type of state to take up a multiple of entire cache lines.
Assume we have a define for the typical cache line size – a reasonably safe value to use on modern platforms is 64 bytes. An initial version of the producer and consumer threads’ local state might look like this:
struct alignas(CACHE_LINE_SIZE) LocalState
{
    char* buffer;
    size_t pos;
    size_t end;
};

LocalState m_Writer;
LocalState m_Reader;
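The define itself can be as simple as this (hard-coding 64 is an assumption that matches current x86 processors; C++17 also offers std::hardware_destructive_interference_size as a portable hint):

#define CACHE_LINE_SIZE 64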
It makes sense that you want a pointer to the buffer memory and the writer or reader’s current position, but what about the end variable? You could store the opposite thread’s last known (finished) position, but then things get complicated when it goes past the end of the buffer and wraps to the beginning. What you want is a fast way to decide if you can write or read data right now, with as little logic as possible. For that purpose it’s better that pos and end represent the window in the buffer that’s immediately available without wrapping or checking the other thread’s latest state.
The function for allocating buffer space for writing ends up looking something like this, with a single branch for handling non-trivial cases in a separate function:
void* RingBuffer::PrepareWrite(size_t size, size_t alignment)
{
    size_t pos = Align(m_Writer.pos, alignment);
    size_t end = pos + size;
    if (end > m_Writer.end)
        GetBufferSpaceToWriteTo(pos, end);
    m_Writer.pos = end;
    return m_Writer.buffer + pos;
}
Moving complexity like wrapping the position and waiting for buffer space to be available into GetBufferSpaceToWriteTo() means we can aggressively inline PrepareWrite() without bloating code size.
Alignment
There’s one thing left in the function above which is fairly annoying, and that is spending time aligning the position for every write. In trivial cases this may be optimized away by the compiler, if it can see that the previous write was aligned. You can also ignore alignment altogether if you never write data which has to be aligned to a buffer that might be misaligned.
Another approach that I quite like is rounding all sizes up to a multiple of a minimum alignment, e.g. 4 bytes, so you don’t need to realign the position for types whose alignof is less than or equal to that. This requires PrepareWrite() to be inlined so the compiler knows about the alignment at compile time.
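For completeness, here is a minimal sketch of the Align() helper that PrepareWrite() relies on, assuming alignments are always powers of two:

// Round pos up to the next multiple of alignment (must be a power of two).
inline size_t Align(size_t pos, size_t alignment)
{
    return (pos + alignment - 1) & ~(alignment - 1);
}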
Synchronizing state
Note that none of the code so far needed to synchronize values across different threads, because all the data was local. The only state that needs to be shared is the finished position for the writer and reader. I will be using std::atomic along with alignas to prevent false sharing:
struct alignas(CACHE_LINE_SIZE) SharedState
{
    std::atomic<size_t> pos;
};

SharedState m_WriterShared;
SharedState m_ReaderShared;
FinishWrite() will look something like this:
void RingBuffer::FinishWrite()
{
    m_WriterShared.pos.store(m_Writer.pos, std::memory_order_release);
}
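By symmetry, FinishRead() would look something like this, publishing the reader’s finished position so the writer can reuse that buffer space:

void RingBuffer::FinishRead()
{
    m_ReaderShared.pos.store(m_Reader.pos, std::memory_order_release);
}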
See this page on cppreference.com for a discussion of the different memory orderings. It’s worth noting that memory_order_release is guaranteed by default on strongly-ordered platforms like Intel x86, so it doesn’t generate different code from a regular store or load. Using std::atomic is still a good idea for documenting the code and preventing the compiler from reordering writes and reads (which it is allowed to do for ordinary, non-atomic variables).
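The matching acquire sits on the other side: when a thread runs out of its local window, it refreshes it from the opposite thread’s published position. A sketch for the reader, with a hypothetical helper name and wrap handling left out:

void RingBuffer::RefreshReadWindow()
{
    // Pairs with the release store in FinishWrite(): every buffer write made
    // before that store is guaranteed visible after this load.
    m_Reader.end = m_WriterShared.pos.load(std::memory_order_acquire);
}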
Blocking
The read functions in the proposed API may have to block until some data has been written, and the write functions may block because the buffer is full. An alternative API could have used TryWrite() and TryRead() functions, which are allowed to fail when running out of data or buffer space, but that gets tedious if you want to communicate any complex information through the ring buffer.
How do you actually implement waiting for the opposite thread? The simplest way is to just keep checking the shared position in a loop. This is mostly fine if you know the producer and consumer threads are running on different processors, although you will burn unnecessary CPU cycles.
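A sketch of that loop for the reader, again with a hypothetical helper name and wrapping ignored for clarity:

void RingBuffer::WaitForData(size_t wantedEnd)
{
    // Spin until the writer has published at least wantedEnd bytes.
    while (m_WriterShared.pos.load(std::memory_order_acquire) < wantedEnd)
    {
        // Busy-wait. On x86, inserting _mm_pause() here is kinder to the
        // processor's power use and pipeline.
    }
}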
Waiting in a tight loop is a bad idea if the producer and consumer happen to be scheduled on the same processor. In that case, you might spend a lot of CPU time on the wrong thread while the thread being waited on is prevented from running.
It turns out that this is solvable, but the solution is not trivial. You can wait on an event or semaphore, though you only want to do that when you are actually blocked, since doing it all the time would be far too slow. The tricky part is how to request that the event should be signalled. I may get back to that in a later post.
Next up: Performance Results.