Writing a Fast and Versatile SPSC Ring Buffer – Performance Results
In the previous post I proposed an API for a ring buffer that can be used for a variety of different types of data. I also suggested that it can be implemented without sacrificing efficiency compared to ring buffers that support only one type of data.
The secret was supposedly keeping thread-local state and shared state in separate cache lines and only accessing the shared state when you absolutely have to. So how big a difference does this make?
I wrote a benchmark which writes a billion integers into a 1 megabyte ring buffer on one thread while simultaneously reading the integers on another thread. Here are the results from my MacBook Pro:
| Cache line separation | Integers per Finish call | Time in seconds |
|---|---|---|
| No | 1 | 36.0 |
| No | 16 | 7.8 |
| Yes | 1 | 1.7 |
| Yes | 16 | 1.2 |
Note: You can find the full implementation of the buffer here. I left out some details in the last post, so you may notice that `LocalState` contains additional variables. I store the buffer pointer and size redundantly in the reader and writer state, since this uses less memory than sharing that data, which would introduce an extra cache line. Pretty unintuitive!
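For illustration, here is a minimal sketch of that layout. The field names are hypothetical (see the linked implementation for the real definitions); the point is just which state shares a cache line with which:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch only: names are hypothetical. Each thread's hot state gets its own
// cache line, and each shared position gets one too, so the producer and
// consumer only contend on the shared lines when the Finish functions run.
struct alignas(64) LocalState {
    uint8_t* buffer;       // stored redundantly per thread: cheaper than
    size_t   size;         // sharing it, which would introduce an extra
                           // shared cache line
    size_t   position;     // this thread's uncommitted read/write position
    size_t   cachedOther;  // last observed position of the other thread
};

struct alignas(64) SharedPosition {
    std::atomic<size_t> value;  // published by one thread, read by the other
};

struct RingBuffer {
    LocalState     writer;         // touched only by the producer
    LocalState     reader;         // touched only by the consumer
    SharedPosition writePosition;  // producer publishes here in FinishWrite()
    SharedPosition readPosition;   // consumer publishes here in FinishRead()
};
```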
I removed alignment support for the benchmark and that shaved off a fraction of a second. The versions with and without cache line separation are identical except for the memory layout of the ring buffer state, which obviously makes a huge difference.
The second column shows how often `FinishWrite()` and `FinishRead()` are called. There's a decent incremental improvement from calling them less frequently, even when the data is laid out well.
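As a sketch of what batching looks like on the producer side, assuming a `Write`/`FinishWrite` interface along the lines of the previous post (the names here are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical usage sketch: Write/FinishWrite follow the API proposed in
// the previous post. Batching 16 writes per FinishWrite() means the shared
// write position is published (and its cache line bounced between cores)
// once per batch instead of once per element.
template <typename Buffer>
void ProduceBatched(Buffer& rb, const int* values, size_t count) {
    constexpr size_t kBatch = 16;
    for (size_t i = 0; i < count; i += kBatch) {
        const size_t n = std::min(kBatch, count - i);
        for (size_t j = 0; j < n; ++j)
            rb.Write(values[i + j]);  // touches only thread-local state
        rb.FinishWrite();             // one shared-state publish per batch
    }
}
```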
Checking out the competition
I was curious how these results compare to other implementations, so I looked at one described here, which is a fairly simple fixed-element-size ring buffer. I changed the element type to `int`, fixed various typos in the code, and ran the same test of pushing through a billion integers. This takes a little over 100 seconds.
What’s going on? You might notice that both ring buffers use `std::atomic`, but my version uses “release” (and “acquire”) memory order while the other uses the default. The default is sequentially consistent ordering, which offers strict guarantees about the globally observed order of changes to variables. Those guarantees shouldn’t be needed for an SPSC ring buffer. Both versions are “lockless”, but there are lots of subtleties in how to write fast lockless algorithms.
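To make the difference concrete, here is a minimal sketch of the release/acquire handshake in an SPSC setting. This is not a complete ring buffer (the producer-side capacity check is omitted); it only illustrates the ordering:

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the release/acquire handshake only; capacity checking is omitted.
// The release store makes the preceding element write visible to any thread
// whose acquire load observes the new position, without imposing the single
// global order of all atomic operations that sequential consistency requires.
std::atomic<size_t> writePos{0};
int slots[1024];

void Push(int value) {                      // producer thread only
    size_t pos = writePos.load(std::memory_order_relaxed);
    slots[pos & 1023] = value;              // plain store into the slot
    writePos.store(pos + 1, std::memory_order_release);  // publish it
}

bool TryPop(size_t& readPos, int& out) {    // consumer thread only
    if (readPos == writePos.load(std::memory_order_acquire))
        return false;                       // nothing new published yet
    out = slots[readPos & 1023];            // safe: acquire pairs with release
    ++readPos;
    return true;
}
```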
Changing the other ring buffer’s memory order makes it run in about 30 seconds. I haven’t looked closely at whether it’s safe, but it worked well enough for the benchmark. There also appeared to be an off-by-one error originally, but after replacing the `<= size` comparison with `< size`, the same billion values come out as were put in.
I can improve the speed further by replacing the modulo of the size with `& (size - 1)`, based on the fact that I know the size is a power of two. That brings the time down to about 9.5 seconds. Beyond that, there’s not much I can do without essentially rewriting the code. Even though it looks simpler, the other buffer is actually quite a bit slower than my proposal for a ring buffer that supports variable element sizes.
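For reference, the trick looks like this (my own illustration, not the other buffer’s code):

```cpp
#include <cstddef>

// Illustration: with a power-of-two size, wrapping an index is a single AND,
// whereas a general modulo may compile to a much slower division.
constexpr size_t kSize = 1 << 20;
static_assert((kSize & (kSize - 1)) == 0, "size must be a power of two");

size_t Wrap(size_t index) {
    return index & (kSize - 1);  // equivalent to index % kSize for this size
}
```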
Boost
Boost has another fixed-element-size SPSC ring buffer implementation. There’s padding so the read and write positions occupy different cache lines, but there’s still a lot of contention between the threads, since both access the positions simultaneously. It runs the billion-integer test in 42 seconds out of the box.
I figured that this was probably because of poor inlining, since the code seems fairly well written, so I tried again with `-Ofast` optimizations. For the other tests I ran release builds with default settings and manually forced the buffer functions to be inlined. Xcode defaults to optimization level `-Os`, which balances speed with small size. `-Ofast` brings the time down to about 12 seconds.
The Boost ring buffer does support writing arrays, so you can create an `spsc_queue` of `char` and serialize various types of data that way. However, you can’t directly copy C++ objects into the buffer and use them in-place when reading from the buffer.
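For example, here is a sketch of serializing a struct through a byte queue. The `Event` type and the copy-in/copy-out framing are my illustration; the pointer-and-length `push`/`pop` calls are part of Boost.Lockfree’s documented `spsc_queue` interface:

```cpp
#include <boost/lockfree/spsc_queue.hpp>
#include <cstring>

// Illustration: Event and the framing scheme are hypothetical. Note the
// memcpy on both sides: the object is copied out of and back into local
// storage rather than being used in-place inside the queue.
struct Event { int id; float value; };

boost::lockfree::spsc_queue<char, boost::lockfree::capacity<1 << 20>> queue;

bool WriteEvent(const Event& e) {  // producer thread
    char bytes[sizeof(Event)];
    std::memcpy(bytes, &e, sizeof(Event));
    // A partial push would tear the record; real code must handle that.
    return queue.push(bytes, sizeof(Event)) == sizeof(Event);
}

bool ReadEvent(Event& e) {         // consumer thread
    char bytes[sizeof(Event)];
    if (queue.pop(bytes, sizeof(Event)) != sizeof(Event))
        return false;              // partial reads also need real handling
    std::memcpy(&e, bytes, sizeof(Event));
    return true;
}
```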
Memcpy
It looks like the approach I describe is pretty fast compared to alternative ring buffer implementations, but how close is it to the theoretical limit, i.e. the speed of `memcpy`? I created a benchmark that copies integers in 1 gigabyte blocks, first into a shared buffer on the producer thread and then out of the shared buffer on the consumer thread. This takes about 1.1 seconds per billion integers.
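A rough sketch of that baseline (the names are mine, and the handoff that makes the consumer wait for the producer’s copy is omitted):

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the memcpy baseline; synchronization between the two steps is
// omitted. The producer copies a 1 GB block into shared memory, then the
// consumer copies it back out, so every integer is copied twice, just as it
// is when it passes through a ring buffer.
constexpr size_t kBlockBytes = size_t(1) << 30;  // 1 gigabyte

void ProducerStep(char* shared, const char* source) {
    std::memcpy(shared, source, kBlockBytes);       // source -> shared buffer
}

void ConsumerStep(char* destination, const char* shared) {
    std::memcpy(destination, shared, kBlockBytes);  // shared buffer -> out
}
```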
Note that I’m cheating a little, since the 1 megabyte ring buffer I used is small enough to stay in the L3 cache on my test hardware (but not the L2 cache). The `memcpy` speed is for uncached memory, since I’m copying a gigabyte at a time. As with the other tests, I try to get a reliable measurement by running the benchmark in a loop and only looking at the result after it has stabilized.
Conclusion
It turns out that ring buffers can get pretty close to `memcpy` speed. Here is a summary of the results of writing a billion ints on one thread and reading them on another:
| Ring buffer | Notes | Time in seconds |
|---|---|---|
| `spsc_lockless_sequential.cpp` | Original | >100 |
| `spsc_lockless_sequential.cpp` | Optimized | 9.5 |
| `spsc_queue.hpp` | Xcode default | 42 |
| `spsc_queue.hpp` | LLVM `-Ofast` | 12 |
| `RingBuffer_v1.hpp` | 1 int at a time | 1.7 |
| `RingBuffer_v1.hpp` | 16 ints at a time | 1.2 |
| `memcpy()` | Gigabyte blocks | 1.1 |
Having the producer and consumer threads constantly accessing the buffer is obviously quite a stress test, so in other scenarios the differences would be smaller.