

Multi-Threaded Programming 5: A Lock-Free Queue Implementation

In the last post, we looked at an example of a locking queue that used two locks to make it safe to use from multiple threads. This time, we are going to see what it takes to turn that into a lock-free queue and discuss why each of those changes has to happen.

Again, this is implemented in C++ using C++11 techniques and compiles under Visual Studio Express 2013. This is partly a pet project to get more familiar with the C++11 way of doing things, so let me know if you have any thoughts on how I am doing in that regard.

Disclaimer: Any code that is provided is for learning purposes only. I make no guarantees that it will work for your specific project or platform. It has passed some standard tests on two x86-64 systems (one Intel and one AMD). All code is released under the MIT license, so feel free to use it however you see fit.

I strongly recommend reading the previous post if you have not, as this builds directly off of it; the earlier posts in the series cover everything else I have discussed so far.

Unbounded Lock-Free Queue

We are going to stick with a queue so that you can easily see the changes that have to be made in order to safely remove the locks. This brings a lot more complexity and fragility, since everything has to happen in a specific order to guarantee that things keep working the way you expect. We will also stick with an unbounded queue, which keeps memory allocation as part of the implementation. Depending on the allocator you are using, this may involve locks, so you could easily argue that this isn’t a true lock-free implementation. However, there are memory allocators that make heavy use of thread-local storage (such as Intel’s TBB allocator) and avoid those locks in most cases. I am also not protecting against possible ABA issues, mainly because I haven’t encountered them in my testing yet and want to hold off on introducing code to prevent that situation. I hope to tackle these issues in the next post.

The basic design and logic of the queue remain the same as before. Again, this is to provide the reader with implementations that are as identical as possible except for their locking strategy. The interface to the queue likewise stays consistent. I am also sticking with an int datatype for the examples, but will provide links to the full templatized code at the bottom. With that said, let us look at what needs to change for a lock-free strategy.

 

Enqueue


 

As with our previous implementation, the first thing we do is allocate a node to store our data in. Then you will notice that we enter an infinite loop, which is a common strategy with lock-free code. While locking code will often loop waiting for a certain condition to be met, lock-free code typically wants to make at least some progress with each iteration of the loop. If it is interrupting another thread, a lock-free algorithm may help the interrupted thread get its work partially finished, so that when the current thread loops again it has everything in place to do its own job properly. With our enqueue this may happen if we interrupt a thread that has added a node to the end but not yet had the chance to update the tail. If the current thread notices the queue in this state, it will update the tail itself and then attempt to add its node.

Anyway, when we are allocating our node you will notice that we release the data from the unique_ptr instead of moving it. This is because unique_ptr itself is not thread-safe, so we need to store the data directly by pointer. We could store it in a shared_ptr, whose reference counting is thread-safe, if we wanted, but I am not doing that because the node is completely contained within the queue. Because of this, I know that nothing else will be touching the data or its location, and so I can use normal pointers throughout and save a little performance.

At the beginning of our loop, we get local copies of our tail and its next value (which should be nullptr). Because there aren’t any locks, the state can change at any moment, so we need to verify everything each step of the way. This is why we check that our tail is still our local value and that its next value is still nullptr. We then verify this yet again via a compare_exchange (the Compare-And-Set atomic operation originally discussed a few posts ago) when we go to set the next value to the node we are adding to the queue; if that fails, another thread got its node in first and we simply loop around and try again. If it succeeds, then all that is left to do is swing the tail to our new node, which is also done with a compare_exchange. If that second compare_exchange doesn’t work, all it means is that another thread has already updated the tail for us, which is fine by us.
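Putting those steps together, the enqueue ends up looking roughly like the sketch below. The names used here (Node, m_head, m_tail, and so on) are illustrative stand-ins rather than the exact code; the full templatized source linked at the bottom is the authoritative version.

    // Sketch only. Assumes a Node holding "int* data" and "std::atomic<Node*> next",
    // plus queue members std::atomic<Node*> m_head and m_tail (see the Node section below).
    void Enqueue(std::unique_ptr<int> value)
    {
        // Take ownership as a raw pointer; unique_ptr itself is not thread-safe
        // to share, and the node never leaves the queue anyway.
        Node* newNode = new Node(value.release());

        for (;;)
        {
            Node* tail = m_tail.load();
            Node* next = tail->next.load();

            // The state may have changed while we were reading; if the tail
            // moved, just start the loop over.
            if (tail != m_tail.load())
                continue;

            if (next == nullptr)
            {
                // The tail really is the last node; try to link our node after it.
                if (tail->next.compare_exchange_weak(next, newNode))
                {
                    // Swing the tail to our new node. If this fails, another
                    // thread has already done it for us, which is fine.
                    m_tail.compare_exchange_strong(tail, newNode);
                    return;
                }
                // Another thread linked its node first -- loop and try again.
            }
            else
            {
                // An enqueue on another thread has linked a node but not yet
                // updated the tail; help it along before retrying our insert.
                m_tail.compare_exchange_strong(tail, next);
            }
        }
    }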

A quick note: since this is intended for x86-64, which has strong memory ordering guarantees and where a compare_exchange will not fail spuriously, there actually isn’t any difference between the weak and strong versions of compare_exchange. However, if this were to be used on a platform with a weaker memory model, then I believe the way it is coded here would be the best way to implement it.

Dequeue

 

For dequeuing our data with a lock-free approach, again we begin with an infinite loop. This time we store off the head, the tail, and the node that we wish to get the data from. With our locking implementation we didn’t have to worry about the tail at all, but here we have to account for the fact that another thread could be in the middle of adding to the list while we are attempting to remove from it, in which case we might be in the best position to finish its work. We check for this possibility after ruling out the case where the queue is genuinely empty.

However, if there are nodes to get, then we first find out where our data is located and only then attempt to remove the first node. Only if that succeeds can we delete the former head node and return the proper data. If we were to remove the head before finding out where the data was, we could run into a situation where another thread removes our data-containing node and deletes it before we have a chance to get at it.
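In rough code form, that dequeue logic looks something like the sketch below, using the same illustrative names as the enqueue sketch above; again, the linked source at the bottom is the real version.

    // Sketch only. Returns false if the queue looked empty.
    bool Dequeue(std::unique_ptr<int>& out)
    {
        for (;;)
        {
            Node* head = m_head.load();
            Node* tail = m_tail.load();
            Node* next = head->next.load();   // the node that actually holds our data

            // If the head moved while we were reading, start over.
            if (head != m_head.load())
                continue;

            if (head == tail)
            {
                // The queue looks empty...
                if (next == nullptr)
                    return false;             // ...and really is empty.

                // ...but an enqueue is mid-flight: finish its tail update,
                // then loop around and try again.
                m_tail.compare_exchange_strong(tail, next);
            }
            else
            {
                // Grab the data pointer *before* unlinking; once another thread
                // wins the race below, it may delete the node we are looking at.
                int* data = next->data;
                if (m_head.compare_exchange_weak(head, next))
                {
                    out.reset(data);          // hand ownership back to the caller
                    delete head;              // the former head is now ours to free
                    return true;
                }
                // Lost the race -- another thread dequeued first; try again.
            }
        }
    }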

 

Node

 

Our node looks similar, but as mentioned earlier we are using a normal pointer to refer to our data. The other big difference is the use of std::atomic for the pointer to the next node. This is because we need to perform atomic operations on it to make our algorithm lock-free, and it also ensures that the appropriate memory barriers are used when it is accessed. Because we are running on x86-64, there is very little CPU reordering to worry about, and the writes that matter here are all atomic operations, which won’t be reordered. The compiler, however, might still reorder reads or writes, so the atomic type is still useful for making sure we aren’t surprised by any optimizations it might make. In short, using the atomic variable type helps enforce the intended behavior no matter what platform we are programming for.
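To make the sketches above concrete, the node and the queue’s skeleton might be declared something like the following; the stub node in the constructor is what lets the dequeue always treat the current head as a placeholder for already-consumed data. As before, the exact declarations live in the full source linked below.

    #include <atomic>
    #include <memory>

    struct Node
    {
        explicit Node(int* value) : data(value), next(nullptr) {}

        int*               data;   // plain pointer -- the node never leaves the queue
        std::atomic<Node*> next;   // atomic so threads can safely compare_exchange the link
    };

    class LockFreeQueue
    {
    public:
        LockFreeQueue()
        {
            // Start with a single stub node so head and tail are never null;
            // the head always points at the most recently consumed node.
            Node* stub = new Node(nullptr);
            m_head.store(stub);
            m_tail.store(stub);
        }

        void Enqueue(std::unique_ptr<int> value);   // as sketched in the Enqueue section
        bool Dequeue(std::unique_ptr<int>& out);    // as sketched in the Dequeue section

        // A destructor that walks the remaining nodes and deletes them is
        // left out here for brevity.

    private:
        std::atomic<Node*> m_head;
        std::atomic<Node*> m_tail;
    };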

 

Recap and Next Time

As you can see, getting the ordering of operations correct in a lock-free algorithm requires close attention. Moving to lock-free effectively squares your problem space, since you have to consider what every other thread might be doing at any point in the code. Handling this can be quite difficult, and it means you can introduce some very subtle bugs even with the best of care. So lock-free code should only be used in areas where there is a lot of overlapping thread traffic and you really need the performance gain. But I hope these examples helped illuminate some of what goes on behind the scenes when writing lock-free code.

Next time we are going to look at how our locking and lock-free code compare in terms of performance. We will also see if we can squeeze some more performance out of them by restricting the number of nodes, since in games we generally have a good idea of how many we might be working with, and we will look at the other complexities that this introduces.

 

Full Source

The full source for this can be obtained at the links below. It is a more generic, template-based version of the code discussed above, so hopefully it will be a bit more useful for you.