Sunday, November 15, 2015

Thread cancelling caveats in pthreads

This is one of the posts I moved from my previous blog. I am preparing a long article about network drivers in Linux but it seems that it will take some more time. So I am posting this article I wrote a few years ago to keep the blog alive.

Today I will talk about two somewhat weird things I ran into while working with condition variables in pthreads.

  1. If a thread gets cancelled while blocked in pthread_cond_wait(), the cancelled thread re-locks the condition's mutex before exiting. (As a result, no one else can lock it again.)
  2. The pthread_cleanup_push() macro doesn't compile unless a matching pthread_cleanup_pop() macro appears in the same block of code.

If you read the man pages or some mailing list discussions, they will convince you that this is the way things should work. But I think these are caveats one should be aware of.

I ended up using these functions/macros while working on a somewhat unusual multi-threaded server design. Some threads wait (blocked) on a condition variable, and when some other thread signals the condition they are allowed, one after another, to read some data protected by a regular mutex. However, a thread waiting on the condition may have to be terminated in its sleep.

A brief explanation of how the pthread_cond_wait() and pthread_cond_signal() / pthread_cond_broadcast() functions work:

You need two things when using conditions: a pthread_cond_t and a regular pthread_mutex_t mutex. The mutex must be locked when you call pthread_cond_wait() with the respective pthread_cond_t, and it is usually held while signalling as well.

When you lock the mutex and call pthread_cond_wait(), the function atomically unlocks the mutex and blocks indefinitely until the condition is signalled or broadcast by another thread. Other threads can use the mutex as they like during this time. Most likely some of them will also use it to wait on the same condition.

When another thread signals or broadcasts the condition, one of the waiting threads (or all of them, for a broadcast) becomes runnable again. When that happens, the blocked pthread_cond_wait() call locks the mutex again before returning. After that, you unlock the mutex in the thread's code so that the other woken threads can acquire it and return from their own pthread_cond_wait() calls.
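The wait/signal sequence described above can be sketched roughly like this. The names (`lock`, `ready_cond`, `ready`, `shared_value`, `wait_signal_demo`) are made up for illustration and are not from my server code. The `while (!ready)` loop is an addition of mine to this sketch: POSIX allows pthread_cond_wait() to wake up spuriously, so the usual advice is to re-check a flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical names, chosen for this sketch only. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;      /* predicate, protected by lock */
static int  shared_value = 0;   /* data, protected by lock */

static void *waiter(void *arg)
{
    int *out = arg;

    pthread_mutex_lock(&lock);
    while (!ready)                               /* re-check after every wakeup */
        pthread_cond_wait(&ready_cond, &lock);   /* unlocks, sleeps, re-locks */
    *out = shared_value;                         /* mutex is held again here */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Starts one waiter, publishes a value, and returns what the waiter read. */
int wait_signal_demo(void)
{
    pthread_t t;
    int seen = -1;

    int rc = pthread_create(&t, NULL, waiter, &seen);
    assert(rc == 0);

    pthread_mutex_lock(&lock);
    shared_value = 42;
    ready = true;
    pthread_cond_signal(&ready_cond);            /* wake one waiting thread */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return seen;
}
```

Calling `wait_signal_demo()` from `main()` and building with `cc -pthread` should be enough to try it out.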

However, if a thread gets cancelled while waiting on a condition with pthread_cond_wait() (which is a cancellation point), it reacquires the mutex and exits while the mutex is locked. You may think that cancelling a blocked thread would not change any shared state, but it does, and for a good reason. However, if you don't take precautions for this behaviour, the cancelled thread leaves the mutex locked just before saying goodbye, and that causes a deadlock for the rest of the program.
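To see the problem in isolation, here is a small sketch with hypothetical names (`waiter`, `waiting`, `mutex_state_after_cancel`). It cancels a thread that is blocked in pthread_cond_wait(), joins it, and then probes the mutex with pthread_mutex_trylock(). On glibc with a default mutex I would expect the probe to report EBUSY, because the dead thread still owns the lock. Treat that as an observation about common implementations rather than a guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical names, chosen for this sketch only. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int waiting = 0;          /* set by the waiter while it holds lock */

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    waiting = 1;
    for (;;)                                  /* nobody ever signals */
        pthread_cond_wait(&cond, &lock);      /* cancellation point */
    return NULL;                              /* not reached */
}

/* Returns the result of pthread_mutex_trylock() after the waiter is gone. */
int mutex_state_after_cancel(void)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, waiter, NULL);
    assert(rc == 0);

    /* If we can take the lock and see waiting == 1, the waiter must have
     * released the lock inside pthread_cond_wait(), so it is blocked there. */
    for (;;) {
        pthread_mutex_lock(&lock);
        int w = waiting;
        pthread_mutex_unlock(&lock);
        if (w)
            break;
        sched_yield();
    }

    pthread_cancel(t);
    pthread_join(t, NULL);       /* the thread has fully exited after this */

    rc = pthread_mutex_trylock(&lock);
    if (rc == 0)                 /* only if the mutex was actually free */
        pthread_mutex_unlock(&lock);
    return rc;                   /* assumption: EBUSY on glibc */
}
```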

Obviously you need to release the condition's mutex after a thread is cancelled. But how can you know when exactly the thread is cancelled? You can't just add pthread_mutex_unlock() right after pthread_cancel(). pthread_cancel() only requests the cancellation and returns right away, while the target thread acts on the request some time later, so you may end up unlocking a mutex that some other thread has locked. Then the cancelled thread locks it again on its way out and you still have a deadlock.

One way is to make sure the thread is gone by calling pthread_join() right after pthread_cancel(). If the thread is gone for good, the mutex was last locked by it, and you can unlock it from the joining thread. (Strictly speaking, unlocking a default mutex from a thread that doesn't own it is undefined behaviour in POSIX, so this only happens to work.) A better way is to use pthread_cleanup_push() and a cleanup handler.

I thought that pthread_cleanup_push() was a function which added a function pointer (like a hook) to some "stack of cleanup functions". I put it somewhere near the beginning of the thread routine. Then, when I tried to compile, the compiler gave me an error stating that I needed a while statement somewhere far below in the code. It is really interesting to see a compiler suggesting that you should add a while loop at some random place in your code.

When I read the macro definition of pthread_cleanup_push() in /usr/include/pthread.h, I realized it leaves an open curly brace, namely "{". The first thing I thought was "Man, someone forgot to close a brace in the last version of glibc!". I immediately typed "pthread_cleanup_push macro broken" into Google and, as it turned out, the same "mistake" had been made in 2005 and 2009 too! However, after reading the mailing list discussion from 2005, everything became clear.

So here's how to use it: You place the pthread_cleanup_push() macro at the beginning of your thread routine. It wraps everything that follows in a "do {" block. Then you place pthread_cleanup_pop() at the end of the thread routine, which closes the said curly brace and does some other stuff. If you write pthread_cleanup_pop(1) (or some other non-zero number), it removes the cleanup routine and runs it right there. If you give it 0, it just removes the routine without running it and the thread routine goes on to its end. Either way, the routine runs if the thread is cancelled (or calls pthread_exit()) anywhere between the push and the pop. (You can also use them in other parts of the thread routine, but usually placing them at the beginning and the end makes sense.)
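Put together, the cleanup-handler approach might look like the sketch below. All names (`unlock_mutex`, `data_ready`, `waiting`, `mutex_state_with_cleanup`) are hypothetical, and this is not my actual server code. The handler simply unlocks the mutex it is given, and the function at the bottom cancels the waiter and then checks that the mutex can be locked again.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical names, chosen for this sketch only. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;       /* predicate, protected by lock */
static int waiting = 0;          /* set by the waiter while it holds lock */

/* Cleanup handler: runs on cancellation, or at pthread_cleanup_pop(1). */
static void unlock_mutex(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    pthread_cleanup_push(unlock_mutex, &lock);   /* opens a block */
    waiting = 1;
    while (!data_ready)
        pthread_cond_wait(&cond, &lock);         /* cancellation point */
    /* ... read the protected data here ... */
    pthread_cleanup_pop(1);                      /* closes the block, unlocks */
    return NULL;
}

/* Cancels a blocked waiter, then reports whether the mutex is free (0). */
int mutex_state_with_cleanup(void)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, waiter, NULL);
    assert(rc == 0);

    /* Wait until the waiter is blocked inside pthread_cond_wait(). */
    for (;;) {
        pthread_mutex_lock(&lock);
        int w = waiting;
        pthread_mutex_unlock(&lock);
        if (w)
            break;
        sched_yield();
    }

    pthread_cancel(t);
    pthread_join(t, NULL);

    rc = pthread_mutex_trylock(&lock);   /* 0 means the handler unlocked it */
    if (rc == 0)
        pthread_mutex_unlock(&lock);
    return rc;
}
```

Using pthread_cleanup_pop(1) on the normal path means the same handler does the unlocking whether the thread finishes by itself or gets cancelled, so there is only one place where the mutex is released.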

So, you can use the pthread_cleanup macros if you need to cancel some of the threads waiting on a condition while they are waiting. You just need to be careful that pthread_cleanup_push() and pthread_cleanup_pop() are placed in the same block. They open and close curly braces, which means that if you use them in different blocks they'll break your code. (You can still nest them, as long as each push/pop pair stays within one block.)
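To make the brace business concrete, here is a toy pair of macros that imitates only the shape of the real ones. `MY_CLEANUP_PUSH` and `MY_CLEANUP_POP` are invented for this sketch and are not the glibc definitions. They know nothing about cancellation. They only show why a push without its pop leaves the compiler asking for a `while`, and why nested pairs work.

```c
#include <assert.h>

/* Toy stand-ins, NOT the real pthread macros. */
struct my_handler {
    void (*fn)(void *);
    void *arg;
};

/* Opens "do {" and declares a handler local to the new block. */
#define MY_CLEANUP_PUSH(f, a) \
    do { struct my_handler my_h = { (f), (a) };

/* Optionally runs the handler, then closes the block with "} while (0)". */
#define MY_CLEANUP_POP(execute) \
    if (execute) my_h.fn(my_h.arg); } while (0)

static void bump(void *arg)
{
    ++*(int *)arg;               /* count how many handlers actually ran */
}

/* Nests two push/pop pairs and returns the number of handlers that ran. */
int nested_demo(void)
{
    int calls = 0;

    MY_CLEANUP_PUSH(bump, &calls);
        MY_CLEANUP_PUSH(bump, &calls);   /* inner pair in the same block */
        MY_CLEANUP_POP(1);               /* runs the inner handler */
    MY_CLEANUP_POP(0);                   /* drops the outer handler silently */

    return calls;
}
```

Deleting either `MY_CLEANUP_POP` line leaves an unclosed `do {`, which is the same kind of error I got from the real macros.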

I hope this will be helpful for anyone having similar problems. If you came here looking for a solution to these problems, please drop a comment. I want to know if my blog posts actually help anyone :)

By the way, did you know that there is a character named "Curly Brace" in the freeware retro Japanese game 洞窟物語 ("Cave Story" in English)? The main character is named "Quote" (short for quotation mark, I guess). You can tell the game was developed by a programmer geek.
