From: Alan Woodland (alan.woodland_at_gmail_dot_com)
Date: Tue Aug 25 2009 - 04:16:09 PDT
2009/8/25 Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>:
> Alan Woodland wrote:
>> 2009/8/24 Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>:
>>
>> Which aspect of 'signalsafe' is problematic here? The async safe part?
>> (I.e. if we get interrupted by a signal halfway through an atomic op
>> we'd be holding a global lock and deadlock if we called another atomic
>> op from the signal handler?)
>>
>
> This is exactly what we need to work, because the checkpoint request arrives
> via a signal handler and the "interaction" between critical sections and the
> checkpoint requests is via a "red-black lock" implemented via signal-safe
> atomics. Note that the issue is not a "global" lock, but that the common
> case is that the signal handler and the code it interrupts are accessing the
> same atomic variable. That is why a "checkout" based approach using
> test-and-set or load-and-clear to lock even on the granularity of a single
> word is not acceptable.
>
>> Which parts of the library actually need to be async safe? Is it just
>> things which get called from the 'my_handler' functions in cr_cs.c and
>> cr_async.c? (Also what's the problem in blocking signals whilst inside
>> a replacement CAS function? Is it the unblockable signals?)
>>
>
> I think you've listed the right parts. The reason for not blocking signals
> is two-fold
> 1) You can't block the checkpoint signal because BLCR will just unblocking
> from the kernel side
> 2) You wouldn't want to block it if you could, because you may be spinning
> on a change of value that will only occur in a signal handler.

I've thought about it some more and I think there might be a workaround for modern kernels (2.6.22 or later, IIRC) and glibc (2.8 or later?) using signalfd(2). You can get signals delivered via a file descriptor, which I think sidesteps some of the problems of making things async-safe, because the signal is read in ordinary thread context and the rest of your threads remain runnable.
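To make the signalfd idea a bit more concrete, here is a minimal sketch of receiving a signal through a file descriptor in an ordinary thread. It is only an illustration under assumptions: the names signal_thread, demo_signalfd and handled_signo are made up for this example, the caller picks a stand-in signal such as SIGUSR1 (not whatever BLCR actually uses), and nothing here is BLCR code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the work a real handler would do. */
static int handled_signo = 0;

/* Dedicated thread: the signal arrives as ordinary data from read(2),
 * so this code runs in normal thread context, not in a signal handler. */
static void *signal_thread(void *arg)
{
    int sfd = *(int *)arg;
    struct signalfd_siginfo si;

    if (read(sfd, &si, sizeof si) == (ssize_t)sizeof si)
        handled_signo = (int)si.ssi_signo;
    return NULL;
}

/* Sends signo to the process and returns the signal number observed by
 * the signal thread, or -1 on error. */
static int demo_signalfd(int signo)
{
    sigset_t mask;
    pthread_t tid;
    int sfd;

    assert(signo > 0);

    sigemptyset(&mask);
    sigaddset(&mask, signo);

    /* Block the signal before creating any thread so that every thread
     * inherits the mask. A blocked signal stays pending until it is
     * read from the descriptor. */
    if (pthread_sigmask(SIG_BLOCK, &mask, NULL) != 0)
        return -1;

    sfd = signalfd(-1, &mask, 0);
    if (sfd < 0)
        return -1;

    if (pthread_create(&tid, NULL, signal_thread, &sfd) != 0) {
        close(sfd);
        return -1;
    }

    /* Process-directed signal, standing in for an external request. */
    kill(getpid(), signo);

    /* pthread_join also orders the thread's write to handled_signo
     * before the read below. */
    pthread_join(tid, NULL);
    close(sfd);
    return handled_signo;
}
```

Calling demo_signalfd(SIGUSR1) should hand back SIGUSR1 once the thread has read it. Note that the sketch relies on the signal being blocked in every thread, which is exactly the part that clashes with the two objections quoted above, so it only shows the delivery mechanism and not a complete answer.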
So what this would need is two things: firstly, a thread dedicated to handling signals via signalfd, and secondly, a way of ensuring that signals never get delivered to a thread whilst it holds a lock inside an atomic_compare_and_swap replacement function.

Does that sound remotely sensible? It would avoid the deadlock in the signal handler routines because progress could always be made, and, as far as I can see, it would avoid the problems of blocking signals because they wouldn't be totally blocked.

I've not seen anything that would cause problems by running the signal handlers outside of a 'traditional' signal context, or in a dedicated thread. The only problem I can see would be with signals directed at a specific thread rather than the process as a whole, which the signal thread would never get to see. That could be worked around, I think, with a handler that forwarded the signal to the signal handling thread. (In which case "blocking the signals whilst inside the CAS replacement" wouldn't be the right description anymore.)

Alan