From: Eric Roman (eroman_at_lbl.gov)
Date: Tue Sep 24 2002 - 18:22:06 PDT
Just wanted to be a little more detailed on a point I tried to make earlier on the phone. Here's the problem: We cannot (at this point) guarantee that restored threads have the proper parent-child relationships. It is possible that sometimes we get lucky and restore things correctly. Here's an example. The nastiness happens with pthreads calls _after_ the restarted process has come back: Let's look at the piece of code I've been playing with. It creates two threads, let's 'em run, they die, then we create two more threads, those run and the whole application exits. n22001@pts/9:~/% cat crthreadtest.c #include <pthread.h> #include <stdio.h> #include <unistd.h> #include <signal.h> void do_one_thing(int *); void do_another_thing(int *); void do_wrap_up(int, int); void show_blocked(void) { sigset_t blocked; int i; sigprocmask(0, NULL, &blocked); printf("sigblk: "); for (i=0; i < _NSIG; i++) { if (blocked.__val[i / (sizeof(long)*8)] & (1 << (i % (sizeof(long)*8)))) printf(" %d", i+1); } printf("sigblk: "); for (i=0; i < sizeof(blocked.__val[0])/sizeof(*(blocked.__val)); ++i) { printf("[%u]", blocked.__val[i]); } printf("\n"); } void do_one_thing(int *out) { int i; for (i=0; i<=9; ++i) { printf("Hello <%d> from %d: ", i, getpid()); show_blocked(); fflush(stdout); *out = getpid(); sleep(1); } } void do_another_thing(int *out) { int i; for (i=0; i<=9; ++i) { printf("Hello <%d> from %d: ", i, getpid()); show_blocked(); fflush(stdout); *out = getpid(); sleep(1); } } void do_wrap_up(int a, int b) { printf("my pid is %d\n", getpid()); printf("a = %d\n", a); printf("b = %d\n", b); } int r1 = 0; int r2 = 0; int main(void) { pthread_t thread1; pthread_t thread2; printf("Main thread: %d\n", getpid()); pthread_create(&thread1, NULL, (void *) do_one_thing, (void *) &r1); pthread_create(&thread2, NULL, (void *) do_another_thing, (void *) &r2); pthread_join(thread1, NULL); pthread_join(thread2, NULL); do_wrap_up(r1,r2); pthread_create(&thread1, NULL, (void *) do_one_thing, (void *) &r1); pthread_create(&thread2, NULL, (void *) do_another_thing, (void *) &r2); pthread_join(thread1, NULL); pthread_join(thread2, NULL); return 7; } --- Case I --- During normal execution, we expect this code to run like this: n22001@pts/9:~/% ./crthreadtest Main thread: 2609 Hello <0> from 2611: sigblk: 32sigblk: [2147483648] Hello <0> from 2612: sigblk: 32sigblk: [2147483648] Hello <1> from 2611: Hello <1> from 2612: sigblk: sigblk: 32 32sigblk: sigblk : [2147483648][2147483648] Hello <2> from 2611: sigblk: 32sigblk: [2147483648] Hello <2> from 2612: sigblk: 32sigblk: [2147483648] Hello <3> from 2611: sigblk: 32sigblk: [2147483648] Hello <3> from 2612: sigblk: 32sigblk: [2147483648] Hello <4> from 2611: sigblk: 32sigblk: [2147483648] Hello <4> from 2612: sigblk: 32sigblk: [2147483648] Hello <5> from 2611: sigblk: 32sigblk: [2147483648] Hello <5> from 2612: sigblk: 32sigblk: [2147483648] Hello <6> from 2611: sigblk: 32sigblk: [2147483648] Hello <6> from 2612: sigblk: 32sigblk: [2147483648] Hello <7> from 2611: sigblk: 32sigblk: [2147483648] Hello <7> from 2612: sigblk: 32sigblk: [2147483648] Hello <8> from 2611: sigblk: 32sigblk: [2147483648] Hello <8> from 2612: sigblk: 32sigblk: [2147483648] Hello <9> from 2611: sigblk: 32sigblk: [2147483648] Hello <9> from 2612: sigblk: 32sigblk: [2147483648] my pid is 2609 a = 2611 b = 2612 Hello <0> from 2613: sigblk: 32sigblk: [2147483648] Hello <0> from 2614: sigblk: 32sigblk: [2147483648] Hello <1> from 2613: sigblk: 32sigblk: [2147483648] Hello <1> from 2614: sigblk: 32sigblk: [2147483648] Hello <2> from 2613: sigblk: 32sigblk: [2147483648] Hello <2> from 2614: sigblk: 32sigblk: [2147483648] Hello <3> from 2613: sigblk: 32sigblk: [2147483648] Hello <3> from 2614: sigblk: 32sigblk: [2147483648] Hello <4> from 2613: sigblk: 32sigblk: [2147483648] Hello <4> from 2614: sigblk: 32sigblk: [2147483648] Hello <5> from 2613: sigblk: 32sigblk: [2147483648] Hello <5> from 2614: sigblk: 32sigblk: [2147483648] Hello <6> from 2613: sigblk: 32sigblk: [2147483648] Hello <6> from 2614: sigblk: 32sigblk: [2147483648] Hello <7> from 2613: sigblk: 32sigblk: [2147483648] Hello <7> from 2614: sigblk: 32sigblk: [2147483648] Hello <8> from 2613: Hello <8> from 2614: sigblk: sigblk: 32 32sigblk: sigblk: [2147483648][2147483648] Hello <9> from 2613: sigblk: 32sigblk: [2147483648] Hello <9> from 2614: sigblk: 32sigblk: [2147483648] --- Case II --- Here's _incorrect_ output from after a restart: ft33@pts/3:~/% ~/src/checkpoint-docs/src/util/cr_restart/cr_restart Context 4345: Returning from RSTRT_REQ. (err=0). 4345:<RSTRT_REQ> Calling RSTRT_CHILD. 4344: Returning from RSTRT_REQ. (err=4345). 4344: ioctl(/proc/checkpoint/ctrl, CR_IOCTL_RSTRT_DONE): Invalid argument ft33@pts/3:~/% 4346:Returning from RSTRT_CHILD #1. (err=0). 4346:<RSTRT_CHILD> Calling RSTRT_CHILD 4348:Returning from RSTRT_CHILD #1. (err=0). 4348:<RSTRT_CHILD> Calling RSTRT_CHILD 4347:Returning from RSTRT_CHILD #1. (err=0). 4347:<RSTRT_CHILD> Calling RSTRT_CHILD Hello <2> from 4345: sigblk: 32sigblk: [2147483648] Hello <2> from 4347: sigblk: 32sigblk: [2147483648] Hello <3> from 4347: sigblk: 32sigblk: [2147483648] Hello <3> from 4345: sigblk: 32sigblk: [2147483648] Hello <4> from 4345: sigblk: 32sigblk: [2147483648] Hello <4> from 4347: sigblk: 32sigblk: [2147483648] Hello <5> from 4347: sigblk: 32sigblk: [2147483648] Hello <5> from 4345: sigblk: 32sigblk: [2147483648] Hello <6> from 4345: sigblk: 32sigblk: [2147483648] Hello <6> from 4347: sigblk: 32sigblk: [2147483648] Hello <7> from 4347: sigblk: 32sigblk: [2147483648] Hello <7> from 4345: sigblk: 32sigblk: [2147483648] Hello <8> from 4345: sigblk: 32sigblk: [2147483648] Hello <8> from 4347: sigblk: 32sigblk: [2147483648] Hello <9> from 4347: sigblk: 32sigblk: [2147483648] Hello <9> from 4345: sigblk: 32sigblk: [2147483648] my pid is 4346 a = 4347 b = 4345 After that, nothing happens. This thing is in limbo. Notice that some funny things have happened. --- Case III --- Here's _correct_ output from after a restart: ft33@pts/3:~/% ~/src/checkpoint-docs/src/util/cr_restart/cr_restart Context 4352: Returning from RSTRT_REQ. (err=0). 4351: Returning from RSTRT_REQ. (err=4352). 4351: ioctl(/proc/checkpoint/ctrl, CR_IOCTL_RSTRT_DONE): Invalid argument ft33@pts/3:~/% 4352:<RSTRT_REQ> Calling RSTRT_CHILD. 4353:Returning from RSTRT_CHILD #1. (err=0). 4353:<RSTRT_CHILD> Calling RSTRT_CHILD 4355:Returning from RSTRT_CHILD #1. (err=0). 4355:<RSTRT_CHILD> Calling RSTRT_CHILD 4354:Returning from RSTRT_CHILD #1. (err=0). 4354:<RSTRT_CHILD> Calling RSTRT_CHILD Hello <2> from 4354: sigblk: 32sigblk: [2147483648] Hello <2> from 4355: sigblk: 32sigblk: [2147483648] Hello <3> from 4355: sigblk: 32sigblk: [2147483648] Hello <3> from 4354: sigblk: 32sigblk: [2147483648] Hello <4> from 4354: sigblk: 32sigblk: [2147483648] Hello <4> from 4355: sigblk: 32sigblk: [2147483648] Hello <5> from 4355: sigblk: 32sigblk: [2147483648] Hello <5> from 4354: sigblk: 32sigblk: [2147483648] Hello <6> from 4354: sigblk: 32sigblk: [2147483648] Hello <6> from 4355: sigblk: 32sigblk: [2147483648] Hello <7> from 4355: sigblk: 32sigblk: [2147483648] Hello <7> from 4354: sigblk: 32sigblk: [2147483648] Hello <8> from 4354: sigblk: 32sigblk: [2147483648] Hello <8> from 4355: sigblk: 32sigblk: [2147483648] Hello <9> from 4355: sigblk: 32sigblk: [2147483648] Hello <9> from 4354: sigblk: 32sigblk: [2147483648] my pid is 4352 a = 4354 b = 4355 Hello <0> from 4356: sigblk: 32sigblk: [2147483648] Hello <0> from 4357: sigblk: 32sigblk: [2147483648] Hello <1> from 4356: sigblk: 32sigblk: [2147483648] Hello <1> from 4357: sigblk: 32sigblk: [2147483648] Hello <2> from 4356: sigblk: 32sigblk: [2147483648] Hello <2> from 4357: sigblk: 32sigblk: [2147483648] ft33@pts/3:~/% Hello <3> from 4356: sigblk: 32sigblk: [2147483648] Hello <3> from 4357: sigblk: 32sigblk: [2147483648] Hello <4> from 4356: sigblk: 32sigblk: [2147483648] Hello <4> from 4357: sigblk: 32sigblk: [2147483648] Hello <5> from 4356: sigblk: 32sigblk: [2147483648] Hello <5> from 4357: sigblk: 32sigblk: [2147483648] Hello <6> from 4356: sigblk: 32sigblk: [2147483648] Hello <6> from 4357: sigblk: 32sigblk: [2147483648] Hello <7> from 4356: sigblk: 32sigblk: [2147483648] Hello <7> from 4357: sigblk: 32sigblk: [2147483648] Hello <8> from 4356: sigblk: 32sigblk: [2147483648] Hello <8> from 4357: sigblk: 32sigblk: [2147483648] Hello <9> from 4356: sigblk: 32sigblk: [2147483648] Hello <9> from 4357: sigblk: 32sigblk: [2147483648] In case (I) above (no checkpoint), the main thread 2609, is the parent of the pthreads manager 2610, and that in turn is the parent of threads 2611, and 2612. After threads 2611 and 2612 exit, the pthreads manager is the parent of threads 2613 and 2614. So our process tree sort of looks like a stick figure with no arms and big feet: 2609 -- main app thread | 2610 -- pthreads manager / \ 2613 2614 -- other application threads In case (II), things got rather weird. 4344 -- cr_restart | 4345 -- application thread / | \ 4346 | \ -- main application thread 4347 \ -- another application thread 4348 -- the pthreads manager thread Now in case (III), things look like this: 4351 -- cr_restart | 4352 -- main application thread / | \ 4353 | \ -- pthreads manager 4354 \ -- application thread 4355 -- other application thread Case III is the good case. I'm not sure why this one works, and the other case (II) does not. I've noticed that in case II the pthread manager hangs. It's not in the pthread_join, since we've moved past there, but odds are in the pthread_create. My gut feeling about case III is that this is due to the change in task linkage. (Every time one of these works, it has a linkage that looks that way, so there is _some_ relationship) I'm feeling rather confident that fixing the linkage will fix the whole issue. (Suggestions for how the fault plays out in case II are welcome.) So the end story is this: 1/ In no case should cr_restart spit out junk processes, or processes that immediately go away. (It actually does this quite frequently. My biggest problem has been restarting from files with recompiled executables. Usually, these seg fault.) 2/ In some cases, pthreads applications will restart, but new threads cannot be created. (Possible that other pthreads features do not work either, I have not tested these yet. I need to go through and test a semaphore sometime) 3/ You should definitely get output from any of these sequences. - E -- Eric Roman Future Technologies Group 510-486-6420 Lawrence Berkeley National Laboratory