State of multithreaded process restore

From: Eric Roman (eroman_at_lbl.gov)
Date: Tue Sep 24 2002 - 18:22:06 PDT


Just wanted to be a little more detailed on a point I tried to make earlier
on the phone.

Here's the problem:  We cannot (at this point) guarantee that restored threads
have the proper parent-child relationships. 

It is possible that sometimes we get lucky and restore things correctly.  Here's an
example.

The nastiness happens with pthreads calls _after_ the restarted process
has come back: Let's look at the piece of code I've been playing with.
It creates two threads, let's 'em run, they die, then we create two more
threads, those run and the whole application exits.

n22001@pts/9:~/% cat crthreadtest.c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);

void
show_blocked(void)
{
    sigset_t blocked;
    int i;

    sigprocmask(0, NULL, &blocked);
    printf("sigblk: ");
    for (i=0; i < _NSIG; i++) {
        if (blocked.__val[i / (sizeof(long)*8)] & (1 << (i % (sizeof(long)*8))))
            printf(" %d", i+1);
    }
    printf("sigblk: ");
    for (i=0; i < sizeof(blocked.__val[0])/sizeof(*(blocked.__val)); ++i) {
      printf("[%u]", blocked.__val[i]);
    }
    printf("\n");
}

void do_one_thing(int *out) {
  int i;

  for (i=0; i<=9; ++i) {
    printf("Hello <%d> from %d:  ", i, getpid());
    show_blocked();
    fflush(stdout);
    *out = getpid();
    sleep(1);
  }
}

void do_another_thing(int *out) {
  int i;

  for (i=0; i<=9; ++i) {
    printf("Hello <%d> from %d:  ", i, getpid());
    show_blocked();
    fflush(stdout);
    *out = getpid();
    sleep(1);
  }
}

void
do_wrap_up(int a, int b)
{
    printf("my pid is %d\n", getpid());
    printf("a = %d\n", a);
    printf("b = %d\n", b);
}

int r1 = 0;
int r2 = 0;

int main(void)
{
    pthread_t thread1;
    pthread_t thread2;
 
    printf("Main thread: %d\n", getpid());
    
    pthread_create(&thread1, NULL, (void *) do_one_thing, (void *) &r1);
    pthread_create(&thread2, NULL, (void *) do_another_thing, (void *) &r2);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    do_wrap_up(r1,r2);

    pthread_create(&thread1, NULL, (void *) do_one_thing, (void *) &r1);
    pthread_create(&thread2, NULL, (void *) do_another_thing, (void *) &r2);

    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return 7;
}

--- Case I ---
During normal execution, we expect this code to run like this:
n22001@pts/9:~/% ./crthreadtest 
Main thread: 2609
Hello <0> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <0> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <1> from 2611:  Hello <1> from 2612:  sigblk: sigblk:  32 32sigblk: sigblk
: [2147483648][2147483648]

Hello <2> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <2> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <3> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <3> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <4> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <4> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <5> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <5> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <6> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <6> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <7> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <7> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <8> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <8> from 2612:  sigblk:  32sigblk: [2147483648]
Hello <9> from 2611:  sigblk:  32sigblk: [2147483648]
Hello <9> from 2612:  sigblk:  32sigblk: [2147483648]
my pid is 2609
a = 2611
b = 2612
Hello <0> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <0> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <1> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <1> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <2> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <2> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <3> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <3> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <4> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <4> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <5> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <5> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <6> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <6> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <7> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <7> from 2614:  sigblk:  32sigblk: [2147483648]
Hello <8> from 2613:  Hello <8> from 2614:  sigblk: sigblk:  32 32sigblk: sigblk: [2147483648][2147483648]

Hello <9> from 2613:  sigblk:  32sigblk: [2147483648]
Hello <9> from 2614:  sigblk:  32sigblk: [2147483648]

--- Case II ---
Here's _incorrect_ output from after a restart:
ft33@pts/3:~/% ~/src/checkpoint-docs/src/util/cr_restart/cr_restart Context
4345: Returning from RSTRT_REQ. (err=0).  
4345:<RSTRT_REQ> Calling RSTRT_CHILD.
4344: Returning from RSTRT_REQ. (err=4345).  
4344: ioctl(/proc/checkpoint/ctrl, CR_IOCTL_RSTRT_DONE): Invalid argument
ft33@pts/3:~/% 4346:Returning from RSTRT_CHILD #1. (err=0).  
4346:<RSTRT_CHILD> Calling RSTRT_CHILD
4348:Returning from RSTRT_CHILD #1. (err=0).  
4348:<RSTRT_CHILD> Calling RSTRT_CHILD
4347:Returning from RSTRT_CHILD #1. (err=0).  
4347:<RSTRT_CHILD> Calling RSTRT_CHILD
Hello <2> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <2> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <3> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <3> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4345:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4347:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4345:  sigblk:  32sigblk: [2147483648]
my pid is 4346
a = 4347
b = 4345

After that, nothing happens.  This thing is in limbo.  Notice that some funny things
have happened.

--- Case III ---
Here's _correct_ output from after a restart:
ft33@pts/3:~/% ~/src/checkpoint-docs/src/util/cr_restart/cr_restart Context
4352: Returning from RSTRT_REQ. (err=0).  
4351: Returning from RSTRT_REQ. (err=4352).  
4351: ioctl(/proc/checkpoint/ctrl, CR_IOCTL_RSTRT_DONE): Invalid argument
ft33@pts/3:~/% 4352:<RSTRT_REQ> Calling RSTRT_CHILD.
4353:Returning from RSTRT_CHILD #1. (err=0).  
4353:<RSTRT_CHILD> Calling RSTRT_CHILD
4355:Returning from RSTRT_CHILD #1. (err=0).  
4355:<RSTRT_CHILD> Calling RSTRT_CHILD
4354:Returning from RSTRT_CHILD #1. (err=0).  
4354:<RSTRT_CHILD> Calling RSTRT_CHILD
Hello <2> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <2> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <3> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <3> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4354:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4355:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4354:  sigblk:  32sigblk: [2147483648]
my pid is 4352
a = 4354
b = 4355
Hello <0> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <0> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <1> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <1> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <2> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <2> from 4357:  sigblk:  32sigblk: [2147483648]
ft33@pts/3:~/% Hello <3> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <3> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <4> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <5> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <6> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <7> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <8> from 4357:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4356:  sigblk:  32sigblk: [2147483648]
Hello <9> from 4357:  sigblk:  32sigblk: [2147483648]

In case (I) above (no checkpoint), the main thread 2609, is the parent of
the pthreads manager 2610, and that in turn is the parent of threads 2611,
and 2612.  After threads 2611 and 2612 exit, the pthreads manager is the
parent of threads 2613 and 2614.  So our process tree sort of looks like a
stick figure with no arms and big feet:

   2609 -- main app thread
    |
   2610 -- pthreads manager
   / \
2613 2614 -- other application threads

In case (II), things got rather weird.  
     4344 -- cr_restart
      |
     4345 -- application thread
    /  | \
4346   |  \  -- main application thread
     4347  \ -- another application thread
          4348 -- the pthreads manager thread

Now in case (III), things look like this:
     4351 -- cr_restart
      |
     4352 -- main application thread
    /  | \
4353   |  \  -- pthreads manager
     4354  \ -- application thread
          4355 -- other application thread

Case III is the good case.  I'm not sure why this one works, and the other
case (II) does not.  I've noticed that in case II the pthread manager hangs.
It's not in the pthread_join, since we've moved past there, but odds are
in the pthread_create.

My gut feeling about case III is that this is due to the change in task
linkage.  (Every time one of these works, it has a linkage that looks that
way, so there is _some_ relationship) I'm feeling rather confident that fixing
the linkage will fix the whole issue.  (Suggestions for how the fault plays
out in case II are welcome.)

So the end story is this:  

1/  In no case should cr_restart spit out junk processes, or processes that
immediately go away.  

   (It actually does this quite frequently.  My biggest problem has been
   restarting from files with recompiled executables.  Usually, these
   seg fault.)

2/  In some cases, pthreads applications will restart, but new threads
cannot be created.  

    (Possible that other pthreads features do not work either, I have not
    tested these yet.  I need to go through and test a semaphore sometime)

3/  You should definitely get output from any of these sequences.

 - E

-- 
Eric Roman                       Future Technologies Group
510-486-6420                     Lawrence Berkeley National Laboratory