Restart issue with BLCR 0.8.0b5

From: Matthias Hovestadt (matthias.hovestadt_at_tu-berlin.de)
Date: Mon Jan 05 2009 - 08:48:14 PST

  • Next message: Paul H. Hargrove: "Re: Restart issue with BLCR 0.8.0b5"
    Hi!
    
    I'm using BLCR to provide fault tolerance in cluster systems,
    with the resource management system generating checkpoints of
    running jobs. For MPI parallel jobs, I'm using LAM+BLCR and
    OpenMPI+BLCR to provide this fault tolerance.
    
    With LAM+BLCR I now have an issue in restarting LAM-MPI jobs.
    The checkpoint has been generated by the resource management
    system using the cr_checkpoint command:
    
       cr_checkpoint 5631
    
    where 5631 is the PID of the mpirun process. This cr_checkpoint
    command exits with code 0 and generates the following files in
    the user's home directory:
    
    
    -----------------------------------------------------------
    testuser@asok14-5:~$ ls -als context*
    total 3308
        4 drwxr-xr-x  2 testuser testuser    4096 Jan  5 18:18 .
        4 drwxr-xr-x 12 testuser testuser    4096 Jan  5 18:35 ..
      444 -r--------  1 testuser testuser  447767 Jan  5 18:15 context.5631
     1232 -r--------  1 testuser testuser 1255150 Jan  5 18:15 context.5631-n0-5632
     1620 -r--------  1 testuser testuser 1651993 Jan  5 18:15 context.5631-n1-17822
    testuser@asok14-5:~$
    -----------------------------------------------------------
    
    
    If I now try to restart from this checkpoint (regardless of whether
    the checkpoint command was issued manually by the user or
    automatically through the resource management system) using the
    command
    
       cr_restart context.5631
    
    the restart command fails, and these error messages appear in the
    syslog:
    
    Jan  5 18:18:35 asok14-5 kernel: [63441.370972] blcr: rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart
    Jan  5 18:18:35 asok14-5 kernel: [63441.370978] blcr: rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart
    
    
    Surprisingly, this issue only affects jobs that have been started by
    the resource management system. If I start the same job by hand (using
    the same commands on the same cluster nodes), I can checkpoint and
    restart without any problem.
    
    To get more information, I ran the cr_restart command under the
    strace tool. I then compared the output of a working restart
    (checkpoint from a manually started job) with the output of a
    failing restart (checkpoint from a job that was started by the
    resource management system).
    
    Both strace outputs are identical until the "rt_sigaction" lines.
    Then the output differs.
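
A minimal sketch of how such a comparison could be automated, assuming both traces were saved to files (for example with strace's -o option; the file names below are hypothetical). It masks hexadecimal addresses, which differ between any two runs, so that diff reports only the remaining differences. PIDs and ioctl return values are not masked and will still show up.

```shell
# Print a trace with every hexadecimal value replaced by a placeholder.
normalize_trace() {
    sed -E 's/0x[0-9a-fA-F]+/ADDR/g' "$1"
}

# Diff two trace files after masking addresses.
# Returns diff's exit status: 0 if identical, 1 if they differ.
compare_traces() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalize_trace "$1" > "$a"
    normalize_trace "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Hypothetical usage, with traces captured beforehand, e.g.:
#   strace -f -o restart-ok.trace   cr_restart context.<pid>
#   strace -f -o restart-fail.trace cr_restart context.5631
#   compare_traces restart-ok.trace restart-fail.trace
```
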
    
    
    The working restart has the following output:
    
    -----------------------------------------------------------
    .
    .
    .
    rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
    rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
    rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2b578fb61200}, NULL, 8) = 0
    close(4)                                = 0
    select(6, [5], NULL, NULL, NULL..........................................................)        = 1 (in [5])
    ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
    ioctl(5, 0xffffffff8008a127, 0x7fff1b57f860) = 153
    ioctl(5, 0xa122, 0xffffffffffffffff)    = 15084
    close(5)                                = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    .
    .
    .
    -----------------------------------------------------------
    
    
    In case of the failing restart:
    
    -----------------------------------------------------------
    .
    .
    .
    rt_sigaction(SIGRT_29, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
    rt_sigaction(SIGRT_30, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
    rt_sigaction(SIGRT_31, {0x402660, [], SA_RESTORER|SA_RESTART|SA_NOMASK|SA_SIGINFO, 0x2afb0b016200}, NULL, 8) = 0
    close(4)                                = 0
    select(6, [5], NULL, NULL, NULL)        = ? ERESTARTNOHAND (To be restarted)
    --- SIGCHLD (Child exited) @ 0 (0) ---
    select(6, [5], NULL, NULL, NULL)        = 1 (in [5])
    ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
    ioctl(5, 0xffffffff8008a127, 0x7fffa00c8390) = 149
    ioctl(5, 0xa122, 0xffffffffffffffff)    = 5631
    close(5)                                = 0
    write(2, "- ", 2- )                       = 2
    write(2, "rstrt_watchdog: tgid/pid 5631/56"..., 67rstrt_watchdog: tgid/pid 5631/5631 exec()ed 'mpirun' during restart) = 67
    write(2, "\n", 1
    )                       = 1
    write(2, "- ", 2- )                       = 2
    write(2, "rstrt_watchdog: \'mpirun\' (tgid/p"..., 79rstrt_watchdog: 'mpirun' (tgid/pid 5631/5633) exited with code 0 during restart) = 79
    write(2, "\n", 1
    )                       = 1
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    wait4(5631, [{WIFEXITED(s) && WEXITSTATUS(s) == 215}], __WCLONE|__WALL, NULL) = 5631
    exit_group(215)                         = ?
    Process 7482 detached
    testuser@asok14-5:~$
    -----------------------------------------------------------
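
The failing trace ends in exit_group(215), so the failure is visible in the exit status of cr_restart itself. As a rough sketch, a wrapper like the following could report that status when the restart is launched from a script. The function name run_and_report is made up for illustration and is not part of BLCR.

```shell
# Run the given command; if it fails, report its exit status on stderr.
# The status is passed through unchanged to the caller.
run_and_report() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "command '$*' failed with exit status $status" >&2
    fi
    return "$status"
}

# Hypothetical usage, with the context file from the listing above:
#   run_and_report cr_restart context.5631
```
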
    
    
    Does anybody have an idea what the problem might be? Or is there
    a way to increase the debug level to get more verbose log
    output?
    
    
    Best,
    Matthias
    
