BLCR problem on RHAS

From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 30 2005 - 11:58:39 PST

    Guys --
    
    We're trying to get BLCR going on one of the big Linux clusters here at 
    IU and are running into problems -- even with non-MPI apps.  Here are 
    the details:
    
    - Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST 
    2005 i686 i686 i386 GNU/Linux
    - Dual Xeon 2.4GHz hardware
    - BLCR v0.4.0
    - blcr and vmadump_blcr kernel modules are successfully loaded
    
    Here is a simple non-MPI app:
    
    -----
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>   /* getpid(), sleep() */
    
    #define NUM 20
    
    int main(int argc, char **argv)
    {
       int i;
    
       printf("Hello -- I am pid %d\n", getpid());
       fflush(stdout);
       for (i = 0; i < NUM; ++i) {
           printf("Sleeping... %d of %d\n", i + 1, NUM);
           fflush(stdout);
           sleep(1);
       }
    
       return 0;
    }
    -----
    
    When I run that app via "cr_run" and then, in a different window, 
    cr_checkpoint the PID of that process, the app segfaults and dumps 
    core:
    
    -----
    [14:54] bc02:~/mpi % cr_run ./non-mpi
    Hello -- I am pid 26329
    Sleeping... 1 of 20
    Sleeping... 2 of 20
    Sleeping... 3 of 20
    Sleeping... 4 of 20
    Sleeping... 5 of 20
    Sleeping... 6 of 20
    Sleeping... 7 of 20
    Segmentation fault (core dumped)
    [14:54] bc02:~/mpi %
    -----
    
    cr_checkpoint seems to complete normally -- it has an exit status of 0.
    Here's the backtrace from the core file that the app drops -- it seems 
    to be in the BLCR signal callback handler:
    
    -----
    #0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14)
         at cr_core.c:269
    #1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
    -----
    
    The line in question (cr_core.c:269) is the return from 
    cri_sig_handler().
    
    Any ideas what's going on here?  Are there any known issues with RHAS 
    kernels, or this particular kernel?
    
    Thanks!
    
    -- 
    {+} Jeff Squyres
    {+} jsquyres@lam-mpi.org
    {+} http://www.lam-mpi.org/
    
