Re: BLCR problem on RHAS

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 12:48:58 PST

  • Next message: Paul H. Hargrove: "Re: multiple checkpoints"
    We have no known issues with this kernel/distro, but no positive 
    results have been reported either.  The args shown for cri_sig_handler() 
    make no sense, so I suspect the stack has been corrupted prior to this 
    point, leading to the SEGV when a bogus return address is popped.
    
    Jeff, could you please go to http://mantis.lbl.gov/bugzilla and enter a 
    bug report for this.
    
    -Paul
    
    Jeff Squyres wrote:
    
    > Guys --
    >
    > We're trying to get BLCR going on one of the big linux clusters here 
    > at IU and are running into problems -- even with non-MPI apps.  Here 
    > are the details:
    >
    > - Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST 
    > 2005 i686 i686 i386 GNU/Linux
    > - Dual Xeon 2.4GHz hardware
    > - BLCR v0.4.0
    > - blcr and vmadump_blcr kernel modules are successfully loaded
    >
    > If I have a simple non-MPI app:
    >
    > -----
    > #include <stdio.h>
    > #include <stdlib.h>
    >
    > #define NUM 20
    >
    > int main(int argc, char **argv)
    > {
    >   int i;
    >
    >   printf("Hello -- I am pid %d\n", getpid());
    >   fflush(stdout);
    >   for (i = 0; i < NUM; ++i) {
    >       printf("Sleeping... %d of %d\n", i + 1, NUM);
    >       fflush(stdout);
    >       sleep(1);
    >   }
    >
    >   return 0;
    > }
    > -----
    >
    > When I run that app via "cr_run" and then, in a different window, 
    > cr_checkpoint the PID of that process, the app seg faults and dumps core:
    >
    > -----
    > [14:54] bc02:~/mpi % cr_run ./non-mpi
    > Hello -- I am pid 26329
    > Sleeping... 1 of 20
    > Sleeping... 2 of 20
    > Sleeping... 3 of 20
    > Sleeping... 4 of 20
    > Sleeping... 5 of 20
    > Sleeping... 6 of 20
    > Sleeping... 7 of 20
    > Segmentation fault (core dumped)
    > [14:54] bc02:~/mpi %
    > -----
    >
    > cr_checkpoint seems to complete normally -- it has an exit status of 
    > 0.  Here's the bt from the corefile that the app drops -- it seems to 
    > be in the BLCR signal callback handler:
    >
    > -----
    > #0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14)
    >     at cr_core.c:269
    > #1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
    > -----
    >
    > The line in question is returning from the function cri_sig_handler().
    >
    > Any ideas what's going on here?  Are there any known issues with RHAS 
    > kernels, or this particular kernel?
    >
    > Thanks!
    >
    
