From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 12:48:58 PST
We don't have any known issues with this kernel/distro, or any positive
results reported either. The args shown for cri_sig_handler() make no
sense, so I suspect the stack has been corrupted prior to this point,
leading to the SEGV when a bogus return address is popped.

Jeff, could you please go to http://mantis.lbl.gov/bugzilla and enter a
bug report for this.

-Paul

Jeff Squyres wrote:
> Guys --
>
> We're trying to get BLCR going on one of the big linux clusters here
> at IU and are running into problems -- even with non-MPI apps. Here's
> the details:
>
> - Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST
>   2005 i686 i686 i386 GNU/Linux
> - Dual Xeon 2.4GHz hardware
> - BLCR v0.4.0
> - blcr and vmadump_blcr kernel modules are successfully loaded
>
> If I have a simple non-MPI app:
>
> -----
> #include <stdio.h>
> #include <stdlib.h>
>
> #define NUM 20
>
> int main(int argc, char **argv)
> {
>     int i;
>
>     printf("Hello -- I am pid %d\n", getpid());
>     fflush(stdout);
>     for (i = 0; i < NUM; ++i) {
>         printf("Sleeping... %d of %d\n", i + 1, NUM);
>         fflush(stdout);
>         sleep(1);
>     }
>
>     return 0;
> }
> -----
>
> I run that app via "cr_run", and then in a different window, I
> cr_checkpoint the PID of that process, the app seg faults and core dumps:
>
> ------
> [14:54] bc02:~/mpi % cr_run ./non-mpi
> Hello -- I am pid 26329
> Sleeping... 1 of 20
> Sleeping... 2 of 20
> Sleeping... 3 of 20
> Sleeping... 4 of 20
> Sleeping... 5 of 20
> Sleeping... 6 of 20
> Sleeping... 7 of 20
> Segmentation fault (core dumped)
> [14:54] bc02:~/mpi %
> -----
>
> cr_checkpoint seems to complete normally -- it has an exit status of
> 0. Here's the bt from the corefile that the app drops -- it seems to
> be in the BLCR signal callback handler:
>
> -----
> #0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14)
>     at cr_core.c:269
> #1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
> -----
>
> The line in question is returning from the function cri_sig_handler().
>
> Any ideas what's going on here? Are there any known issues with RHAS
> kernels, or this particular kernel?
>
> Thanks!
>