From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 30 2005 - 11:58:39 PST
Guys --

We're trying to get BLCR going on one of the big linux clusters here at IU and are running into problems -- even with non-MPI apps. Here's the details:

- Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST 2005 i686 i686 i386 GNU/Linux
- Dual Xeon 2.4GHz hardware
- BLCR v0.4.0
- blcr and vmadump_blcr kernel modules are successfully loaded

If I have a simple non-MPI app:

-----
#include <stdio.h>
#include <stdlib.h>

#define NUM 20

int main(int argc, char **argv)
{
    int i;

    printf("Hello -- I am pid %d\n", getpid());
    fflush(stdout);
    for (i = 0; i < NUM; ++i) {
        printf("Sleeping... %d of %d\n", i + 1, NUM);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}
-----

I run that app via "cr_run", and then in a different window, I cr_checkpoint the PID of that process, the app seg faults and core dumps:

-----
[14:54] bc02:~/mpi % cr_run ./non-mpi
Hello -- I am pid 26329
Sleeping... 1 of 20
Sleeping... 2 of 20
Sleeping... 3 of 20
Sleeping... 4 of 20
Sleeping... 5 of 20
Sleeping... 6 of 20
Sleeping... 7 of 20
Segmentation fault (core dumped)
[14:54] bc02:~/mpi %
-----

cr_checkpoint seems to complete normally -- it has an exit status of 0.

Here's the bt from the corefile that the app drops -- it seems to be in the BLCR signal callback handler:

-----
#0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14) at cr_core.c:269
#1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
-----

The line in question is returning from the function cri_sig_handler().

Any ideas what's going on here? Are there any known issues with RHAS kernels, or this particular kernel?

Thanks!

--
{+} Jeff Squyres
{+} [email protected]
{+} http://www.lam-mpi.org/