BLCR problem on RHAS


From: Jeff Squyres (jsquyres_at_lam-mpi.org)
Date: Wed Mar 30 2005 - 11:58:39 PST

Guys --

We're trying to get BLCR going on one of the big linux clusters here at 
IU and are running into problems -- even with non-MPI apps.  Here are the 
details:

- Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST 
2005 i686 i686 i386 GNU/Linux
- Dual Xeon 2.4GHz hardware
- BLCR v0.4.0
- blcr and vmadump_blcr kernel modules are successfully loaded

Here is a simple non-MPI app:

-----
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* getpid(), sleep() */

#define NUM 20

int main(int argc, char **argv)
{
   int i;

   printf("Hello -- I am pid %d\n", getpid());
   fflush(stdout);
   for (i = 0; i < NUM; ++i) {
       printf("Sleeping... %d of %d\n", i + 1, NUM);
       fflush(stdout);
       sleep(1);
   }

   return 0;
}
-----

When I run that app via "cr_run" and then, in a different window, 
cr_checkpoint the PID of that process, the app seg faults and dumps 
core:

-----
[14:54] bc02:~/mpi % cr_run ./non-mpi
Hello -- I am pid 26329
Sleeping... 1 of 20
Sleeping... 2 of 20
Sleeping... 3 of 20
Sleeping... 4 of 20
Sleeping... 5 of 20
Sleeping... 6 of 20
Sleeping... 7 of 20
Segmentation fault (core dumped)
[14:54] bc02:~/mpi %
-----
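
For anyone scripting this reproduction: rather than parsing the shell's "Segmentation fault" message, a wrapper can ask waitpid() how the process ended. This is only a minimal sketch in plain POSIX C, not part of BLCR; run_child(), classify(), exit_cleanly() and die_by_segv() are made-up names for illustration, and the children here stand in for the real app.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, run body() in the child, and return the
 * raw wait status of the child (or -1 if fork/waitpid failed). */
static int run_child(void (*body)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        body();
        _exit(0);               /* child: clean exit if body() returns */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return status;
}

/* Stand-in for a process that finishes normally. */
static void exit_cleanly(void)
{
}

/* Stand-in for a process that is killed by SIGSEGV. */
static void die_by_segv(void)
{
    signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
    raise(SIGSEGV);
}

/* Returns 0 for a clean exit with code 0, the signal number if the
 * process was killed by a signal, and -1 for anything else. */
static int classify(int status)
{
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return -1;
}
```

With this, classify(run_child(die_by_segv)) yields SIGSEGV and classify(run_child(exit_cleanly)) yields 0, which is the same distinction as between the app's death and a normal exit.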

cr_checkpoint seems to complete normally -- it has an exit status of 0.  
Here's the backtrace from the corefile that the app drops -- it seems to 
be in the BLCR signal callback handler:

-----
#0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14)
     at cr_core.c:269
#1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
-----

The line in question is returning from the function cri_sig_handler().
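
For comparison, a handler installed with sigaction() and SA_SIGINFO takes the same three arguments that gdb shows for cri_sig_handler(). The sketch below is plain POSIX C, not BLCR code: demo_handler() and deliver_and_record() are made-up names, and it is only an assumption that BLCR installs its handler this way. It records what a healthy handler sees, namely a nonzero signal number and a non-NULL siginfo pointer. The signr=0, siginfo=0x7, context=0x14 values in frame #0 do not look like that, which may mean the stack is corrupted, or simply that gdb misreads the arguments at the return instruction.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Values recorded by the handler; sig_atomic_t so that writing them
 * from a signal handler is safe. */
static volatile sig_atomic_t seen_signo = 0;
static volatile sig_atomic_t seen_siginfo_ok = 0;

/* Three-argument SA_SIGINFO handler, the same shape as frame #0. */
static void demo_handler(int signr, siginfo_t *siginfo, void *context)
{
    (void) context;             /* unused in this sketch */
    seen_signo = signr;
    seen_siginfo_ok = (siginfo != NULL && siginfo->si_signo == signr);
}

/* Install demo_handler for sig, send sig to ourselves, and return the
 * signal number the handler saw (or -1 on failure). */
static int deliver_and_record(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = demo_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    if (sigaction(sig, &sa, NULL) != 0) {
        return -1;
    }
    if (raise(sig) != 0) {      /* handler runs before raise() returns */
        return -1;
    }
    return seen_signo;
}
```

Calling deliver_and_record(SIGUSR1) returns SIGUSR1 and leaves seen_siginfo_ok set, i.e. the handler's arguments are sane on entry in the normal case.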

Any ideas what's going on here?  Are there any known issues with RHAS 
kernels, or this particular kernel?

Thanks!

-- 
{+} Jeff Squyres
{+} jsquyres_at_lam-mpi.org
{+} http://www.lam-mpi.org/
