Re: BLCR problem on RHAS

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 12:48:58 PST

Next message: Paul H. Hargrove: "Re: multiple checkpoints"

Previous message: Richard Hu: "multiple checkpoints"
In reply to: Jeff Squyres: "BLCR problem on RHAS"

We have no known issues with this kernel/distro, but no reports of 
success on it either.  The args shown for cri_sig_handler() make no 
sense, so I suspect the stack was corrupted before this point, and the 
SEGV occurs when a bogus return address is popped.

Jeff, could you please go to http://mantis.lbl.gov/bugzilla and file a 
bug report for this?

-Paul

Jeff Squyres wrote:

> Guys --
>
> We're trying to get BLCR going on one of the big Linux clusters here 
> at IU and are running into problems -- even with non-MPI apps.  Here 
> are the details:
>
> - Linux bc02 2.4.21-27.0.2.EL.050316 #2 SMP Wed Mar 16 14:17:19 EST 
> 2005 i686 i686 i386 GNU/Linux
> - Dual Xeon 2.4GHz hardware
> - BLCR v0.4.0
> - blcr and vmadump_blcr kernel modules are successfully loaded
>
> If I have a simple non-MPI app:
>
> -----
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>   /* getpid(), sleep() */
> #define NUM 20
>
> int main(int argc, char **argv)
> {
>   int i;
>
>   printf("Hello -- I am pid %d\n", getpid());
>   fflush(stdout);
>   for (i = 0; i < NUM; ++i) {
>       printf("Sleeping... %d of %d\n", i + 1, NUM);
>       fflush(stdout);
>       sleep(1);
>   }
>
>   return 0;
> }
> -----
>
> If I run that app via "cr_run" and then, in a different window, 
> cr_checkpoint the PID of that process, the app segfaults and dumps core:
>
> -----
> [14:54] bc02:~/mpi % cr_run ./non-mpi
> Hello -- I am pid 26329
> Sleeping... 1 of 20
> Sleeping... 2 of 20
> Sleeping... 3 of 20
> Sleeping... 4 of 20
> Sleeping... 5 of 20
> Sleeping... 6 of 20
> Sleeping... 7 of 20
> Segmentation fault (core dumped)
> [14:54] bc02:~/mpi %
> -----
>
> cr_checkpoint seems to complete normally -- it has an exit status of 
> 0.  Here's the bt from the corefile that the app drops -- it seems to 
> be in the BLCR signal callback handler:
>
> -----
> #0  0x0048edff in cri_sig_handler (signr=0, siginfo=0x7, context=0x14)
>     at cr_core.c:269
> #1  0x08048486 in main (argc=1, argv=0xbfffae04) at non-mpi.c:16
> -----
>
> The line in question is returning from the function cri_sig_handler().
>
> Any ideas what's going on here?  Are there any known issues with RHAS 
> kernels, or this particular kernel?
>
> Thanks!
>
