blcr error with "Real-time signal 31"

From: tingyu (tz9_at_msstate.edu)
Date: Fri Jan 02 2004 - 10:00:42 PST


I just installed blcr with lam-mpi 7.1a1, and blcr works great with
non-MPI programs. But when I tried to use it to checkpoint an MPI
program, the cr_checkpoint command halted and I got the error
"Real-time signal 31" on the console where the checkpointed MPI
program was running.

Here is what I did. I have a hello program as below:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>    /* getpid(), sleep() */
#include <mpi.h>
#include <sys/types.h>

int
main(int argc, char **argv)
{
    int rank, size;
    int i;

    /* Start up MPI */

    MPI_Init(&argc, &argv);

    /* Get some info about MPI */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Print out the canonical "hello world" message */
    /* This is only the pid on the individual node, not the pid of mpirun */
    printf("this is pid %d\n", (int) getpid());

    printf("Hello, world! I am %d of %d\n", rank, size);
    /* Sleep for about 100 seconds to keep the process running */
    for (i = 0; i < 100; i++)
    {
        sleep(1);
    }
    /* All done */

    MPI_Finalize();
    return 0;
}

Then I start it on one console with "mpirun -np 2 ./hello". On another
console I "su" to root and run "cr_checkpoint pid-of-mpirun", and then
the error happens.


Another question: is it possible to use cr_checkpoint to checkpoint
only some of the processes in an MPI program, rather than the whole
MPI program?

Thanks!


Tingyu