From: tingyu (tz9_at_msstate.edu)
Date: Fri Jan 02 2004 - 10:00:42 PST
I just installed BLCR with LAM/MPI 7.1a1, and BLCR works great with non-MPI programs. But when I tried to use it to checkpoint an MPI program, the cr_checkpoint command halted and the checkpointed MPI program reported the error "Real-time signal 31".

What I have done is this. I have a hello program as below:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>     /* for getpid() and sleep() */
    #include <mpi.h>
    #include <sys/types.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        int i;

        /* Start up MPI */
        MPI_Init(&argc, &argv);

        /* Get some info about MPI */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Print out the canonical "hello world" message */
        /* This is only the pid on the individual node, not the pid of mpirun */
        printf("this is pid %d\n", (int) getpid());
        printf("Hello, world! I am %d of %d\n", rank, size);

        for (i = 0; i < 100; i++) {
            sleep(1);
        }

        /* All done */
        MPI_Finalize();
        return 0;
    }

Then I start it on one console with "mpirun -np 2 ./hello", "su" to root on another console, and run "cr_checkpoint pid-of-mpirun". That is when the error happens.

Another question: is it possible to use cr_checkpoint to checkpoint only some of the processes in an MPI program, rather than the whole MPI program?

Thanks!
Tingyu