From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Mar 26 2007 - 09:58:55 PST
Yuan,

I've not encountered this problem before. It looks as if something is
triggering a LAM-internal error message. It is possible that this is a
result of a BLCR problem, or it could be a LAM/MPI problem. If the
problem *is* in BLCR, then there is not enough information here to try
to find it.

I see that you have also asked on the LAM/MPI mailing list, and that
Josh Hursey made a suggestion there. I am monitoring that thread and
will make any BLCR-specific comments if I can. However, at this point I
don't have any ideas beyond Josh's suggestion to explicitly set the rpi
module to crtcp.
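For reference, that suggestion would look something like the following,
re-using your 2-process run of "rotating" (this is only a sketch of the
idea; the exact form of Josh's suggestion may differ):

  # a sketch: pin the rpi module to crtcp at run time, not necessarily Josh's exact command
  $ mpirun -np 2 -ssi rpi crtcp -ssi cr blcr ./rotating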
-Paul

Yuan Wan wrote:
>
> Hi all,
>
> I got some problems when checkpointing LAM/MPI code using BLCR.
>
> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
> I have built blcr-0.5.0 and it works well with serial codes.
>
> I built LAM/MPI 7.1.2:
> ---------------------------------------------
> $ ./configure --prefix=/home/pst/lam \
>               --with-rsh="ssh -x" \
>               --with-cr-blcr=/home/pst/blcr
> $ make
> $ make install
> ---------------------------------------------
>
> The laminfo output is:
> -----------------------------------------------------
>            LAM/MPI: 7.1.2
>             Prefix: /home/pst/lam
>       Architecture: i686-pc-linux-gnu
>      Configured by: pst
>      Configured on: Sat Mar 24 00:40:42 GMT 2007
>     Configure host: master00
>     Memory manager: ptmalloc2
>         C bindings: yes
>       C++ bindings: yes
>   Fortran bindings: yes
>         C compiler: gcc
>       C++ compiler: g++
>   Fortran compiler: g77
>    Fortran symbols: double_underscore
>        C profiling: yes
>      C++ profiling: yes
>  Fortran profiling: yes
>     C++ exceptions: no
>     Thread support: yes
>      ROMIO support: yes
>       IMPI support: no
>      Debug support: no
>       Purify clean: no
>           SSI boot: globus (API v1.1, Module v0.6)
>           SSI boot: rsh (API v1.1, Module v1.1)
>           SSI boot: slurm (API v1.1, Module v1.0)
>           SSI coll: lam_basic (API v1.1, Module v7.1)
>           SSI coll: shmem (API v1.1, Module v1.0)
>           SSI coll: smp (API v1.1, Module v1.2)
>            SSI rpi: crtcp (API v1.1, Module v1.1)
>            SSI rpi: lamd (API v1.0, Module v7.1)
>            SSI rpi: sysv (API v1.0, Module v7.1)
>            SSI rpi: tcp (API v1.0, Module v7.1)
>            SSI rpi: usysv (API v1.0, Module v7.1)
>             SSI cr: blcr (API v1.0, Module v1.1)
>             SSI cr: self (API v1.0, Module v1.0)
> --------------------------------------------------------
>
> My parallel code works well with LAM without any checkpointing:
> $ mpirun -np 2 ./job
>
> Then I run my parallel job in a checkpointable way:
> $ mpirun -np 2 -ssi cr blcr ./rotating
>
> And checkpoint this job in another window:
> $ lamcheckpoint -ssi cr blcr -pid 11928
>
> This operation produces a context file for mpirun,
>
> "context.mpirun.11928"
>
> plus two context files for the job:
>
> "context.11928-n0-11929"
> "context.11928-n0-11930"
>
> Seems so far so good :)
> -------------------------------------------------------
>
> However, when I restart the job with the context file:
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>
> I got the following error:
>
> Results CORRECT on rank 0   ["This line is the output of the code"]
>
> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
> Rank (0, MPI_COMM_WORLD):  - main()
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>
> Anyone met this problem before and know how to solve it?
>
> Many Thanks
>
> --Yuan
>
> Yuan Wan

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900