From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Tue Mar 27 2007 - 00:49:55 PST
On Mon, 26 Mar 2007, Paul H. Hargrove wrote:

Hi Paul,

Thanks for your reply.

I have tried to explicitly use the "crtcp" module, but it caused a failure
on checkpoint:

$ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
$ lamcheckpoint -ssi cr blcr -pid 17256
-----------------------------------------------------------------------------
Encountered a failure in the SSI types while continuing from checkpoint.

Aborting in despair :-(
-----------------------------------------------------------------------------

And the code never exits after it reaches the end. I checked the 'ps'
list and found there are two 'mpirun' and three 'checkpoint' processes
running:

---------------------------------------
17255 ?        00:00:00 lamd
17256 pts/2    00:00:00 mpirun
17257 ?        00:00:15 rotating
17258 ?        00:00:15 rotating
17263 pts/3    00:00:00 lamcheckpoint
17264 pts/3    00:00:00 cr_checkpoint
17265 pts/2    00:00:00 mpirun
17266 ?        00:00:00 cr_checkpoint
17267 ?        00:00:00 cr_checkpoint
---------------------------------------

--Yuan

> Yuan,
>
> I've not encountered this problem before.  It looks as if something is
> triggering a LAM-internal error message.  It is possible that this is a
> result of a BLCR problem, or it could be a LAM/MPI problem.  If the
> problem *is* in BLCR, then there is not enough information here to try to
> find it.  I see that you have also asked on the LAM/MPI mailing list, and
> that Josh Hursey made a suggestion there.  I am monitoring that thread and
> will make any BLCR-specific comments if I can.  However, at this point I
> don't have any ideas beyond Josh's suggestion to explicitly set the rpi
> module to crtcp.
>
> -Paul
>
> Yuan Wan wrote:
>>
>> Hi all,
>>
>> I got a problem when checkpointing LAM/MPI code using BLCR.
>>
>> My platform is a 2-cpu machine running Fedora Core 6 (kernel 2.6.19).
>> I have built blcr-0.5.0 and it works well with serial codes.
>>
>> I built LAM/MPI 7.1.2:
>> ---------------------------------------------
>> $ ./configure --prefix=/home/pst/lam
>>               --with-rsh="ssh -x"
>>               --with-cr-blcr=/home/pst/blcr
>> $ make
>> $ make install
>> ---------------------------------------------
>>
>> The laminfo output is:
>> -----------------------------------------------------
>>            LAM/MPI: 7.1.2
>>             Prefix: /home/pst/lam
>>       Architecture: i686-pc-linux-gnu
>>      Configured by: pst
>>      Configured on: Sat Mar 24 00:40:42 GMT 2007
>>     Configure host: master00
>>     Memory manager: ptmalloc2
>>         C bindings: yes
>>       C++ bindings: yes
>>   Fortran bindings: yes
>>         C compiler: gcc
>>       C++ compiler: g++
>>   Fortran compiler: g77
>>    Fortran symbols: double_underscore
>>        C profiling: yes
>>      C++ profiling: yes
>>  Fortran profiling: yes
>>     C++ exceptions: no
>>     Thread support: yes
>>      ROMIO support: yes
>>       IMPI support: no
>>      Debug support: no
>>       Purify clean: no
>>           SSI boot: globus (API v1.1, Module v0.6)
>>           SSI boot: rsh (API v1.1, Module v1.1)
>>           SSI boot: slurm (API v1.1, Module v1.0)
>>           SSI coll: lam_basic (API v1.1, Module v7.1)
>>           SSI coll: shmem (API v1.1, Module v1.0)
>>           SSI coll: smp (API v1.1, Module v1.2)
>>            SSI rpi: crtcp (API v1.1, Module v1.1)
>>            SSI rpi: lamd (API v1.0, Module v7.1)
>>            SSI rpi: sysv (API v1.0, Module v7.1)
>>            SSI rpi: tcp (API v1.0, Module v7.1)
>>            SSI rpi: usysv (API v1.0, Module v7.1)
>>             SSI cr: blcr (API v1.0, Module v1.1)
>>             SSI cr: self (API v1.0, Module v1.0)
>> --------------------------------------------------------
>>
>>
>> My parallel code works well with LAM without any checkpointing:
>> $ mpirun -np 2 ./job
>>
>> Then I run my parallel job in a checkpointable way:
>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>
>> And checkpoint this job in another window:
>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>
>> This operation produces a context file for mpirun
>>
>>   "context.mpirun.11928"
>>
>> plus two context files for the job
>>
>>   "context.11928-n0-11929"
>>   "context.11928-n0-11930"
>>
>> Seems so far so good :)
>> -------------------------------------------------------
>>
>> However, when I restart the job with the context file:
>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>
>> I get the following error:
>>
>> Results CORRECT on rank 0        ["This line is output from the code"]
>>
>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
>> MPI_COMM_WORLD)
>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
>> Rank (0, MPI_COMM_WORLD):  - main()
>>
>> -----------------------------------------------------------------------------
>> It seems that [at least] one of the processes that was started with
>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>> more than one process did not invoke MPI_INIT -- mpirun was only
>> notified of the first one, which was on node n0).
>>
>> mpirun can *only* be used with MPI programs (i.e., programs that
>> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
>> to run non-MPI programs over the lambooted nodes.
>> -----------------------------------------------------------------------------
>>
>> Has anyone met this problem before and knows how to solve it?
>>
>> Many thanks,
>>
>> --Yuan
>>
>>
>> Yuan Wan

--
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: [email protected]
2032 Computing Services, JCMB
The King's Buildings, Edinburgh, EH9 3JZ
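The source of the "rotating" job does not appear in this thread. For anyone
trying to reproduce the report, a minimal two-rank test case along the same
lines might look like the sketch below; the file name "ring.c", the loop
length, and the printed message are illustrative assumptions, not Yuan's
actual code. It simply rotates a token between the ranks for a while (long
enough to issue lamcheckpoint from another window) and then calls
MPI_Finalize, the call that fails after lamrestart above.

-----------------------------------------------------------------------------
/* ring.c -- hypothetical minimal checkpointable MPI test case */
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    token = rank;

    /* Run long enough that lamcheckpoint can be issued from another window. */
    for (i = 0; i < 60; i++) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        /* Rotate the token one step around the ring each iteration. */
        MPI_Sendrecv_replace(&token, 1, MPI_INT, next, 0, prev, 0,
                             MPI_COMM_WORLD, &status);
        sleep(1);
    }

    if (rank == 0)
        printf("Results CORRECT on rank 0 (token = %d)\n", token);

    /* The call reported to abort with "internal MPI error" after restart. */
    MPI_Finalize();
    return 0;
}
-----------------------------------------------------------------------------

Built with LAM's mpicc and started with the same commands as above
($ mpirun -np 2 -ssi cr blcr ./ring, then $ lamcheckpoint -ssi cr blcr
-pid <pid of mpirun> from a second window), this should exercise the same
checkpoint/restart path as the reported job.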