From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 27 2007 - 11:39:03 PST
Yuan,

I've certainly not seen anything like that before. The fact that the error message changed after adding "-ssi rpi crtcp" suggests to me that Josh was on the right track. However, the new failure mode looks even more ominous. My best guess would be that something changed in either BLCR or FC6 that has broken the assumptions being made by the crtcp rpi module in LAM/MPI. I don't currently have a system on which to test LAM/MPI+BLCR, so I can't verify this. Depending on what has broken, the fix might belong in either LAM/MPI or BLCR. I am afraid I probably won't have any chance to look at this in detail for a couple of weeks at least.

Not sure about the 2 mpirun instances, but I would guess that one of them might be internal to the lamcheckpoint operation. Passing an option such as "-f" or "-l" to ps would give the parent id (PPID) and make it clear who/what started the 2nd mpirun.

As for the 3 cr_checkpoint instances, they correspond to the 3 context files you would eventually get: one for the mpirun and one for each of the two "rotating" processes.

-Paul

Yuan Wan wrote:
> On Mon, 26 Mar 2007, Paul H. Hargrove wrote:
>
> Hi Paul,
>
> Thanks for your reply.
>
> I have tried to explicitly use the "crtcp" module, but it caused a
> failure on checkpoint:
>
> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
> $ lamcheckpoint -ssi cr blcr -pid 17256
> -----------------------------------------------------------------------------
> Encountered a failure in the SSI types while continuing from
> checkpoint. Aborting in despair :-(
> -----------------------------------------------------------------------------
>
> And the code never exits after it reaches the end.
> I checked the 'ps' list and found two 'mpirun' and three 'cr_checkpoint'
> processes running:
> ---------------------------------------
> 17255 ?        00:00:00 lamd
> 17256 pts/2    00:00:00 mpirun
> 17257 ?        00:00:15 rotating
> 17258 ?        00:00:15 rotating
> 17263 pts/3    00:00:00 lamcheckpoint
> 17264 pts/3    00:00:00 cr_checkpoint
> 17265 pts/2    00:00:00 mpirun
> 17266 ?        00:00:00 cr_checkpoint
> 17267 ?        00:00:00 cr_checkpoint
> ---------------------------------------
>
> --Yuan
>
>> Yuan,
>>
>> I've not encountered this problem before. It looks as if something is
>> triggering a LAM-internal error message. It is possible that this is
>> a result of a BLCR problem, or it could be a LAM/MPI problem. If the
>> problem *is* in BLCR, then there is not enough information here to try
>> to find it.
>> I see that you have also asked on the LAM/MPI mailing list, and that
>> Josh Hursey made a suggestion there. I am monitoring that thread and
>> will make any BLCR-specific comments if I can. However, at this point
>> I don't have any ideas beyond Josh's suggestion to explicitly set the
>> rpi module to crtcp.
>>
>> -Paul
>>
>> Yuan Wan wrote:
>>>
>>> Hi all,
>>>
>>> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>>>
>>> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
>>> I have built blcr-0.5.0 and it works well with serial codes.
>>>
>>> I built LAM/MPI 7.1.2:
>>> ---------------------------------------------
>>> $ ./configure --prefix=/home/pst/lam \
>>>               --with-rsh="ssh -x" \
>>>               --with-cr-blcr=/home/pst/blcr
>>> $ make
>>> $ make install
>>> ---------------------------------------------
>>>
>>> The laminfo output is:
>>> -----------------------------------------------------
>>>            LAM/MPI: 7.1.2
>>>             Prefix: /home/pst/lam
>>>       Architecture: i686-pc-linux-gnu
>>>      Configured by: pst
>>>      Configured on: Sat Mar 24 00:40:42 GMT 2007
>>>     Configure host: master00
>>>     Memory manager: ptmalloc2
>>>         C bindings: yes
>>>       C++ bindings: yes
>>>   Fortran bindings: yes
>>>         C compiler: gcc
>>>       C++ compiler: g++
>>>   Fortran compiler: g77
>>>    Fortran symbols: double_underscore
>>>        C profiling: yes
>>>      C++ profiling: yes
>>>  Fortran profiling: yes
>>>     C++ exceptions: no
>>>     Thread support: yes
>>>      ROMIO support: yes
>>>       IMPI support: no
>>>      Debug support: no
>>>       Purify clean: no
>>>           SSI boot: globus (API v1.1, Module v0.6)
>>>           SSI boot: rsh (API v1.1, Module v1.1)
>>>           SSI boot: slurm (API v1.1, Module v1.0)
>>>           SSI coll: lam_basic (API v1.1, Module v7.1)
>>>           SSI coll: shmem (API v1.1, Module v1.0)
>>>           SSI coll: smp (API v1.1, Module v1.2)
>>>            SSI rpi: crtcp (API v1.1, Module v1.1)
>>>            SSI rpi: lamd (API v1.0, Module v7.1)
>>>            SSI rpi: sysv (API v1.0, Module v7.1)
>>>            SSI rpi: tcp (API v1.0, Module v7.1)
>>>            SSI rpi: usysv (API v1.0, Module v7.1)
>>>             SSI cr: blcr (API v1.0, Module v1.1)
>>>             SSI cr: self (API v1.0, Module v1.0)
>>> --------------------------------------------------------
>>>
>>> My parallel code works well with LAM without any checkpointing:
>>> $ mpirun -np 2 ./job
>>>
>>> Then I run my parallel job in a checkpointable way:
>>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>>
>>> And checkpoint this job in another window:
>>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>>
>>> This operation produces a context file for mpirun,
>>>
>>> "context.mpirun.11928"
>>>
>>> plus two context files for the job:
>>>
>>> "context.11928-n0-11929"
>>> "context.11928-n0-11930"
>>>
>>> So far, so good :)
>>> -------------------------------------------------------
>>>
>>> However, when I restart the job with the context file:
>>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>>
>>> I get the following error:
>>>
>>> Results CORRECT on rank 0    ["This line is the output of the code"]
>>>
>>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
>>> Rank (0, MPI_COMM_WORLD):  - main()
>>> -----------------------------------------------------------------------------
>>> It seems that [at least] one of the processes that was started with
>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>> notified of the first one, which was on node n0).
>>>
>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>>> to run non-MPI programs over the lambooted nodes.
>>> -----------------------------------------------------------------------------
>>>
>>> Has anyone met this problem before, and does anyone know how to solve it?
>>>
>>> Many thanks,
>>>
>>> --Yuan
>>>
>>> Yuan Wan

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
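
For reference, a minimal sketch of the PPID check suggested above, using the PIDs from Yuan's ps listing (the exact column layout varies between ps versions):

---------------------------------------
# Full-format listing: the PPID column shows who started each mpirun.
$ ps -f -p 17256,17265

# The long format also includes PPID.
$ ps -l -p 17265
---------------------------------------

If the second mpirun's PPID points back at the lamcheckpoint/cr_checkpoint pair on pts/3, that would support the guess that it is internal to the checkpoint operation.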
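
For comparison, the serial-code case that Yuan reports working is roughly the following cycle (a sketch assuming BLCR's default behaviour of writing context.<PID> in the current directory; "serial_app" is a placeholder program name):

---------------------------------------
# Run the program under BLCR control so it can be checkpointed.
$ cr_run ./serial_app &

# In another window, checkpoint it by PID; by default this writes
# context.<PID> and lets the process continue running.
$ cr_checkpoint <PID of serial_app>

# Later, restart from the saved context file.
$ cr_restart context.<PID of serial_app>
---------------------------------------

The LAM/MPI case layers lamcheckpoint/lamrestart on top of this cycle, and that layer is where the SSI failure above appears.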