From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Mon Mar 26 2007 - 07:46:54 PST
Hi all,

I have run into a problem when checkpointing LAM/MPI code using BLCR. My
platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19). I have
built blcr-0.5.0 and it works well with serial codes.

I built LAM/MPI 7.1.2:
---------------------------------------------
$ ./configure --prefix=/home/pst/lam --with-rsh="ssh -x" \
    --with-cr-blcr=/home/pst/blcr
$ make
$ make install
---------------------------------------------

The laminfo output is:
-----------------------------------------------------
             LAM/MPI: 7.1.2
              Prefix: /home/pst/lam
        Architecture: i686-pc-linux-gnu
       Configured by: pst
       Configured on: Sat Mar 24 00:40:42 GMT 2007
      Configure host: master00
      Memory manager: ptmalloc2
          C bindings: yes
        C++ bindings: yes
    Fortran bindings: yes
          C compiler: gcc
        C++ compiler: g++
    Fortran compiler: g77
     Fortran symbols: double_underscore
         C profiling: yes
       C++ profiling: yes
   Fortran profiling: yes
      C++ exceptions: no
      Thread support: yes
       ROMIO support: yes
        IMPI support: no
       Debug support: no
        Purify clean: no
            SSI boot: globus (API v1.1, Module v0.6)
            SSI boot: rsh (API v1.1, Module v1.1)
            SSI boot: slurm (API v1.1, Module v1.0)
            SSI coll: lam_basic (API v1.1, Module v7.1)
            SSI coll: shmem (API v1.1, Module v1.0)
            SSI coll: smp (API v1.1, Module v1.2)
             SSI rpi: crtcp (API v1.1, Module v1.1)
             SSI rpi: lamd (API v1.0, Module v7.1)
             SSI rpi: sysv (API v1.0, Module v7.1)
             SSI rpi: tcp (API v1.0, Module v7.1)
             SSI rpi: usysv (API v1.0, Module v7.1)
              SSI cr: blcr (API v1.0, Module v1.1)
              SSI cr: self (API v1.0, Module v1.0)
--------------------------------------------------------

My parallel code works well with LAM without any checkpointing:

$ mpirun -np 2 ./job

Then I run my parallel job in a checkpointable way:

$ mpirun -np 2 -ssi cr blcr ./rotating

and checkpoint this job from another window:

$ lamcheckpoint -ssi cr blcr -pid 11928

This operation produces a context file for mpirun, "context.mpirun.11928",
plus two context files for the job, "context.11928-n0-11929" and
"context.11928-n0-11930". Seems so far so good :)

-------------------------------------------------------
However, when I restart the job with the context file:

$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928

I got the following error:

Results CORRECT on rank 0        [this line is the normal output from my code]

MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
Rank (0, MPI_COMM_WORLD):  - main()
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that more
than one process did not invoke MPI_INIT -- mpirun was only notified of
the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Has anyone met this problem before, and does anyone know how to solve it?

Many thanks,
--Yuan


Yuan Wan

--
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985        email: [email protected]
2032 Computing Services, JCMB
The King's Buildings, Edinburgh, EH9 3JZ
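
P.S. For reference, a minimal sketch in C of the kind of program being
checkpointed -- not my actual rotating source, just the usual MPI skeleton it
follows (the message text and compile command below are only illustrative):
MPI_Init first, the long-running computation in the middle where the
checkpoint is taken, a result line, then MPI_Finalize, which is the call that
fails after the restart.
-----------------------------------------------------------------------------
/* Minimal MPI skeleton (illustrative only).
 * Build with, e.g.:  mpicc skeleton.c -o skeleton  */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* every process calls this first */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... long-running computation here; the checkpoint is taken
     * while the processes are inside this region ... */

    printf("Results CORRECT on rank %d\n", rank);  /* result line as above */

    MPI_Finalize();                        /* the call that fails on restart */
    return 0;
}
-----------------------------------------------------------------------------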