Re: problem: checkpoint lam/mpi with BLCR


From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Tue Mar 27 2007 - 00:49:55 PST

Next message: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"

Previous message: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
In reply to: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
Next in thread: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
Reply: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"

On Mon, 26 Mar 2007, Paul H. Hargrove wrote:

Hi Paul,

Thanks for your reply.

I have tried explicitly selecting the "crtcp" rpi module, but that
causes a failure at checkpoint time:

$ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
$ lamcheckpoint -ssi cr blcr -pid 17256

-----------------------------------------------------------------------------
Encountered a failure in the SSI types while continuing from
checkpoint.  Aborting in despair :-(
-----------------------------------------------------------------------------
Also, the code never exits after it reaches the end.
I checked the 'ps' list and found two 'mpirun' and
three 'cr_checkpoint' processes running:
---------------------------------------
17255 ?        00:00:00 lamd
17256 pts/2    00:00:00 mpirun
17257 ?        00:00:15 rotating
17258 ?        00:00:15 rotating
17263 pts/3    00:00:00 lamcheckpoint
17264 pts/3    00:00:00 cr_checkpoint
17265 pts/2    00:00:00 mpirun
17266 ?        00:00:00 cr_checkpoint
17267 ?        00:00:00 cr_checkpoint
---------------------------------------
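
A quick way to get the same tally without reading the list by eye. This
is only a sketch: tally_procs is a hypothetical helper name (not part of
LAM/MPI or BLCR), and it assumes the command name is the last field of
each 'ps' line, as in the listing above.

```shell
# Tally processes by command name from ps-style lines on stdin.
# Assumption: the command name is the last whitespace-separated field.
tally_procs() {
    awk 'NF >= 4 { count[$NF]++ }
         END { for (name in count) print count[name], name }' | sort -k2
}

# Usage (process names taken from the listing above):
#   ps -e | grep -E 'mpirun|lamcheckpoint|cr_checkpoint|rotating' | tally_procs
```

Filtering with grep first also drops the 'ps' header line, which would
otherwise be counted as a command.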

--Yuan



>
> Yuan,
>
> I've not encountered this problem before.  It looks as if something is 
> triggering a LAM-internal error message.  It is possible that this is a 
> result of a BLCR problem, or it could be a LAM/MPI problem.  If the problem 
> *is* in BLCR, then there is not enough information here to try to find it.
> I see that you have also asked on the LAM/MPI mailing list, and that Josh 
> Hursey made a suggestion there.  I am monitoring that thread and will make 
> any BLCR-specific comments if I can.  However, at this point I don't have any 
> ideas beyond Josh's suggestion to explicitly set the rpi module to crtcp.
>
> -Paul
>
> Yuan Wan wrote:
>> 
>> Hi all,
>> 
>> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>> 
>> My platform is a 2-cpu machine running Fedora Core 6 (kernel 2.6.19)
>> I have built blcr-0.5.0 and it works well with serial codes.
>> 
>> I built LAM/MPI 7.1.2
>> ---------------------------------------------
>> $ ./configure --prefix=/home/pst/lam \
>>             --with-rsh="ssh -x" \
>>             --with-cr-blcr=/home/pst/blcr
>> $ make
>> $ make install
>> ---------------------------------------------
>> 
>> The laminfo output is
>> -----------------------------------------------------
>>              LAM/MPI: 7.1.2
>>               Prefix: /home/pst/lam
>>         Architecture: i686-pc-linux-gnu
>>        Configured by: pst
>>        Configured on: Sat Mar 24 00:40:42 GMT 2007
>>       Configure host: master00
>>       Memory manager: ptmalloc2
>>           C bindings: yes
>>         C++ bindings: yes
>>     Fortran bindings: yes
>>           C compiler: gcc
>>         C++ compiler: g++
>>     Fortran compiler: g77
>>      Fortran symbols: double_underscore
>>          C profiling: yes
>>        C++ profiling: yes
>>    Fortran profiling: yes
>>       C++ exceptions: no
>>       Thread support: yes
>>        ROMIO support: yes
>>         IMPI support: no
>>        Debug support: no
>>         Purify clean: no
>>             SSI boot: globus (API v1.1, Module v0.6)
>>             SSI boot: rsh (API v1.1, Module v1.1)
>>             SSI boot: slurm (API v1.1, Module v1.0)
>>             SSI coll: lam_basic (API v1.1, Module v7.1)
>>             SSI coll: shmem (API v1.1, Module v1.0)
>>             SSI coll: smp (API v1.1, Module v1.2)
>>              SSI rpi: crtcp (API v1.1, Module v1.1)
>>              SSI rpi: lamd (API v1.0, Module v7.1)
>>              SSI rpi: sysv (API v1.0, Module v7.1)
>>              SSI rpi: tcp (API v1.0, Module v7.1)
>>              SSI rpi: usysv (API v1.0, Module v7.1)
>>               SSI cr: blcr (API v1.0, Module v1.1)
>>               SSI cr: self (API v1.0, Module v1.0)
>> --------------------------------------------------------
>> 
>> 
>> My parallel code works well with LAM without any checkpointing:
>> $ mpirun -np 2 ./job
>> 
>> Then I run my parallel job in a checkpointable way:
>> $ mpirun -np 2 -ssi cr blcr ./rotating
>> 
>> And checkpoint this job from another window:
>> $ lamcheckpoint -ssi cr blcr -pid 11928
>> 
>> This operation produces a context file for mpirun
>> 
>> "context.mpirun.11928"
>> 
>> plus two context files for the job
>> 
>> "context.11928-n0-11929"
>> "context.11928-n0-11930"
>> 
>> So far so good :)
>> -------------------------------------------------------
>> 
>> However, when I restart the job with the context file:
>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>> 
>> I got the following error:
>> 
>> Results CORRECT on rank 0  ["This line is the output in code"]
>> 
>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
>> MPI_COMM_WORLD)
>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
>> Rank (0, MPI_COMM_WORLD):  - main()
>> 
>> ----------------------------------------------------------------------------- 
>> It seems that [at least] one of the processes that was started with
>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>> more than one process did not invoke MPI_INIT -- mpirun was only
>> notified of the first one, which was on node n0).
>> 
>> mpirun can *only* be used with MPI programs (i.e., programs that
>> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
>> to run non-MPI programs over the lambooted nodes.
>> 
>> ----------------------------------------------------------------------------- 
>> 
>> Has anyone encountered this problem before, or does anyone know how to solve it?
>> 
>> Many Thanks
>> 
>> 
>> --Yuan
>> 
>> 
>> Yuan Wan
>
>
>

-- 
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: [email protected]

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ
