From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 27 2007 - 11:39:03 PST
Yuan,

I've certainly not seen anything like that before. The fact that the error message changed after adding "-ssi rpi crtcp" suggests to me that Josh was on the right track. However, the new failure mode looks even more ominous. My best guess would be that something changed in either BLCR or FC6 that has broken the assumptions being made by the crtcp rpi module in LAM/MPI. I don't currently have a system on which to test LAM/MPI+BLCR, so I can't verify this. Depending on what has broken, the fix might belong in either LAM/MPI or BLCR. I am afraid I probably won't have any chance to look at this in detail for a couple of weeks at least.

Not sure about the 2 mpirun instances, but I would guess that one of them might be internal to the lamcheckpoint operation. Passing an option such as "-f" or "-l" to ps would give the parent id (PPID) and make it clear who/what started the 2nd mpirun.

As for the 3 cr_checkpoint instances, they correspond to the 3 context files you would eventually get: one for the mpirun and one for each of the two "rotating" processes.

-Paul

Yuan Wan wrote:
> On Mon, 26 Mar 2007, Paul H. Hargrove wrote:
>
> Hi Paul,
>
> Thanks for your reply.
>
> I have tried to explicitly use the "crtcp" module, but it caused a
> failure on checkpoint:
>
> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
> $ lamcheckpoint -ssi cr blcr -pid 17256
> -----------------------------------------------------------------------------
> Encountered a failure in the SSI types while continuing from
> checkpoint. Aborting in despair :-(
> -----------------------------------------------------------------------------
>
> And the code never exits after it reaches the end.
> I checked the 'ps' list and found two 'mpirun' and three 'cr_checkpoint'
> processes running:
> ---------------------------------------
> 17255 ?        00:00:00 lamd
> 17256 pts/2    00:00:00 mpirun
> 17257 ?        00:00:15 rotating
> 17258 ?        00:00:15 rotating
> 17263 pts/3    00:00:00 lamcheckpoint
> 17264 pts/3    00:00:00 cr_checkpoint
> 17265 pts/2    00:00:00 mpirun
> 17266 ?        00:00:00 cr_checkpoint
> 17267 ?        00:00:00 cr_checkpoint
> ---------------------------------------
>
> --Yuan
>
>> Yuan,
>>
>> I've not encountered this problem before. It looks as if something is
>> triggering a LAM-internal error message. It is possible that this is
>> a result of a BLCR problem, or it could be a LAM/MPI problem. If the
>> problem *is* in BLCR, then there is not enough information here to try
>> to find it.
>> I see that you have also asked on the LAM/MPI mailing list, and that
>> Josh Hursey made a suggestion there. I am monitoring that thread and
>> will make any BLCR-specific comments if I can. However, at this point
>> I don't have any ideas beyond Josh's suggestion to explicitly set the
>> rpi module to crtcp.
>>
>> -Paul
>>
>> Yuan Wan wrote:
>>>
>>> Hi all,
>>>
>>> I have run into a problem when checkpointing LAM/MPI code using BLCR.
>>>
>>> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
>>> I have built blcr-0.5.0 and it works well with serial codes.
>>>
>>> I built LAM/MPI 7.1.2:
>>> ---------------------------------------------
>>> $ ./configure --prefix=/home/pst/lam \
>>>               --with-rsh="ssh -x" \
>>>               --with-cr-blcr=/home/pst/blcr
>>> $ make
>>> $ make install
>>> ---------------------------------------------
>>>
>>> The laminfo output is:
>>> -----------------------------------------------------
>>>            LAM/MPI: 7.1.2
>>>             Prefix: /home/pst/lam
>>>       Architecture: i686-pc-linux-gnu
>>>      Configured by: pst
>>>      Configured on: Sat Mar 24 00:40:42 GMT 2007
>>>     Configure host: master00
>>>     Memory manager: ptmalloc2
>>>         C bindings: yes
>>>       C++ bindings: yes
>>>   Fortran bindings: yes
>>>         C compiler: gcc
>>>       C++ compiler: g++
>>>   Fortran compiler: g77
>>>    Fortran symbols: double_underscore
>>>        C profiling: yes
>>>      C++ profiling: yes
>>>  Fortran profiling: yes
>>>     C++ exceptions: no
>>>     Thread support: yes
>>>      ROMIO support: yes
>>>       IMPI support: no
>>>      Debug support: no
>>>       Purify clean: no
>>>           SSI boot: globus (API v1.1, Module v0.6)
>>>           SSI boot: rsh (API v1.1, Module v1.1)
>>>           SSI boot: slurm (API v1.1, Module v1.0)
>>>           SSI coll: lam_basic (API v1.1, Module v7.1)
>>>           SSI coll: shmem (API v1.1, Module v1.0)
>>>           SSI coll: smp (API v1.1, Module v1.2)
>>>            SSI rpi: crtcp (API v1.1, Module v1.1)
>>>            SSI rpi: lamd (API v1.0, Module v7.1)
>>>            SSI rpi: sysv (API v1.0, Module v7.1)
>>>            SSI rpi: tcp (API v1.0, Module v7.1)
>>>            SSI rpi: usysv (API v1.0, Module v7.1)
>>>             SSI cr: blcr (API v1.0, Module v1.1)
>>>             SSI cr: self (API v1.0, Module v1.0)
>>> --------------------------------------------------------
>>>
>>> My parallel code works well with LAM without any checkpointing:
>>> $ mpirun -np 2 ./job
>>>
>>> Then I run my parallel job in a checkpointable way:
>>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>>
>>> And checkpoint this job in another window:
>>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>>
>>> This operation produces a context file for mpirun,
>>>
>>> "context.mpirun.11928"
>>>
>>> plus two context files for the job:
>>>
>>> "context.11928-n0-11929"
>>> "context.11928-n0-11930"
>>>
>>> So far, so good :)
>>> -------------------------------------------------------
>>>
>>> However, when I restart the job with the context file:
>>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>>
>>> I get the following error:
>>>
>>> Results CORRECT on rank 0    ["This line is the output of the code"]
>>>
>>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
>>> Rank (0, MPI_COMM_WORLD):  - main()
>>> -----------------------------------------------------------------------------
>>> It seems that [at least] one of the processes that was started with
>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>> notified of the first one, which was on node n0).
>>>
>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>>> to run non-MPI programs over the lambooted nodes.
>>> -----------------------------------------------------------------------------
>>>
>>> Has anyone met this problem before, and does anyone know how to solve it?
>>>
>>> Many thanks,
>>>
>>> --Yuan
>>>
>>> Yuan Wan

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
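
For reference, a minimal sketch of the PPID check suggested above, using the PIDs from Yuan's ps listing (the exact column layout varies between ps versions):

---------------------------------------
# Full-format listing: the PPID column shows who started each mpirun.
$ ps -f -p 17256,17265

# The long format also includes PPID.
$ ps -l -p 17265
---------------------------------------

If the second mpirun's PPID points back at the lamcheckpoint/cr_checkpoint pair on pts/3, that would support the guess that it is internal to the checkpoint operation.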
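
For comparison, the serial-code case that Yuan reports working is roughly the following cycle (a sketch assuming BLCR's default behaviour of writing context.<PID> in the current directory; "serial_app" is a placeholder program name):

---------------------------------------
# Run the program under BLCR control so it can be checkpointed.
$ cr_run ./serial_app &

# In another window, checkpoint it by PID; by default this writes
# context.<PID> and lets the process continue running.
$ cr_checkpoint <PID of serial_app>

# Later, restart from the saved context file.
$ cr_restart context.<PID of serial_app>
---------------------------------------

The LAM/MPI case layers lamcheckpoint/lamrestart on top of this cycle, and that layer is where the SSI failure above appears.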