From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Mar 26 2007 - 09:58:55 PST
Yuan,

I've not encountered this problem before. It looks as if something is
triggering a LAM-internal error message. It is possible that this is a
result of a BLCR problem, or it could be a LAM/MPI problem. If the
problem *is* in BLCR, then there is not enough information here to try
to find it.

I see that you have also asked on the LAM/MPI mailing list, and that
Josh Hursey made a suggestion there. I am monitoring that thread and
will make any BLCR-specific comments if I can. However, at this point I
don't have any ideas beyond Josh's suggestion to explicitly set the rpi
module to crtcp.
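For reference, that suggestion would look something like the following,
re-using your 2-process run of "rotating" (this is only a sketch of the
idea; the exact form of Josh's suggestion may differ):

  # a sketch: pin the rpi module to crtcp at run time, not necessarily Josh's exact command
  $ mpirun -np 2 -ssi rpi crtcp -ssi cr blcr ./rotating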
-Paul

Yuan Wan wrote:
>
> Hi all,
>
> I got some problems when checkpointing LAM/MPI code using BLCR.
>
> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
> I have built blcr-0.5.0 and it works well with serial codes.
>
> I built LAM/MPI 7.1.2:
> ---------------------------------------------
> $ ./configure --prefix=/home/pst/lam \
>               --with-rsh="ssh -x" \
>               --with-cr-blcr=/home/pst/blcr
> $ make
> $ make install
> ---------------------------------------------
>
> The laminfo output is:
> -----------------------------------------------------
>            LAM/MPI: 7.1.2
>             Prefix: /home/pst/lam
>       Architecture: i686-pc-linux-gnu
>      Configured by: pst
>      Configured on: Sat Mar 24 00:40:42 GMT 2007
>     Configure host: master00
>     Memory manager: ptmalloc2
>         C bindings: yes
>       C++ bindings: yes
>   Fortran bindings: yes
>         C compiler: gcc
>       C++ compiler: g++
>   Fortran compiler: g77
>    Fortran symbols: double_underscore
>        C profiling: yes
>      C++ profiling: yes
>  Fortran profiling: yes
>     C++ exceptions: no
>     Thread support: yes
>      ROMIO support: yes
>       IMPI support: no
>      Debug support: no
>       Purify clean: no
>           SSI boot: globus (API v1.1, Module v0.6)
>           SSI boot: rsh (API v1.1, Module v1.1)
>           SSI boot: slurm (API v1.1, Module v1.0)
>           SSI coll: lam_basic (API v1.1, Module v7.1)
>           SSI coll: shmem (API v1.1, Module v1.0)
>           SSI coll: smp (API v1.1, Module v1.2)
>            SSI rpi: crtcp (API v1.1, Module v1.1)
>            SSI rpi: lamd (API v1.0, Module v7.1)
>            SSI rpi: sysv (API v1.0, Module v7.1)
>            SSI rpi: tcp (API v1.0, Module v7.1)
>            SSI rpi: usysv (API v1.0, Module v7.1)
>             SSI cr: blcr (API v1.0, Module v1.1)
>             SSI cr: self (API v1.0, Module v1.0)
> --------------------------------------------------------
>
> My parallel code works well with LAM without any checkpointing:
> $ mpirun -np 2 ./job
>
> Then I run my parallel job in a checkpointable way:
> $ mpirun -np 2 -ssi cr blcr ./rotating
>
> And checkpoint this job in another window:
> $ lamcheckpoint -ssi cr blcr -pid 11928
>
> This operation produces a context file for mpirun,
>
> "context.mpirun.11928"
>
> plus two context files for the job:
>
> "context.11928-n0-11929"
> "context.11928-n0-11930"
>
> Seems so far so good :)
> -------------------------------------------------------
>
> However, when I restart the job with the context file:
> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>
> I got the following error:
>
> Results CORRECT on rank 0   ["This line is the output of the code"]
>
> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
> Rank (0, MPI_COMM_WORLD):  - main()
> -----------------------------------------------------------------------------
> It seems that [at least] one of the processes that was started with
> mpirun did not invoke MPI_INIT before quitting (it is possible that
> more than one process did not invoke MPI_INIT -- mpirun was only
> notified of the first one, which was on node n0).
>
> mpirun can *only* be used with MPI programs (i.e., programs that
> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
> to run non-MPI programs over the lambooted nodes.
> -----------------------------------------------------------------------------
>
> Anyone met this problem before and know how to solve it?
>
> Many Thanks
>
> --Yuan
>
> Yuan Wan

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900