Re: problem: checkpoint lam/mpi with BLCR

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Mar 26 2007 - 09:58:55 PST

    Yuan,
    
  I've not encountered this problem before.  It looks as if something is 
    triggering a LAM-internal error message.  This could be the result of a 
    BLCR problem, or it could be a LAM/MPI problem.  If the problem *is* in 
    BLCR, then there is not enough information here for me to track it down.
  I see that you have also asked on the LAM/MPI mailing list, and that 
    Josh Hursey made a suggestion there.  I am monitoring that thread and 
    will add any BLCR-specific comments there if I can.  However, at this 
    point I have no ideas beyond Josh's suggestion to explicitly set the 
    rpi module to crtcp.
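
    For reference, forcing the crtcp rpi module explicitly would look 
    something like this (reusing the job invocation from your mail):

      $ mpirun -np 2 -ssi rpi crtcp -ssi cr blcr ./rotating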
    
    -Paul
    
    Yuan Wan wrote:
    >
    > Hi all,
    >
    > I've run into a problem when checkpointing LAM/MPI code with BLCR.
    >
    > My platform is a 2-cpu machine running Fedora Core 6 (kernel 2.6.19)
    > I have built blcr-0.5.0 and it works well with serial codes.
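    >
    > For reference, the serial tests used the standard BLCR workflow, along
    > these lines ("./a.out" is just a placeholder binary name):
    > ---------------------------------------------
    > $ cr_run ./a.out &          # run the code under BLCR control
    > $ cr_checkpoint <PID>       # writes context.<PID> by default
    > $ cr_restart context.<PID>  # restart from the context file
    > ---------------------------------------------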
    >
    > I built LAM/MPI 7.1.2
    > ---------------------------------------------
    > $ ./configure --prefix=/home/pst/lam \
    >               --with-rsh="ssh -x" \
    >               --with-cr-blcr=/home/pst/blcr
    > $ make
    > $ make install
    > ---------------------------------------------
    >
    > The laminfo output is
    > -----------------------------------------------------
    >              LAM/MPI: 7.1.2
    >               Prefix: /home/pst/lam
    >         Architecture: i686-pc-linux-gnu
    >        Configured by: pst
    >        Configured on: Sat Mar 24 00:40:42 GMT 2007
    >       Configure host: master00
    >       Memory manager: ptmalloc2
    >           C bindings: yes
    >         C++ bindings: yes
    >     Fortran bindings: yes
    >           C compiler: gcc
    >         C++ compiler: g++
    >     Fortran compiler: g77
    >      Fortran symbols: double_underscore
    >          C profiling: yes
    >        C++ profiling: yes
    >    Fortran profiling: yes
    >       C++ exceptions: no
    >       Thread support: yes
    >        ROMIO support: yes
    >         IMPI support: no
    >        Debug support: no
    >         Purify clean: no
    >             SSI boot: globus (API v1.1, Module v0.6)
    >             SSI boot: rsh (API v1.1, Module v1.1)
    >             SSI boot: slurm (API v1.1, Module v1.0)
    >             SSI coll: lam_basic (API v1.1, Module v7.1)
    >             SSI coll: shmem (API v1.1, Module v1.0)
    >             SSI coll: smp (API v1.1, Module v1.2)
    >              SSI rpi: crtcp (API v1.1, Module v1.1)
    >              SSI rpi: lamd (API v1.0, Module v7.1)
    >              SSI rpi: sysv (API v1.0, Module v7.1)
    >              SSI rpi: tcp (API v1.0, Module v7.1)
    >              SSI rpi: usysv (API v1.0, Module v7.1)
    >               SSI cr: blcr (API v1.0, Module v1.1)
    >               SSI cr: self (API v1.0, Module v1.0)
    > --------------------------------------------------------
    >
    >
    > My parallel code runs fine under LAM without any checkpointing:
    > $ mpirun -np 2 ./job
    >
    > Then I run my parallel job in a checkpointable way:
    > $ mpirun -np 2 -ssi cr blcr ./rotating
    >
    > And checkpoint this job from another window:
    > $ lamcheckpoint -ssi cr blcr -pid 11928
    >
    > This operation produces a context file for mpirun
    >
    > "context.mpirun.11928"
    >
    > plus two context files for the job
    >
    > "context.11928-n0-11929"
    > "context.11928-n0-11930"
    >
    > So far, so good :)
    > -------------------------------------------------------
    >
    > However, when I restart the job with the context file:
    > $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file 
    > ~/context.mpirun.11928
    >
    > I got the following error:
    >
    > Results CORRECT on rank 0   [this line is normal output from the code]
    >
    > MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
    > MPI_COMM_WORLD)
    > Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    > Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
    > Rank (0, MPI_COMM_WORLD):  - main()
    > ----------------------------------------------------------------------------- 
    >
    > It seems that [at least] one of the processes that was started with
    > mpirun did not invoke MPI_INIT before quitting (it is possible that
    > more than one process did not invoke MPI_INIT -- mpirun was only
    > notified of the first one, which was on node n0).
    >
    > mpirun can *only* be used with MPI programs (i.e., programs that
    > invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
    > to run non-MPI programs over the lambooted nodes.
    > ----------------------------------------------------------------------------- 
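    >
    > (For reference, a minimal program of the kind mpirun expects looks
    > like the sketch below; the file name "hello.c" is made up, and the
    > calls are the standard MPI C API:)
    > -----------------------------------------------------------------------
    > /* hello.c -- every process started by mpirun must call MPI_Init
    >  * before any other MPI call and MPI_Finalize before exiting. */
    > #include <mpi.h>
    > #include <stdio.h>
    >
    > int main(int argc, char **argv)
    > {
    >     int rank;
    >     MPI_Init(&argc, &argv);                /* required first MPI call */
    >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    >     printf("hello from rank %d\n", rank);
    >     MPI_Finalize();                        /* required final MPI call */
    >     return 0;
    > }
    > -----------------------------------------------------------------------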
    >
    >
    > Has anyone run into this problem before and found a way to solve it?
    >
    > Many Thanks
    >
    >
    > --Yuan
    >
    >
    > Yuan Wan
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
