Re: problem: checkpoint lam/mpi with BLCR

From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Tue Mar 27 2007 - 00:49:55 PST

  • Next message: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
    On Mon, 26 Mar 2007, Paul H. Hargrove wrote:
    
    Hi Paul,
    
    Thanks for your reply.
    
    I have tried explicitly using the "crtcp" module, but it caused a
    failure on checkpoint:
    
    $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
    $ lamcheckpoint -ssi cr blcr -pid 17256
    
    -----------------------------------------------------------------------------
    Encountered a failure in the SSI types while continuing from
    checkpoint.  Aborting in despair :-(
    -----------------------------------------------------------------------------
    And the code never exits after reaching the end.
    I checked the 'ps' list and found two 'mpirun' and
    three 'cr_checkpoint' processes running:
    ---------------------------------------
    17255 ?        00:00:00 lamd
    17256 pts/2    00:00:00 mpirun
    17257 ?        00:00:15 rotating
    17258 ?        00:00:15 rotating
    17263 pts/3    00:00:00 lamcheckpoint
    17264 pts/3    00:00:00 cr_checkpoint
    17265 pts/2    00:00:00 mpirun
    17266 ?        00:00:00 cr_checkpoint
    17267 ?        00:00:00 cr_checkpoint
    ---------------------------------------
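
A minimal sketch for tallying leftover processes by name, as in the listing above. The function name `count_by_command` is hypothetical, and it assumes the command name is the last whitespace-separated field of each line, which holds for the default `ps` output shown here.

```shell
#!/bin/sh
# Tally processes by command name from a `ps`-style listing on stdin.
# Assumption: each process line starts with a numeric PID and ends with
# the command name; header and separator lines are skipped.
count_by_command() {
    awk '$1 ~ /^[0-9]+$/ { n[$NF]++ }
         END { for (c in n) printf "%s %d\n", c, n[c] }' | sort
}

# Usage: count the current user's processes by name.
#   ps -u "$USER" | count_by_command
```

Fed the listing above, this reports two `mpirun` and three `cr_checkpoint` processes, alongside one `lamcheckpoint`.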
    
    --Yuan
    
    
    
    >
    > Yuan,
    >
    > I've not encountered this problem before.  It looks as if something is 
    > triggering a LAM-internal error message.  It is possible that this is a 
    > result of a BLCR problem, or it could be a LAM/MPI problem.  If the problem 
    > *is* in BLCR, then there is not enough information here to try to find it.
    > I see that you have also asked on the LAM/MPI mailing list, and that Josh 
    > Hursey made a suggestion there.  I am monitoring that thread and will make 
    > any BLCR-specific comments if I can.  However, at this point I don't have any 
    > ideas beyond Josh's suggestion to explicitly set the rpi module to crtcp.
    >
    > -Paul
    >
    > Yuan Wan wrote:
    >> 
    >> Hi all,
    >> 
    >> I ran into a problem when checkpointing LAM/MPI code using BLCR.
    >> 
    >> My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
    >> I have built blcr-0.5.0 and it works well with serial codes.
    >> 
    >> I built LAM/MPI 7.1.2
    >> ---------------------------------------------
    >> $ ./configure --prefix=/home/pst/lam \
    >>             --with-rsh="ssh -x" \
    >>             --with-cr-blcr=/home/pst/blcr
    >> $ make
    >> $ make install
    >> ---------------------------------------------
    >> 
    >> The laminfo output is
    >> -----------------------------------------------------
    >>              LAM/MPI: 7.1.2
    >>               Prefix: /home/pst/lam
    >>         Architecture: i686-pc-linux-gnu
    >>        Configured by: pst
    >>        Configured on: Sat Mar 24 00:40:42 GMT 2007
    >>       Configure host: master00
    >>       Memory manager: ptmalloc2
    >>           C bindings: yes
    >>         C++ bindings: yes
    >>     Fortran bindings: yes
    >>           C compiler: gcc
    >>         C++ compiler: g++
    >>     Fortran compiler: g77
    >>      Fortran symbols: double_underscore
    >>          C profiling: yes
    >>        C++ profiling: yes
    >>    Fortran profiling: yes
    >>       C++ exceptions: no
    >>       Thread support: yes
    >>        ROMIO support: yes
    >>         IMPI support: no
    >>        Debug support: no
    >>         Purify clean: no
    >>             SSI boot: globus (API v1.1, Module v0.6)
    >>             SSI boot: rsh (API v1.1, Module v1.1)
    >>             SSI boot: slurm (API v1.1, Module v1.0)
    >>             SSI coll: lam_basic (API v1.1, Module v7.1)
    >>             SSI coll: shmem (API v1.1, Module v1.0)
    >>             SSI coll: smp (API v1.1, Module v1.2)
    >>              SSI rpi: crtcp (API v1.1, Module v1.1)
    >>              SSI rpi: lamd (API v1.0, Module v7.1)
    >>              SSI rpi: sysv (API v1.0, Module v7.1)
    >>              SSI rpi: tcp (API v1.0, Module v7.1)
    >>              SSI rpi: usysv (API v1.0, Module v7.1)
    >>               SSI cr: blcr (API v1.0, Module v1.1)
    >>               SSI cr: self (API v1.0, Module v1.0)
    >> --------------------------------------------------------
    >> 
    >> 
    >> My parallel code works well with LAM without any checkpointing:
    >> $ mpirun -np 2 ./job
    >> 
    >> Then I run my parallel job in a checkpointable way:
    >> $ mpirun -np 2 -ssi cr blcr ./rotating
    >> 
    >> Then I checkpoint this job from another window:
    >> $ lamcheckpoint -ssi cr blcr -pid 11928
    >> 
    >> This operation produces a context file for mpirun
    >> 
    >> "context.mpirun.11928"
    >> 
    >> plus two context files for the job
    >> 
    >> "context.11928-n0-11929"
    >> "context.11928-n0-11930"
    >> 
    >> So far so good :)
    >> -------------------------------------------------------
    >> 
    >> However, when I restart the job with the context file:
    >> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
    >> 
    >> I got the following error:
    >> 
    >> Results CORRECT on rank 0  ["This line is the output in code"]
    >> 
    >> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
    >> MPI_COMM_WORLD)
    >> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    >> Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
    >> Rank (0, MPI_COMM_WORLD):  - main()
    >> 
    >> ----------------------------------------------------------------------------- 
    >> It seems that [at least] one of the processes that was started with
    >> mpirun did not invoke MPI_INIT before quitting (it is possible that
    >> more than one process did not invoke MPI_INIT -- mpirun was only
    >> notified of the first one, which was on node n0).
    >> 
    >> mpirun can *only* be used with MPI programs (i.e., programs that
    >> invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
    >> to run non-MPI programs over the lambooted nodes.
    >> 
    >> ----------------------------------------------------------------------------- 
    >> 
    >> Has anyone encountered this problem before, and do you know how to solve it?
    >> 
    >> Many Thanks
    >> 
    >> 
    >> --Yuan
    >> 
    >> 
    >> Yuan Wan
    >
    >
    >
    
    -- 
    Unix Section
    Information Services Infrastructure Division
    University of Edinburgh
    
    tel: 0131 650 4985
    email: ywan@ed.ac.uk
    
    2032 Computing Services, JCMB
    The King's Buildings,
    Edinburgh, EH9 3JZ
    
