problem: checkpoint lam/mpi with BLCR

From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Mon Mar 26 2007 - 07:46:54 PST

  • Next message: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
    Hi all,
    
    I ran into a problem when checkpointing LAM/MPI code using BLCR.
    
    My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
    I have built blcr-0.5.0 and it works well with serial codes.
    
    I built LAM/MPI 7.1.2 as follows:
    ---------------------------------------------
    $ ./configure --prefix=/home/pst/lam \
                 --with-rsh="ssh -x" \
                 --with-cr-blcr=/home/pst/blcr
    $ make
    $ make install
    ---------------------------------------------
    
    The laminfo output is
    -----------------------------------------------------
                  LAM/MPI: 7.1.2
                   Prefix: /home/pst/lam
             Architecture: i686-pc-linux-gnu
            Configured by: pst
            Configured on: Sat Mar 24 00:40:42 GMT 2007
           Configure host: master00
           Memory manager: ptmalloc2
               C bindings: yes
             C++ bindings: yes
         Fortran bindings: yes
               C compiler: gcc
             C++ compiler: g++
         Fortran compiler: g77
          Fortran symbols: double_underscore
              C profiling: yes
            C++ profiling: yes
        Fortran profiling: yes
           C++ exceptions: no
           Thread support: yes
            ROMIO support: yes
             IMPI support: no
            Debug support: no
             Purify clean: no
                 SSI boot: globus (API v1.1, Module v0.6)
                 SSI boot: rsh (API v1.1, Module v1.1)
                 SSI boot: slurm (API v1.1, Module v1.0)
                 SSI coll: lam_basic (API v1.1, Module v7.1)
                 SSI coll: shmem (API v1.1, Module v1.0)
                 SSI coll: smp (API v1.1, Module v1.2)
                  SSI rpi: crtcp (API v1.1, Module v1.1)
                  SSI rpi: lamd (API v1.0, Module v7.1)
                  SSI rpi: sysv (API v1.0, Module v7.1)
                  SSI rpi: tcp (API v1.0, Module v7.1)
                  SSI rpi: usysv (API v1.0, Module v7.1)
                   SSI cr: blcr (API v1.0, Module v1.1)
                   SSI cr: self (API v1.0, Module v1.0)
    --------------------------------------------------------
    
    
    My parallel code runs fine under LAM without any checkpointing:
    $ mpirun -np 2 ./job
    
    Then I run my parallel job in a checkpointable way:
    $ mpirun -np 2 -ssi cr blcr ./rotating
    
    and checkpoint this job from another window:
    $ lamcheckpoint -ssi cr blcr -pid 11928
    
    This operation produces a context file for mpirun
    
    "context.mpirun.11928"
    
    plus two context files for the job
    
    "context.11928-n0-11929"
    "context.11928-n0-11930"
    
    Seems so far so good :)
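
    The file check described above can be scripted so that a restart is
    only attempted when all three context files are present. This is a
    minimal sketch, not part of LAM/MPI or BLCR: the function name
    check_context_files and its arguments (directory, mpirun PID,
    expected number of processes) are hypothetical, and it assumes the
    naming pattern seen above (context.mpirun.<pid> plus one
    context.<pid>-n<node>-<process pid> file per process).

```shell
#!/bin/sh
# Hypothetical helper (not part of LAM/MPI or BLCR): confirm that the
# checkpoint wrote the expected context files before attempting a restart.
# Usage: check_context_files DIR MPIRUN_PID EXPECTED_PROCESSES
check_context_files() {
    dir=$1
    pid=$2
    expected=$3

    # The mpirun context file is assumed to be named context.mpirun.<pid>.
    if [ ! -s "$dir/context.mpirun.$pid" ]; then
        echo "missing or empty: $dir/context.mpirun.$pid" >&2
        return 1
    fi

    # Per-process files are assumed to look like
    # context.<pid>-n<node>-<process pid>; count the non-empty ones.
    found=0
    for f in "$dir"/context."$pid"-n*-*; do
        if [ -s "$f" ]; then
            found=$((found + 1))
        fi
    done

    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected process context files, found $found" >&2
        return 1
    fi

    echo "ok: mpirun context plus $found process context files"
}
```

    For the run above this would be called as
    check_context_files "$HOME" 11928 2. It only checks that the files
    exist and are non-empty; it says nothing about whether their
    contents can be restarted.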
    -------------------------------------------------------
    
    However, when I restart the job with the context file:
    $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
    
    I got the following error:
    
    Results CORRECT on rank 0  [this line is printed by my code]
    
    MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
    MPI_COMM_WORLD)
    Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
    Rank (0, MPI_COMM_WORLD):  - main()
    -----------------------------------------------------------------------------
    It seems that [at least] one of the processes that was started with
    mpirun did not invoke MPI_INIT before quitting (it is possible that
    more than one process did not invoke MPI_INIT -- mpirun was only
    notified of the first one, which was on node n0).
    
    mpirun can *only* be used with MPI programs (i.e., programs that
    invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
    to run non-MPI programs over the lambooted nodes.
    -----------------------------------------------------------------------------
    
    Has anyone met this problem before, and does anyone know how to solve it?
    
    Many Thanks
    
    
    --Yuan
    
    
    Yuan Wan
    -- 
    Unix Section
    Information Services Infrastructure Division
    University of Edinburgh
    
    tel: 0131 650 4985
    email: [email protected]
    
    2032 Computing Services, JCMB
    The King's Buildings,
    Edinburgh, EH9 3JZ
    
