problem: checkpoint lam/mpi with BLCR

From: Yuan Wan (
Date: Mon Mar 26 2007 - 07:46:54 PST

  • Next message: Paul H. Hargrove: "Re: problem: checkpoint lam/mpi with BLCR"
    Hi all,
    I ran into a problem when checkpointing LAM/MPI code using BLCR.
    My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
    I have built blcr-0.5.0 and it works well with serial codes.
    I built LAM/MPI 7.1.2 as follows:
    $ ./configure --prefix=/home/pst/lam \
                 --with-rsh="ssh -x"
    $ make
    $ make install
    The laminfo output is:
                  LAM/MPI: 7.1.2
                   Prefix: /home/pst/lam
             Architecture: i686-pc-linux-gnu
            Configured by: pst
            Configured on: Sat Mar 24 00:40:42 GMT 2007
           Configure host: master00
           Memory manager: ptmalloc2
               C bindings: yes
             C++ bindings: yes
         Fortran bindings: yes
               C compiler: gcc
             C++ compiler: g++
         Fortran compiler: g77
          Fortran symbols: double_underscore
              C profiling: yes
            C++ profiling: yes
        Fortran profiling: yes
           C++ exceptions: no
           Thread support: yes
            ROMIO support: yes
             IMPI support: no
            Debug support: no
             Purify clean: no
                 SSI boot: globus (API v1.1, Module v0.6)
                 SSI boot: rsh (API v1.1, Module v1.1)
                 SSI boot: slurm (API v1.1, Module v1.0)
                 SSI coll: lam_basic (API v1.1, Module v7.1)
                 SSI coll: shmem (API v1.1, Module v1.0)
                 SSI coll: smp (API v1.1, Module v1.2)
                  SSI rpi: crtcp (API v1.1, Module v1.1)
                  SSI rpi: lamd (API v1.0, Module v7.1)
                  SSI rpi: sysv (API v1.0, Module v7.1)
                  SSI rpi: tcp (API v1.0, Module v7.1)
                  SSI rpi: usysv (API v1.0, Module v7.1)
                   SSI cr: blcr (API v1.0, Module v1.1)
                   SSI cr: self (API v1.0, Module v1.0)
    My parallel code works well with LAM when no checkpointing is involved:
    $ mpirun -np 2 ./job
    Then I ran my parallel job in a checkpointable way:
    $ mpirun -np 2 -ssi cr blcr ./rotating
    and checkpointed it from another window:
    $ lamcheckpoint -ssi cr blcr -pid 11928
    This operation produced a context file for mpirun
    plus two context files for the job.
    So far so good :)
    However, when I restart the job from the mpirun context file:
    $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
    I get the following error:
    Results CORRECT on rank 0  [this line is output from my own code]
    MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
    Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
    Rank (0, MPI_COMM_WORLD):  - main()
    It seems that [at least] one of the processes that was started with
    mpirun did not invoke MPI_INIT before quitting (it is possible that
    more than one process did not invoke MPI_INIT -- mpirun was only
    notified of the first one, which was on node n0).
    mpirun can *only* be used with MPI programs (i.e., programs that
    invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
    to run non-MPI programs over the lambooted nodes.
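    To make the sequence easier to reproduce, here is a minimal dry-run
    sketch of the three steps above. It only prints the commands and does
    not call LAM or BLCR. The variable names (NP, JOB, MPIRUN_PID,
    CONTEXT_DIR) and the helper functions are illustrative and not part of
    either tool; the defaults are the values from the commands quoted in
    this message.

```shell
#!/bin/sh
# Dry-run sketch: prints the checkpoint/restart commands in order.
# Nothing here is executed against LAM/MPI or BLCR.
# All variable and function names are hypothetical helpers for this sketch.

NP=${NP:-2}                        # number of MPI processes
JOB=${JOB:-./rotating}             # MPI executable to run
MPIRUN_PID=${MPIRUN_PID:-11928}    # PID of the running mpirun
CONTEXT_DIR=${CONTEXT_DIR:-$HOME}  # directory holding the context files

# Step 1: start the job with the blcr checkpoint/restart module selected.
run_cmd() {
    echo "mpirun -np $NP -ssi cr blcr $JOB"
}

# Step 2: checkpoint the job by naming the PID of mpirun.
checkpoint_cmd() {
    echo "lamcheckpoint -ssi cr blcr -pid $MPIRUN_PID"
}

# Step 3: restart from the context file written for mpirun.
restart_cmd() {
    echo "lamrestart -ssi cr blcr -ssi cr_blcr_context_file $CONTEXT_DIR/context.mpirun.$MPIRUN_PID"
}

echo "1. start:      $(run_cmd)"
echo "2. checkpoint: $(checkpoint_cmd)"
echo "3. restart:    $(restart_cmd)"
```

    Overriding MPIRUN_PID or CONTEXT_DIR in the environment changes the
    printed commands, which makes it easy to check that the restart step
    points at the context file written for that particular mpirun.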
    Has anyone met this problem before, and does anyone know how to solve it?
    Many thanks,
    Yuan Wan
    Unix Section
    Information Services Infrastructure Division
    University of Edinburgh
    tel: 0131 650 4985
    2032 Computing Services, JCMB
    The King's Buildings,
    Edinburgh, EH9 3JZ
