problem: checkpoint lam/mpi with BLCR

Date: Mon Mar 26 2007 - 07:46:54 PST

    Hi all,
    I have run into a problem checkpointing LAM/MPI code with BLCR.
    My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
    I have built blcr-0.5.0, and it works well with serial codes.
    I built LAM/MPI 7.1.2 as follows:
    $ ./configure --prefix=/home/pst/lam \
                  --with-rsh="ssh -x"
    $ make
    $ make install
    The laminfo output is:
                  LAM/MPI: 7.1.2
                   Prefix: /home/pst/lam
             Architecture: i686-pc-linux-gnu
            Configured by: pst
            Configured on: Sat Mar 24 00:40:42 GMT 2007
           Configure host: master00
           Memory manager: ptmalloc2
               C bindings: yes
             C++ bindings: yes
         Fortran bindings: yes
               C compiler: gcc
             C++ compiler: g++
         Fortran compiler: g77
          Fortran symbols: double_underscore
              C profiling: yes
            C++ profiling: yes
        Fortran profiling: yes
           C++ exceptions: no
           Thread support: yes
            ROMIO support: yes
             IMPI support: no
            Debug support: no
             Purify clean: no
                 SSI boot: globus (API v1.1, Module v0.6)
                 SSI boot: rsh (API v1.1, Module v1.1)
                 SSI boot: slurm (API v1.1, Module v1.0)
                 SSI coll: lam_basic (API v1.1, Module v7.1)
                 SSI coll: shmem (API v1.1, Module v1.0)
                 SSI coll: smp (API v1.1, Module v1.2)
                  SSI rpi: crtcp (API v1.1, Module v1.1)
                  SSI rpi: lamd (API v1.0, Module v7.1)
                  SSI rpi: sysv (API v1.0, Module v7.1)
                  SSI rpi: tcp (API v1.0, Module v7.1)
                  SSI rpi: usysv (API v1.0, Module v7.1)
                   SSI cr: blcr (API v1.0, Module v1.1)
                   SSI cr: self (API v1.0, Module v1.0)
    My parallel code runs fine under LAM without checkpointing:
    $ mpirun -np 2 ./job
    Then I run my parallel job in a checkpointable way:
    $ mpirun -np 2 -ssi cr blcr ./rotating
    I then checkpoint the job from another window:
    $ lamcheckpoint -ssi cr blcr -pid 11928
    This operation produces a context file for mpirun
    plus two context files for the job.
    So far, so good :)
    However, when I restart the job with the context file:
    $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
    I get the following error:
    Results CORRECT on rank 0  [this line is output from my own code]
    MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, 
    Rank (0, MPI_COMM_WORLD): Call stack within LAM:
    Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
    Rank (0, MPI_COMM_WORLD):  - main()
    It seems that [at least] one of the processes that was started with
    mpirun did not invoke MPI_INIT before quitting (it is possible that
    more than one process did not invoke MPI_INIT -- mpirun was only
    notified of the first one, which was on node n0).
    mpirun can *only* be used with MPI programs (i.e., programs that
    invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
    to run non-MPI programs over the lambooted nodes.
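    For anyone who wants to repeat the checkpoint and restart steps above,
    here is a small sketch that only prints the two commands for a given
    mpirun PID; it executes nothing. The helper function names are made up
    for illustration, and the context.mpirun.<pid> file name in the home
    directory is an assumption based on the single run shown in this post.

```shell
#!/bin/sh
# Sketch only: prints the checkpoint and restart commands from this post
# for a given mpirun PID. Nothing is executed, so LAM is not required.

# Path of the context file written for mpirun.
# Assumption: it always follows the ~/context.mpirun.<pid> pattern seen above.
context_file() {
    echo "$HOME/context.mpirun.$1"
}

# Command used to checkpoint the running job (PID of mpirun as argument).
checkpoint_cmd() {
    echo "lamcheckpoint -ssi cr blcr -pid $1"
}

# Command used to restart the job from mpirun's context file.
restart_cmd() {
    echo "lamrestart -ssi cr blcr -ssi cr_blcr_context_file $(context_file "$1")"
}

# Default to the PID from this post when no argument is given.
pid="${1:-11928}"
checkpoint_cmd "$pid"
restart_cmd "$pid"
```

    Running it with the PID of mpirun as the first argument prints the two
    commands to copy into a shell.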
    Has anyone run into this problem before, and does anyone know how to solve it?
    Many thanks
