problem: checkpoint lam/mpi with BLCR


From: Yuan Wan (ywan_at_ed.ac.uk)
Date: Mon Mar 26 2007 - 07:46:54 PST


Hi all,

I have run into a problem when checkpointing LAM/MPI code using BLCR.

My platform is a 2-CPU machine running Fedora Core 6 (kernel 2.6.19).
I have built blcr-0.5.0, and it works well with serial codes.

I built LAM/MPI 7.1.2 as follows:
---------------------------------------------
$ ./configure --prefix=/home/pst/lam \
             --with-rsh="ssh -x" \
             --with-cr-blcr=/home/pst/blcr
$ make
$ make install
---------------------------------------------

The laminfo output is:
-----------------------------------------------------
              LAM/MPI: 7.1.2
               Prefix: /home/pst/lam
         Architecture: i686-pc-linux-gnu
        Configured by: pst
        Configured on: Sat Mar 24 00:40:42 GMT 2007
       Configure host: master00
       Memory manager: ptmalloc2
           C bindings: yes
         C++ bindings: yes
     Fortran bindings: yes
           C compiler: gcc
         C++ compiler: g++
     Fortran compiler: g77
      Fortran symbols: double_underscore
          C profiling: yes
        C++ profiling: yes
    Fortran profiling: yes
       C++ exceptions: no
       Thread support: yes
        ROMIO support: yes
         IMPI support: no
        Debug support: no
         Purify clean: no
             SSI boot: globus (API v1.1, Module v0.6)
             SSI boot: rsh (API v1.1, Module v1.1)
             SSI boot: slurm (API v1.1, Module v1.0)
             SSI coll: lam_basic (API v1.1, Module v7.1)
             SSI coll: shmem (API v1.1, Module v1.0)
             SSI coll: smp (API v1.1, Module v1.2)
              SSI rpi: crtcp (API v1.1, Module v1.1)
              SSI rpi: lamd (API v1.0, Module v7.1)
              SSI rpi: sysv (API v1.0, Module v7.1)
              SSI rpi: tcp (API v1.0, Module v7.1)
              SSI rpi: usysv (API v1.0, Module v7.1)
               SSI cr: blcr (API v1.0, Module v1.1)
               SSI cr: self (API v1.0, Module v1.0)
--------------------------------------------------------
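
To script the same check rather than reading the listing by eye, a minimal sketch follows. It assumes laminfo prints SSI modules in the "SSI <kind>: <module> (...)" format shown above; the function name has_ssi_module is hypothetical and is not part of LAM/MPI.

```shell
# Hypothetical helper (not part of LAM/MPI): read laminfo-style output on
# stdin and succeed if the given SSI kind/module pair is listed.
has_ssi_module() {
    kind=$1     # SSI kind, e.g. "cr" or "rpi"
    module=$2   # module name, e.g. "blcr" or "crtcp"
    # Match lines such as "   SSI cr: blcr (API v1.0, Module v1.1)".
    grep -Eq "^[[:space:]]*SSI ${kind}: ${module} \("
}

# Usage, assuming laminfo is on PATH:
#   laminfo | has_ssi_module cr blcr && echo "blcr cr module present"
```

With the output above, both "cr blcr" and "rpi crtcp" would be reported as present.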


My parallel code runs correctly under LAM when no checkpointing is involved:
$ mpirun -np 2 ./job

Then I run my parallel job in a checkpointable way:
$ mpirun -np 2 -ssi cr blcr ./rotating

and checkpoint the job from another window:
$ lamcheckpoint -ssi cr blcr -pid 11928

This operation produces a context file for mpirun:

"context.mpirun.11928"

plus two context files for the job

"context.11928-n0-11929"
"context.11928-n0-11930"
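
Before attempting a restart, it can help to confirm that every expected context file is present. The sketch below assumes the naming pattern seen in this one run (context.mpirun.<mpirun pid> plus context.<mpirun pid>-n<node>-<process pid>) holds in general, which is an inference and not something I have confirmed; check_context_files is a hypothetical helper.

```shell
# Hypothetical helper: verify that the mpirun context file exists and count
# the per-process context files. The naming pattern is inferred from a
# single run and is an assumption, not documented behaviour.
check_context_files() {
    dir=$1          # directory holding the context files
    mpirun_pid=$2   # PID that was passed to lamcheckpoint
    expected=$3     # number of MPI processes (the -np value)

    if [ ! -f "$dir/context.mpirun.$mpirun_pid" ]; then
        echo "missing context.mpirun.$mpirun_pid" >&2
        return 1
    fi

    # Count files named context.<mpirun pid>-n<node>-<process pid>.
    count=0
    for f in "$dir"/context."$mpirun_pid"-n*-*; do
        [ -f "$f" ] && count=$((count + 1))
    done

    if [ "$count" -ne "$expected" ]; then
        echo "found $count process context files, expected $expected" >&2
        return 1
    fi
    echo "all context files present"
}
```

For the run above, with the files in the home directory, the call would be: check_context_files "$HOME" 11928 2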

So far, so good :)
-------------------------------------------------------

However, when I restart the job with the context file:
$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928

I get the following error:

Results CORRECT on rank 0  [this line is output from my own code]

MPI_Finalize: internal MPI error: Invalid argument (rank 137389200, MPI_COMM_WORLD)
Rank (0, MPI_COMM_WORLD): Call stack within LAM:
Rank (0, MPI_COMM_WORLD):  - MPI_Finalize()
Rank (0, MPI_COMM_WORLD):  - main()
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------

Has anyone met this problem before, and does anyone know how to solve it?

Many Thanks


--Yuan


Yuan Wan
-- 
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: [email protected]

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ
