Re: lam/mpi blcr problem

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Mar 22 2005 - 09:05:29 PST

  • Next message: Teemu Koponen: "About the planned features of BLCR (post 0.4.0)"
    I am sorry to hear that you are having problems.  Let's see if we can help.
    
    As far as I can tell your LAM configuration is OK, but I am cc:ing this 
    to one of the LAM developers who may be able to spot something I could not.
    
    Have you tried 'make check' in the blcr build directory, or 
    checkpointing/restarting some of the non-MPI examples in blcr's examples 
    directory?  It would be good to know that the blcr build was OK before 
    bringing LAM into the mix.
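
A rough sketch of that sanity check for a single non-MPI process follows. It assumes cr_checkpoint is on the PATH, that it accepts a PID, and that it writes context.<pid> into the current directory (the naming seen in the quoted cr_restart command below). The function names are hypothetical helpers, not part of blcr, and the target process must have been started with blcr support.

```shell
#!/bin/sh
# Sketch only: confirm that checkpointing one non-MPI process leaves a context file.

# Default context file name for a PID (the "context.<pid>" form seen in this thread).
context_file() {
    echo "context.$1"
}

# Succeeds if a non-empty context file for PID exists in DIR (default: current dir).
have_context() {
    [ -s "${2:-.}/$(context_file "$1")" ]
}

# Checkpoint a running process and verify that its context file appeared.
# Assumption: the process is checkpointable (started with blcr support).
checkpoint_and_verify() {
    pid=$1
    cr_checkpoint "$pid" || { echo "cr_checkpoint failed for $pid" >&2; return 1; }
    have_context "$pid" || { echo "no context file for $pid" >&2; return 1; }
    echo "checkpoint OK: $(context_file "$pid")"
}

# Run only when a PID is given on the command line.
if [ "$#" -ge 1 ]; then
    checkpoint_and_verify "$1"
fi
```

If 'make check' passes and this reports a context file for one of the example programs, running cr_restart on that file should complete the round trip before LAM is involved.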
    
    When LAM ran the MPI application, was blcr installed (and the kernel 
    modules loaded) on all the compute nodes running the MPI job?
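
One way to answer that question is to look at /proc/modules on every node. The sketch below assumes passwordless ssh to each node, that the blcr kernel module names contain the string "blcr" (exact names may differ between releases), and that the host file lists one host name at the start of each line, as a LAM boot schema does. All function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: report whether a blcr kernel module is loaded on each node.

# Succeeds if the /proc/modules-style text on stdin lists a module
# whose name contains "blcr" (assumed naming).
has_blcr_module() {
    awk '$1 ~ /blcr/ { found = 1 } END { exit found ? 0 : 1 }'
}

# Print host names from a boot-schema-style file on stdin,
# skipping blank lines and "#" comments.
list_hosts() {
    sed -e 's/#.*//' | awk 'NF { print $1 }'
}

# Check every host named in the given file.
# Note: an unreachable host is also reported as "NOT loaded".
check_nodes() {
    list_hosts < "$1" | while read -r host; do
        # -n keeps ssh from consuming the host list on stdin.
        if ssh -x -n "$host" cat /proc/modules | has_blcr_module; then
            echo "$host: blcr module loaded"
        else
            echo "$host: blcr module NOT loaded"
        fi
    done
}

# Run only when a host file is given on the command line.
if [ "$#" -ge 1 ]; then
    check_nodes "$1"
fi
```

Any node reported as NOT loaded is a candidate explanation for missing per-process context files, though this check alone does not prove it.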
    
    -Paul
    
    ??? wrote:
    
    >I cannot use blcr to checkpoint an MPI program. Who can help me?
    >
    >I used the following command to configure blcr:
    >
    >./configure --prefix=/usr/local/blcr/ --with-linux=/usr/src/linux-2.4.20-8/
    >--with-system-map=/boot/System.map-2.4.20-8
    >
    >and used the following command to configure the lam/mpi:
    >
    >./configure --with-threads=posix --with-rpi=crtcp --with-cr-blcr=/usr/local/blcr/
    >--prefix=/usr/local/lam-7.1.1/ --with-rsh='ssh -x' 
    >
    >but when I use cr_checkpoint on an MPI program, it doesn't generate
    >a checkpoint context file for each process; it only generates a context file for 
    >the mpirun command, and when I run cr_restart on that single context file, it says
    >
    >[rmingming@node01 lam]$ cr_restart context.5981
    >mpirun (rpwait): Bad file descriptor
    >[rmingming@node01 lam]$
    >
    >By the way, I followed the instructions at this URL:
    >http://mantis.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html
    >
    >the following is the laminfo output:
    > 
    >[rmingming@node01 lam]$ laminfo -all
    >             LAM/MPI: 7.1.1
    >            SSI boot: globus (SSI v1.0, API v1.1, Module v0.6)
    >            SSI boot: rsh (SSI v1.0, API v1.1, Module v1.1)
    >            SSI boot: slurm (SSI v1.0, API v1.1, Module v1.0)
    >            SSI boot: tm (SSI v1.0, API v1.1, Module v1.1)
    >            SSI coll: lam_basic (SSI v1.0, API v1.1, Module v7.1)
    >            SSI coll: shmem (SSI v1.0, API v1.1, Module v1.0)
    >            SSI coll: smp (SSI v1.0, API v1.1, Module v1.2)
    >             SSI rpi: crtcp (SSI v1.0, API v1.1, Module v1.1)
    >             SSI rpi: lamd (SSI v1.0, API v1.0, Module v7.1)
    >             SSI rpi: sysv (SSI v1.0, API v1.0, Module v7.1)
    >             SSI rpi: tcp (SSI v1.0, API v1.0, Module v7.1)
    >             SSI rpi: usysv (SSI v1.0, API v1.0, Module v7.1)
    >              SSI cr: blcr (SSI v1.0, API v1.0, Module v1.1)
    >              SSI cr: self (SSI v1.0, API v1.0, Module v1.0)
    >              Prefix: /usr/local/lam-7.1.1/
    >              Bindir: /usr/local/lam-7.1.1//bin
    >              Libdir: /usr/local/lam-7.1.1//lib
    >              Incdir: /usr/local/lam-7.1.1//include
    >           Pkglibdir: /usr/local/lam-7.1.1//lib/lam
    >          Sysconfdir: /usr/local/lam-7.1.1//etc
    >        Architecture: i686-pc-linux-gnu
    >       Configured by: root
    >       Configured on: Tue Mar 22 14:21:29 CST 2005
    >      Configure host: node01
    >      Memory manager: ptmalloc2
    >          C bindings: yes
    >        C++ bindings: yes
    >    Fortran bindings: yes
    >          C compiler: gcc
    >         C char size: 1
    >         C bool size: 1
    >        C short size: 2
    >          C int size: 4
    >         C long size: 4
    >        C float size: 4
    >       C double size: 8
    >      C pointer size: 4
    >        C char align: 1
    >        C bool align: 1
    >         C int align: 4
    >       C float align: 4
    >      C double align: 4
    >        C++ compiler: g++
    >    Fortran compiler: g77
    >     Fortran symbols: double_underscore
    >   Fort integer size: 4
    >      Fort real size: 4
    >  Fort dbl prec size: 4
    >      Fort cplx size: 4
    >  Fort dbl cplx size: 4
    >  Fort integer align: 4
    >     Fort real align: 4
    > Fort dbl prec align: 4
    >     Fort cplx align: 4
    > Fort dbl cplx align: 4
    >         C profiling: yes
    >       C++ profiling: yes
    >   Fortran profiling: yes
    >      C++ exceptions: no
    >      Thread support: yes
    >       ROMIO support: yes
    >        IMPI support: no
    >       Debug support: no
    >        Purify clean: no
    >            SSI base: parameter "verbose" (default value: <none>)
    >             SSI mpi: parameter "mpi_hostmap" (default value:
    >                      "/usr/local/lam-7.1.1//etc/lam-hostmap.txt")
    >            SSI base: parameter "base_module_path" (default value:
    >                      "/usr/local/lam-7.1.1//lib/lam")
    >            SSI boot: parameter "boot_verbose" (default value: <none>)
    >            SSI boot: parameter "boot" (default value: <none>)
    >            SSI boot: parameter "boot_base_promisc" (default value: "0")
    >            SSI boot: parameter "boot_base_window_size" (default value: "5")
    >            SSI boot: parameter "boot_globus_priority" (default value: "3")
    >            SSI boot: parameter "boot_rsh_username" (default value: <none>)
    >            SSI boot: parameter "boot_rsh_agent" (default value: "ssh -x")
    >            SSI boot: parameter "boot_rsh_no_n" (default value: "0")
    >            SSI boot: parameter "boot_rsh_no_profile" (default value: "0")
    >            SSI boot: parameter "boot_rsh_fast" (default value: "0")
    >            SSI boot: parameter "boot_rsh_ignore_stderr" (default value: "0")
    >            SSI boot: parameter "boot_rsh_priority" (default value: "10")
    >            SSI boot: parameter "boot_slurm_priority" (default value: "50")
    >            SSI boot: parameter "boot_tm_priority" (default value: "50")
    >            SSI boot: parameter "boot_tm_first" (default value: "-1")
    >             SSI rpi: parameter "rpi_verbose" (default value: <none>)
    >             SSI rpi: parameter "rpi" (default value: <none>)
    >             SSI rpi: parameter "rpi_crtcp_priority" (default value: "75")
    >             SSI rpi: parameter "rpi_crtcp_short" (default value: "65536")
    >             SSI rpi: parameter "rpi_crtcp_sockbuf" (default value: "-1")
    >             SSI rpi: parameter "rpi_lamd_priority" (default value: "20")
    >             SSI rpi: parameter "rpi_sysv_pollyield" (default value: "1")
    >             SSI rpi: parameter "rpi_sysv_poolsize" (default value:
    >                      "16777216")
    >             SSI rpi: parameter "rpi_sysv_maxalloc" (default value:
    >                      "1048576")
    >             SSI rpi: parameter "rpi_sysv_short" (default value: "8192")
    >             SSI rpi: parameter "rpi_tcp_short" (default value: "65536")
    >             SSI rpi: parameter "rpi_tcp_sockbuf" (default value: "-1")
    >             SSI rpi: parameter "rpi_sysv_priority" (default value: "30")
    >             SSI rpi: parameter "rpi_tcp_priority" (default value: "20")
    >             SSI rpi: parameter "rpi_usysv_readlockpoll" (default value:
    >                      "10000")
    >             SSI rpi: parameter "rpi_usysv_writelockpoll" (default value:
    >                      "10")
    >             SSI rpi: parameter "rpi_usysv_pollyield" (default value: "1")
    >             SSI rpi: parameter "rpi_usysv_poolsize" (default value:
    >                      "16777216")
    >             SSI rpi: parameter "rpi_usysv_maxalloc" (default value:
    >                      "1048576")
    >             SSI rpi: parameter "rpi_usysv_short" (default value: "8192")
    >             SSI rpi: parameter "rpi_usysv_priority" (default value: "40")
    >            SSI coll: parameter "coll_verbose" (default value: <none>)
    >            SSI coll: parameter "coll_shmem" (default value: "0")
    >              SSI cr: parameter "cr_verbose" (default value: <none>)
    >              SSI cr: parameter "cr" (default value: <none>)
    >              SSI cr: parameter "cr_blcr_priority" (default value: "50")
    >              SSI cr: parameter "cr_self_priority" (default value: "25")
    >              SSI cr: parameter "cr_self_do_restart" (default value: "0")
    >              SSI cr: parameter "cr_self_prefix" (default value:
    >                      "lam_cr_self")
    >              SSI cr: parameter "cr_self_checkpoint" (default value: <none>)
    >              SSI cr: parameter "cr_self_continue" (default value: <none>)
    >              SSI cr: parameter "cr_self_restart" (default value: <none>)
    >
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
