From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 20 2007 - 10:38:35 PDT
Patrice,

In general, questions involving checkpointing with MVAPICH2 should be sent to the MVAPICH2 folks (you may have also done that, I am not certain). However, in this case I am pretty sure I can guess the most likely problems.

My guess is that LD_LIBRARY_PATH and/or LD_PRELOAD have been set on the "front end" but not on the machines where mpd will spawn the MPI application processes, or they have been set in a manner (such as .login files) that mpd is not reading on the "remote" nodes. I understand that in your case this is a single machine, but it is still possible that the environment variables are not set in the context of the mpd daemons that actually spawn MPI application processes. You may try a command like
    mpdrun -n 1 env | grep LD_
to see what values (if any) the LD_LIBRARY_PATH and LD_PRELOAD variables have in processes spawned by mpd.

If my guess above isn't correct or doesn't provide enough information for you to resolve the problem, then you will need to ask the MVAPICH2 (or MPICH2) folks about how mpd is handling the environment for the spawned processes.

Sorry I cannot be of more help, but since ldd finds the libraries I can only assume the problem is related to how mpd is starting the processes.

-Paul

Patrice Martinez wrote:
> Hello,
> I'm trying to run linpack benchmark using blcr and mvapich2 (and
> Infiniband).
>
> I'm using:
> blcr-0.6.0,
> mvapich2-1.0 compiled with blcr support
> OFED-1.2.5.1,
> linpack linked with pvapich, and ofed libs
>
> I'm using (for this test) a single em64t computer, running a 2.6.21
> kernel above a RHEL U4:
> uname -a
> Linux twing 2.6.21.5 #1 SMP Wed Jun 13 10:29:09 CEST 2007 x86_64
> x86_64 x86_64 GNU/Linux
>
> BLCR is compiled with this kernel, the modules are inserted, and the
> following env vars are set as follows:
>
> echo $LD_LIBRARY_PATH
> /opt/intel/fce/10.0.023/lib:/opt/intel/cc/9.1.039/lib::/usr/local/lib
>
> export LD_PRELOAD=/usr/local/lib/libcr.so.0:/lib64/tls/libpthread.so.0
>
>
> I started mpdboot:
> mpdboot --ncpus=4
>
> Then I try to run linpack:
> mpiexec -n 4 ./xhpl
> /usr/local/bin/mpdroot: error while loading shared libraries:
> libcr.so.0: cannot open shared object file: No such file or directory
> mpiexec_twing (__init__ 1171): forked process failed; status=127
> CTRL+C Caught... exiting
>
> It doesn't work, however, the libcr is located:
>
> ldd /usr/local/bin/mpdroot
> /usr/local/lib/libcr.so (0x0000002a95557000)
> /lib64/tls/libpthread.so.0 (0x000000323fa00000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x0000002a95690000)
> libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x0000002a9579b000)
> libc.so.6 => /lib64/tls/libc.so.6 (0x000000323ef00000)
> libdl.so.2 => /lib64/libdl.so.2 (0x000000323ed00000)
> /lib64/ld-linux-x86-64.so.2 (0x000000323eb00000)
> libibcommon.so.1 => /usr/lib64/libibcommon.so.1
> (0x0000002a958a6000)
>
>
> ldd ./xhpl
> libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x000000323fa00000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00002b4a28296000)
> libibumad.so.1 => /usr/lib64/libibumad.so.1 (0x00002b4a283a1000)
> libcr.so.0 => /usr/local/lib/libcr.so.0 (0x00002b4a284ab000)
> libc.so.6 => /lib64/tls/libc.so.6 (0x000000323ef00000)
> /lib64/ld-linux-x86-64.so.2 (0x000000323eb00000)
> libdl.so.2 => /lib64/libdl.so.2 (0x000000323ed00000)
> libibcommon.so.1 => /usr/lib64/libibcommon.so.1
> (0x00002b4a285b4000)
>
>
> Any idea about what's wrong with this?
>
> Linpack runs well if linked with a release of mvapich2 compiled
> without blcr support.
> --
>
> Cordialement/Best regards
>
> Patrice Martinez
>
> Linux Kernel Architect.
> Bull, Architect of an Open World
>
> OFFICE : B1-405
> PHONE : +33 (0)4 76 29 74 69
> EMAIL : Patrice.martinez_at_bull_dot_net
> ADDR : BULL, 1 rue de Provence, BP 208, 38432 Echirolles Cedex, FRANCE
>
> Bull is hiring : http://www.bull.fr/emploi
>
>

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
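
The check suggested in the reply (looking at LD_LIBRARY_PATH and LD_PRELOAD as seen by a process that mpd spawns) can be taken one step further by comparing that output against the launching shell's own environment. The following is only a minimal sketch, not part of the original exchange: the function names `ld_vars` and `compare_ld_env` and the /tmp file names are hypothetical, and it assumes each variable occupies a single line of `env` output.

```shell
#!/bin/sh
# Sketch only: helper names and file paths below are illustrative assumptions.

# Read an `env` dump on stdin and print only the LD_* variables, sorted.
# `|| true` keeps an empty result (no LD_* variables) from counting as an error.
ld_vars() {
    { grep '^LD_' || true; } | sort
}

# Compare the LD_* variables found in two saved `env` dumps.
# $1: dump taken in the launching shell; $2: dump taken from a spawned process.
# Returns 0 if the LD_* settings are identical, 1 otherwise.
compare_ld_env() {
    cmp_shell=$(ld_vars < "$1")
    cmp_spawned=$(ld_vars < "$2")
    if [ "$cmp_shell" = "$cmp_spawned" ]; then
        echo "LD_* variables match"
        return 0
    fi
    echo "LD_* variables differ"
    echo "--- launching shell:"
    echo "$cmp_shell"
    echo "--- spawned process:"
    echo "$cmp_spawned"
    return 1
}

# Possible usage, with the mpdrun command from the reply:
#   env > /tmp/shell.env
#   mpdrun -n 1 env > /tmp/mpd.env
#   compare_ld_env /tmp/shell.env /tmp/mpd.env
```

If the spawned-process section comes back empty or lacks /usr/local/lib/libcr.so.0, that would be consistent with the guess that the mpd daemons were started without these variables; how to pass them through mpd is a question for the MVAPICH2 or MPICH2 lists.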