From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon Nov 09 2009 - 04:26:27 PST
Hi Alan,

With the -v option I get the same error. But now I get an extra error... Maybe I'll follow up on this with Josh, because it seems to be an Open MPI problem.

[root@compute-3-18 ~]#
root  15841 0.0 0.0  4468 1224 ? S  12:56 0:00 \_ sge_shepherd-2726941 -bg
sdiaz 15869 0.0 0.0 53164 1220 ? Ss 12:56 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726941
sdiaz 15900 0.0 0.0 41028 2480 ? S  12:56 0:00 \_ mpirun -np 2 -am ft-enable-cr ./pi3
sdiaz 15901 0.0 0.0 36484 1844 ? Sl 12:56 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-17.local
sdiaz 15904 0.0 0.0 99464 4616 ? Sl 12:56 0:00 \_ ./pi3

[root@compute-3-17 ~]#
root  29855 0.0 0.0 66132 1692 ? Sl 12:56 0:00 \_ sge_shepherd-2726941 -bg
sdiaz 29856 0.0 0.0  1888  560 ? Ss 12:56 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/comp
sdiaz 29863 0.0 0.0 35728 2260 ? S  12:56 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2759065600 -mca orte_ess_vpid 1 -mca or
sdiaz 29864 0.1 0.0 99452 4596 ? Sl 12:56 0:00 \_ ./pi3

[root@compute-3-18 ~]# ompi-checkpoint 15900
[compute-3-18.local:15986] [[42010,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
[compute-3-18.local:15986] HNP with PID 15900 Not found!

[root@compute-3-18 ~]# ompi-checkpoint 15900 -v
[compute-3-18.local:15987] [[42011,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
[compute-3-18.local:15987] HNP with PID 15900 Not found!

About the transport protocol: SGE can use either qrsh or ssh to start the MPI processes on the compute nodes. Currently I'm using qrsh, and that could be the problem. I will test the other one.

Thanks!

Regards,
Sergio.

Alan Woodland wrote:
> 2009/11/3 Sergio Díaz <[email protected]>
>
>> I can checkpoint a simple program without SGE (on just one compute node with 2 MPI processes,
>> for instance). Now I'm trying to integrate Open MPI with SGE, but I have some problems... When
>> I try to checkpoint the mpirun PID, I get an error similar to the one I get when the PID
>> doesn't exist. See the example below.
>>
>
> That error looks like the error you get when the job wasn't started with
> "-am ft-enable-cr" passed to mpirun. Given that the output you pasted
> shows "-am ft-enable-cr" was present, this would lead me to suspect
> that something went wrong during the startup of mpirun. Do you have
> logs of std{out,err} from this at all? IIRC, if checkpointing setup
> fails in Open MPI at startup for some reason, a few messages get printed
> and things just carry on regardless. Is there anything helpful in a
> verbose/debug output too?
>
>
>> Is there a script to do it automatically with SGE? For instance, to checkpoint every X seconds
>> with BLCR and non-MPI jobs, there is a script that I adapted to my case. It is launched by SGE if
>> you have configured the queue and the ckpt environment.
>>
>
> I've never used SGE, only Condor, and I've never done MPI+BLCR+Condor,
> so I can't really help there, I'm afraid. Is it possible SGE is making
> MPI use a transport other than sm, tcp or self? I'm not sure if the
> checkpointing code works with other transports.
>
>
>> Is it possible to choose the name of the ckpt folder when you run ompi-checkpoint? I can't find the
>> option to do it.
>>
>
> I think mpirun --tmpdir might help with this one?
>
> [snip]
>
> Alan
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur)
15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/
------------------------------------------------