From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Fri Nov 06 2009 - 06:15:42 PST
Since this is an Open MPI problem (rather than a BLCR problem) I replied
to the similar message on the ompi-users list suggesting some options
for further investigation. For those interested in this thread you can
find it in the archives at the link below:
http://www.open-mpi.org/community/lists/users/2009/11/11088.php

-- Josh

On Nov 3, 2009, at 4:32 AM, Sergio Díaz wrote:

> Hello,
>
> I can do checkpointing of an easy program without SGE (just in one
> compute with 2 mpi process for instance). Now, I'm trying to do the
> integration openmpi+sge but I have some problems... When I try to do
> checkpoint of the mpirun PID, I got an error similar to the error
> gotten when the PID doesn't exit. The example below.
>
> There is a script to do it automatic with SGE?. For instance, to do
> checkpointing each X seconds with BLCR and non-mpi jobs, there is an
> script that I adapted to my case. It is launched by SGE if you have
> configured the queue and the ckpt environment.
>
> Is it possible choose the name of the ckpt folder when you do the
> ompi-checkpoint? I can't find the option to do it.
>
> I found a C program to test ompi-checkpoint/restart and it works
> fine. The program was written by Alan Woodland and shared in the
> following distribution list: debian-bugs-dist_at_lists_dot_debian_dot_org
> This program starts a countdown from 10 to 0 and when the countdown
> is 6, do a checkpoint, kill the process and restart the process.
>
> Any ideas?
>
> Best regards
> Sergio
>
>
>> --------------------------------
>>
>> [sdiaz@compute-3-17 ~]$ ps auxf
>> ....
>> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_
>> sge_shepherd-2645150 -bg
>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00
>> \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/
>> 2645150
>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28
>> 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28
>> 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -
>> nostdin -V compute-3-18..........
>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28
>> 0:00 \_ pi3
>>
>>
>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>
>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>
>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>
>> [sdiaz@compute-3-17 ~]$ exit
>> logout
>> Connection to c3-17 closed.
>> [sdiaz@svgd mpi_test]$ ssh c3-18
>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>> -bash-3.00$ ps auxf |grep sdiaz
>>
>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00
>> \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/
>> default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28
>> 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -
>> mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 2295267328.0;tcp://192.168.4.144:36596 -mca
>> mca_base_param_file_prefix ft-enable-cr -mca
>> mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/
>> amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca
>> mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28
>> 0:00 \_ pi3
>>
>>
>
> ------------------------------------------------
> _______________________________________________
> users mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> --
> Sergio Díaz Montes
> Centro de Supercomputacion de Galicia
> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
> email: [email protected] ; http://www.cesga.es/
> ------------------------------------------------