From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon Nov 09 2009 - 04:28:48 PST
Thanks, Josh! I'll reply to you on the Open MPI list, then.

Regards,
Sergio

Josh Hursey wrote:
> Since this is an Open MPI problem (rather than a BLCR problem) I
> replied to the similar message on the ompi-users list suggesting some
> options for further investigation. For those interested in this thread
> you can find it in the archives at the link below:
> http://www.open-mpi.org/community/lists/users/2009/11/11088.php
>
> -- Josh
>
> On Nov 3, 2009, at 4:32 AM, Sergio Díaz wrote:
>
>> Hello,
>>
>> I can checkpoint a simple program without SGE (for instance, on a
>> single compute node with 2 MPI processes). Now I'm trying to
>> integrate Open MPI with SGE, but I have some problems. When I try to
>> checkpoint the mpirun PID, I get an error similar to the one I get
>> when the PID doesn't exist. See the example below.
>>
>> Is there a script to do this automatically with SGE? For instance, to
>> checkpoint every X seconds with BLCR and non-MPI jobs, there is a
>> script that I adapted to my case. It is launched by SGE if you have
>> configured the queue and the ckpt environment.
>>
>> Is it possible to choose the name of the ckpt folder when you run
>> ompi-checkpoint? I can't find an option to do it.
>>
>> I found a C program to test ompi-checkpoint/restart and it works
>> fine. The program was written by Alan Woodland and shared on the
>> following mailing list: debian-bugs-dist_at_lists_dot_debian_dot_org
>> This program starts a countdown from 10 to 0 and, when the countdown
>> reaches 6, takes a checkpoint, kills the process and restarts it.
>>
>> Any ideas?
>>
>> Best regards
>> Sergio
>>
>>
>>> --------------------------------
>>>
>>> [sdiaz@compute-3-17 ~]$ ps auxf
>>> ....
>>> root 20044 0.0 0.0 4468 1224 ? S 13:28 0:00 \_ sge_shepherd-2645150 -bg
>>> sdiaz 20072 0.0 0.0 53172 1212 ? Ss 13:28 0:00 \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
>>> sdiaz 20112 0.2 0.0 41028 2480 ? S 13:28 0:00 \_ mpirun -np 2 -am ft-enable-cr pi3
>>> sdiaz 20113 0.0 0.0 36484 1824 ? Sl 13:28 0:00 \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
>>> sdiaz 20116 1.2 0.0 99464 4616 ? Sl 13:28 0:00 \_ pi3
>>>
>>>
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
>>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
>>>
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
>>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
>>>
>>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
>>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
>>>
>>> [sdiaz@compute-3-17 ~]$ exit
>>> logout
>>> Connection to c3-17 closed.
>>> [sdiaz@svgd mpi_test]$ ssh c3-18
>>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
>>> -bash-3.00$ ps auxf |grep sdiaz
>>>
>>> sdiaz 14412 0.0 0.0 1888 560 ? Ss 13:28 0:00 \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
>>>
>>> sdiaz 14419 0.0 0.0 35728 2260 ? S 13:28 0:00 \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
>>> sdiaz 14420 0.0 0.0 99452 4596 ? Sl 13:28 0:00 \_ pi3
>>>
>>>
>>
>> ------------------------------------------------
>> _______________________________________________
>> users mailing list
>> [email protected]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> --
>> Sergio Díaz Montes
>> Centro de Supercomputacion de Galicia
>> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
>> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
>> email: [email protected] ; http://www.cesga.es/
>> ------------------------------------------------
>
>

-- 
Sergio Díaz Montes
Centro de Supercomputacion de Galicia
Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
email: [email protected] ; http://www.cesga.es/
------------------------------------------------