Re: trying to integrate OpenMPI+BLCR+SGE

From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Fri Nov 06 2009 - 06:15:42 PST

    Since this is an Open MPI problem (rather than a BLCR problem) I  
    replied to the similar message on the ompi-users list suggesting some  
    options for further investigation. For those interested in this thread  
    you can find it in the archives at the link below:
      http://www.open-mpi.org/community/lists/users/2009/11/11088.php
    
    -- Josh
    
    On Nov 3, 2009, at 4:32 AM, Sergio Díaz wrote:
    
    > Hello,
    >
    > I can checkpoint a simple program without SGE (for instance, on one
    > compute node with 2 MPI processes). Now I'm trying to integrate
    > Open MPI with SGE, but I have some problems. When I try to
    > checkpoint the mpirun PID, I get an error similar to the one
    > reported when the PID doesn't exist. See the example below.
    >
    > Is there a script to do this automatically with SGE? For instance, to
    > checkpoint non-MPI jobs every X seconds with BLCR, there is a
    > script that I adapted to my case. SGE launches it if you have
    > configured the queue and the ckpt environment.
    >
    > Is it possible to choose the name of the ckpt folder when you run
    > ompi-checkpoint? I can't find an option for it.
    >
    > I found a C program to test ompi-checkpoint/restart and it works
    > fine. The program was written by Alan Woodland and shared on the
    > following mailing list: debian-bugs-dist_at_lists_dot_debian_dot_org
    > This program starts a countdown from 10 to 0 and, when the countdown
    > reaches 6, takes a checkpoint, kills the process, and restarts it.
    >
    > Any ideas?
    >
    > Best regards
    > Sergio
    >
    >
    >> --------------------------------
    >>
    >> [sdiaz@compute-3-17 ~]$ ps auxf
    >> ....
    >> root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ sge_shepherd-2645150 -bg
    >> sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
    >> sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ mpirun -np 2 -am ft-enable-cr pi3
    >> sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
    >> sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              \_ pi3
    >>
    >>
    >> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
    >> [compute-3-17.local:20124] HNP with PID 20112 Not found!
    >>
    >> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
    >> [compute-3-17.local:20135] HNP with PID 20112 Not found!
    >>
    >> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
    >> [compute-3-17.local:20136] HNP with PID 20112 Not found!
    >>
    >> [sdiaz@compute-3-17 ~]$ exit
    >> logout
    >> Connection to c3-17 closed.
    >> [sdiaz@svgd mpi_test]$ ssh c3-18
    >> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
    >> -bash-3.00$ ps auxf |grep sdiaz
    >>
    >> sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
    >> sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
    >> sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              \_ pi3
    >>
    >>
    >
    > ------------------------------------------------
    > -- 
    > Sergio Díaz Montes
    > Centro de Supercomputacion de Galicia
    > Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    > Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    > http://www.cesga.es/
    > ------------------------------------------------
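
On the question of checkpointing every X seconds: the following is only a minimal sketch of such a loop, not the SGE-provided script mentioned in the quoted message, and it does not address the "HNP with PID ... Not found!" error. The function names (build_checkpoint_command, checkpoint_periodically) and the run/sleep parameters are hypothetical, introduced for illustration. The only real command used is "ompi-checkpoint <PID of mpirun>", exactly as typed in the session above.

```python
import subprocess
import time


def build_checkpoint_command(mpirun_pid):
    """Return the argv used to checkpoint the job owned by this mpirun PID."""
    # Same form as in the quoted session: ompi-checkpoint <PID of mpirun>
    return ["ompi-checkpoint", str(mpirun_pid)]


def checkpoint_periodically(mpirun_pid, interval_s, rounds,
                            run=subprocess.call, sleep=time.sleep):
    """Wait interval_s seconds, take a checkpoint, and repeat up to `rounds` times.

    `run` and `sleep` are injectable (hypothetical parameters) so the loop
    can be exercised without Open MPI installed.
    Returns the list of exit codes; stops at the first non-zero code.
    """
    exit_codes = []
    for _ in range(rounds):
        sleep(interval_s)
        code = run(build_checkpoint_command(mpirun_pid))
        exit_codes.append(code)
        if code != 0:
            # A failed checkpoint (for example, mpirun not found) ends the loop.
            break
    return exit_codes
```

With the PID from the session above, checkpoint_periodically(20112, 60, 10) would attempt up to ten checkpoints, one per minute, and stop as soon as ompi-checkpoint returns a non-zero exit code.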
    
