Re: trying to integrate OpenMPI+BLCR+SGE

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon Nov 09 2009 - 04:28:48 PST

  • Next message: Paul H. Hargrove: "Re: Cumulative patch for BLCR-0.8.2 and recent kernels"
    Thanks, Josh!
    
    I'll reply to you on the Open MPI list, then.
    
    Regards,
    Sergio
    
    Josh Hursey wrote:
    > Since this is an Open MPI problem (rather than a BLCR problem) I 
    > replied to the similar message on the ompi-users list suggesting some 
    > options for further investigation. For those interested in this thread 
    > you can find it in the archives at the link below:
    >  http://www.open-mpi.org/community/lists/users/2009/11/11088.php
    >
    > -- Josh
    >
    > On Nov 3, 2009, at 4:32 AM, Sergio Díaz wrote:
    >
    >> Hello,
    >>
    >> I can checkpoint a simple program without SGE (for instance, on a 
    >> single compute node with 2 MPI processes). Now I'm trying to 
    >> integrate Open MPI with SGE, but I have some problems. When I try to 
    >> checkpoint the mpirun PID, I get an error similar to the one 
    >> returned when the PID doesn't exist. See the example below.
    >>
    >> Is there a script to do this automatically with SGE? For instance, to 
    >> checkpoint non-MPI jobs every X seconds with BLCR, there is a 
    >> script that I adapted to my case. SGE launches it if the queue and 
    >> the ckpt environment are configured.
    >>
    >> Is it possible to choose the name of the ckpt folder when running 
    >> ompi-checkpoint? I can't find an option for it.
    >>
    >> I found a C program to test ompi-checkpoint/restart and it works 
    >> fine. The program was written by Alan Woodland and shared on the 
    >> following mailing list: debian-bugs-dist_at_lists_dot_debian_dot_org
    >> This program counts down from 10 to 0 and, when the countdown 
    >> reaches 6, takes a checkpoint, kills the process and restarts it.
    >>
    >> Any ideas?
    >>
    >> Best regards
    >> Sergio
    >>
    >>
    >>> --------------------------------
    >>>
    >>> [sdiaz@compute-3-17 ~]$ ps auxf
    >>> ....
    >>> root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ sge_shepherd-2645150 -bg
    >>> sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
    >>> sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ mpirun -np 2 -am ft-enable-cr pi3
    >>> sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
    >>> sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              \_ pi3
    >>>
    >>>
    >>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
    >>> [compute-3-17.local:20124] HNP with PID 20112 Not found!
    >>>
    >>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
    >>> [compute-3-17.local:20135] HNP with PID 20112 Not found!
    >>>
    >>> [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
    >>> [compute-3-17.local:20136] HNP with PID 20112 Not found!
    >>>
    >>> [sdiaz@compute-3-17 ~]$ exit
    >>> logout
    >>> Connection to c3-17 closed.
    >>> [sdiaz@svgd mpi_test]$ ssh c3-18
    >>> Last login: Wed Oct 28 13:24:12 2009 from svgd.local
    >>> -bash-3.00$ ps auxf |grep sdiaz
    >>>
    >>> sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
    >>> sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
    >>> sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              \_ pi3
    >>>
    >>>
    >>
    >> ------------------------------------------------
    >> _______________________________________________
    >> users mailing list
    >> users@open-mpi.org
    >> http://www.open-mpi.org/mailman/listinfo.cgi/users
    >> -- 
    >> Sergio Díaz Montes
    >> Centro de Supercomputacion de Galicia
    >> Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    >> Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    >> email: sdiaz@cesga.es ; http://www.cesga.es/
    >> ------------------------------------------------
    >
    >
    >
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: sdiaz@cesga.es ; http://www.cesga.es/
    
    ------------------------------------------------
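
    The "checkpoint every X seconds" idea from the quoted message could be sketched roughly as follows. This is only an illustration, not the SGE checkpoint script mentioned above: the helper names `build_checkpoint_command` and `periodic_checkpoint` are hypothetical, the only command used is the plain `ompi-checkpoint <PID>` form shown in the transcript, and it assumes that a nonzero exit status means the checkpoint failed (as in the "HNP ... Not found!" case).

```python
import subprocess
import time
from typing import Callable, List


def build_checkpoint_command(mpirun_pid: int) -> List[str]:
    """Return the argv for `ompi-checkpoint <PID>`, as used in the transcript."""
    return ["ompi-checkpoint", str(mpirun_pid)]


def periodic_checkpoint(
    mpirun_pid: int,
    interval_s: float,
    count: int,
    run: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Request up to `count` checkpoints, `interval_s` seconds apart.

    Returns the exit codes collected so far. Stops at the first nonzero
    exit code (assumption: nonzero means the checkpoint did not succeed).
    `run` and `sleep` are injectable so the loop can be tried without
    Open MPI installed.
    """
    codes: List[int] = []
    for attempt in range(count):
        if attempt > 0:
            sleep(interval_s)  # wait between checkpoints, not before the first
        rc = run(build_checkpoint_command(mpirun_pid))
        codes.append(rc)
        if rc != 0:
            break  # no point in retrying blindly if mpirun cannot be reached
    return codes
```

    For example, `periodic_checkpoint(20112, 60, 5)` would request up to five checkpoints of the mpirun with PID 20112, one minute apart. It does not address the "HNP ... Not found!" error itself; it only automates the repeated call.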
    
