trying to integrate OpenMPI+BLCR+SGE

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Tue Nov 03 2009 - 03:32:31 PST

    Hello,
    
    I can checkpoint a simple program without SGE (for instance, on a single
    compute node with 2 MPI processes). Now I'm trying to integrate Open MPI
    with SGE, but I have some problems. When I try to checkpoint the mpirun
    PID, I get an error similar to the one I get when the PID doesn't exist.
    See the example below.
    
    Is there a script to do this automatically with SGE? For instance, to
    checkpoint non-MPI jobs every X seconds with BLCR, there is a script
    that I adapted to my case. SGE launches it if you have configured the
    queue and the ckpt environment.
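    
    To illustrate the kind of periodic loop meant here, a minimal sketch
    follows. It is not the adapted script itself: the function name
    periodic_checkpoint, the CKPT_CMD override and the optional third
    argument are hypothetical, and it assumes ompi-checkpoint is called
    with the mpirun PID as in the transcript further down.

```shell
# Hypothetical helper, not an SGE or Open MPI tool: checkpoint PID every
# INTERVAL seconds while the process exists, at most MAX times (0 = no limit).
# CKPT_CMD is an assumed override hook; it defaults to ompi-checkpoint.
periodic_checkpoint() {
    pid=$1
    interval=$2
    max=${3:-0}
    ckpt_cmd=${CKPT_CMD:-ompi-checkpoint}
    count=0

    # kill -0 delivers no signal; it only reports whether the PID exists.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # Re-check so a job that ended during the sleep is not checkpointed.
        kill -0 "$pid" 2>/dev/null || break
        $ckpt_cmd "$pid" || echo "checkpoint of PID $pid failed" >&2
        count=$((count + 1))
        if [ "$max" -gt 0 ] && [ "$count" -ge "$max" ]; then
            break
        fi
    done
}
```

    For example, "periodic_checkpoint 20112 600" would try a checkpoint of
    PID 20112 every 10 minutes until that process goes away.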
    
    Is it possible to choose the name of the ckpt folder when running
    ompi-checkpoint? I can't find an option for it.
    
    I found a C program to test ompi-checkpoint/restart and it works fine.
    The program was written by Alan Woodland and shared on the following
    mailing list: debian-bugs-dist_at_lists_dot_debian_dot_org
    This program counts down from 10 to 0; when the countdown reaches 6, it
    takes a checkpoint, kills the process and restarts it.
    
    Any ideas?
    
    Best regards
    Sergio
    
    
    > --------------------------------
    >
    > [sdiaz@compute-3-17 ~]$ ps auxf
    > ....
    > root     20044  0.0  0.0  4468 1224 ?        S    13:28   0:00  \_ sge_shepherd-2645150 -bg
    > sdiaz    20072  0.0  0.0 53172 1212 ?        Ss   13:28   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-17/job_scripts/2645150
    > sdiaz    20112  0.2  0.0 41028 2480 ?        S    13:28   0:00          \_ mpirun -np 2 -am ft-enable-cr pi3
    > sdiaz    20113  0.0  0.0 36484 1824 ?        Sl   13:28   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-18..........
    > sdiaz    20116  1.2  0.0 99464 4616 ?        Sl   13:28   0:00              \_ pi3
    >
    >
    > [sdiaz@compute-3-17 ~]$ ompi-checkpoint 20112
    > [compute-3-17.local:20124] HNP with PID 20112 Not found!
    >
    > [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s 20112
    > [compute-3-17.local:20135] HNP with PID 20112 Not found!
    >
    > [sdiaz@compute-3-17 ~]$ ompi-checkpoint -s --term 20112
    > [compute-3-17.local:20136] HNP with PID 20112 Not found!
    >
    > [sdiaz@compute-3-17 ~]$ exit
    > logout
    > Connection to c3-17 closed.
    > [sdiaz@svgd mpi_test]$ ssh c3-18
    > Last login: Wed Oct 28 13:24:12 2009 from svgd.local
    > -bash-3.00$ ps auxf |grep sdiaz
    >
    > sdiaz    14412  0.0  0.0  1888  560 ?        Ss   13:28   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/compute-3-18/active_jobs/2645150.1/1.compute-3-18
    > sdiaz    14419  0.0  0.0 35728 2260 ?        S    13:28   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2295267328 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri 2295267328.0;tcp://192.168.4.144:36596 -mca mca_base_param_file_prefix ft-enable-cr -mca mca_base_param_file_path /opt/cesga/openmpi-1.3.3/share/openmpi/amca-param-sets:/home_no_usc/cesga/sdiaz/mpi_test -mca mca_base_param_file_path_force /home_no_usc/cesga/sdiaz/mpi_test
    > sdiaz    14420  0.0  0.0 99452 4596 ?        Sl   13:28   0:00              \_ pi3
    >
    >
    
    ------------------------------------------------
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: sdiaz@cesga.es ; http://www.cesga.es/
    
    ------------------------------------------------
    
