Re: trying to integrate OpenMPI+BLCR+SGE

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Mon Nov 09 2009 - 04:26:27 PST

    Hi Alan,
    
    With the -v option I get the same error, and now an extra error as
    well... Maybe I'll follow up on this with Josh, because it seems to be
    an Open MPI problem.
    
    
    [root@compute-3-18 ~]#
    root     15841  0.0  0.0  4468 1224 ?        S    12:56   0:00  \_ sge_shepherd-2726941 -bg
    sdiaz    15869  0.0  0.0 53164 1220 ?        Ss   12:56   0:00      \_ -bash /opt/cesga/sge62/default/spool/compute-3-18/job_scripts/2726941
    sdiaz    15900  0.0  0.0 41028 2480 ?        S    12:56   0:00          \_ mpirun -np 2 -am ft-enable-cr ./pi3
    sdiaz    15901  0.0  0.0 36484 1844 ?        Sl   12:56   0:00              \_ /opt/cesga/sge62/bin/lx24-x86/qrsh -inherit -nostdin -V compute-3-17.local
    sdiaz    15904  0.0  0.0 99464 4616 ?        Sl   12:56   0:00              \_ ./pi3
    
    [root@compute-3-17 ~]#
    root     29855  0.0  0.0 66132 1692 ?        Sl   12:56   0:00  \_ sge_shepherd-2726941 -bg
    sdiaz    29856  0.0  0.0  1888  560 ?        Ss   12:56   0:00      \_ /opt/cesga/sge62/utilbin/lx24-x86/qrsh_starter /opt/cesga/sge62/default/spool/comp
    sdiaz    29863  0.0  0.0 35728 2260 ?        S    12:56   0:00          \_ orted -mca ess env -mca orte_ess_jobid 2759065600 -mca orte_ess_vpid 1 -mca or
    sdiaz    29864  0.1  0.0 99452 4596 ?        Sl   12:56   0:00              \_ ./pi3
    
    [root@compute-3-18 ~]# ompi-checkpoint 15900
    [compute-3-18.local:15986] [[42010,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
    [compute-3-18.local:15986] HNP with PID 15900 Not found!
    [root@compute-3-18 ~]# ompi-checkpoint 15900 -v
    [compute-3-18.local:15987] [[42011,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
    [compute-3-18.local:15987] HNP with PID 15900 Not found!
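
Alan's point in the quoted reply below (this error resembles the one seen when the job was not started with "-am ft-enable-cr") can be checked mechanically before calling ompi-checkpoint. The following is only a minimal sketch, assuming a Linux /proc filesystem; has_ft_enable_cr and checkpoint_if_enabled are hypothetical helper names, not Open MPI tools, and the check only inspects the command line, so it cannot tell whether checkpoint setup actually succeeded at startup.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): succeeds if the given
# command-line string contains the option pair "-am ft-enable-cr".
has_ft_enable_cr() {
    case " $1 " in
        *" -am ft-enable-cr "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: read the command line of a PID from /proc
# (Linux only) and call ompi-checkpoint only if the process was
# started with "-am ft-enable-cr".
# Returns 2 if the PID cannot be read, 1 if the option is missing.
checkpoint_if_enabled() {
    pid=$1
    if [ ! -r "/proc/$pid/cmdline" ]; then
        echo "PID $pid not found" >&2
        return 2
    fi
    # /proc/<pid>/cmdline separates arguments with NUL bytes;
    # turn them into spaces so the string can be pattern-matched.
    cmdline=$(tr '\0' ' ' < "/proc/$pid/cmdline")
    if has_ft_enable_cr "$cmdline"; then
        ompi-checkpoint -v "$pid"
    else
        echo "PID $pid was not started with -am ft-enable-cr" >&2
        return 1
    fi
}
```

With the functions sourced into the shell, the call for the mpirun process shown above would be `checkpoint_if_enabled 15900`.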
    
    
    About the transport protocol: SGE can use qrsh or ssh to start the MPI
    processes on the compute nodes. Currently I'm using qrsh, and that could
    be the problem. I will test ssh instead.
    
    Thanks!
    
    Regards,
    Sergio.
    
    
    
    
    
    
    Alan Woodland wrote:
    > 2009/11/3 Sergio Díaz <sdiaz_at_cesga.es>
    >   
    >> I can checkpoint a simple program without SGE (just on one compute node with 2 MPI processes,
    >> for instance). Now I'm trying to integrate Open MPI with SGE, but I have some problems... When
    >> I try to checkpoint the mpirun PID, I get an error similar to the one I get when the PID
    >> doesn't exist. The example is below.
    >>     
    >
    > That error looks like the one you get when the job wasn't started with
    > "-am ft-enable-cr" passed to mpirun. Given that the output you pasted
    > shows "-am ft-enable-cr" was present, this leads me to suspect
    > that something went wrong during the startup of mpirun. Do you have
    > logs of std{out,err} from this at all? IIRC, if checkpointing setup
    > fails in Open MPI at startup for some reason, a few messages get printed
    > and things just carry on regardless. Is there anything helpful in the
    > verbose/debug output too?
    >
    >   
    >> Is there a script to do this automatically with SGE? For instance, to checkpoint non-MPI jobs
    >> with BLCR every X seconds, there is a script that I adapted to my case. SGE launches it if
    >> you have configured the queue and the ckpt environment.
    >>     
    >
    > I've never used SGE, only Condor, and I've never done MPI+BLCR+Condor
    > so I can't really help there I'm afraid. Is it possible SGE is making
    > mpi use a transport other than sm, tcp or self? I'm not sure if the
    > checkpointing code works with other transports.
    >
    >   
    >> Is it possible to choose the name of the ckpt folder when you run ompi-checkpoint? I can't find the
    >> option to do it.
    >>     
    >
    > I think mpirun --tmpdir might help with this one?
    >
    > [snip]
    >
    > Alan
    >
    >
    >   
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: [email protected] ; http://www.cesga.es/
    
    ------------------------------------------------
    
