checkpointing (OpenMP) multithreaded applications within SGE

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Thu Oct 30 2008 - 06:46:37 PST

    We are using BLCR checkpointing with SGE to migrate
    and restart jobs.
    
    The scripts we use are slightly modified versions of
    the ones suggested in the corresponding integration document
    (especially after version 0.7).
    The scripts used for migration and for restart basically
    determine the PID of the running process to be checkpointed using
    pstree. The modified script (for blcr >= 0.7) has a line like this:
    
    pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | awk -F "(" \
        '{ print $NF }' | awk -F ")" '{ print $1 }'
    
    Here, $SGE_PID is the PID of the SGE execution shell. This line, as
    is, works fine to get the PID of a "serial" running process which should
    be checkpointed, killed and restarted/migrated (I had to modify it
    because cr_restart is threaded in the new releases, which the original
    script did not take into account).
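For illustration, the three stages of the pipeline above can be wrapped in a small function that reads `pstree -p` output from stdin, which makes each step easier to follow. This is only a sketch: the function name is hypothetical and not part of the SGE or BLCR scripts, and it assumes process names contain no whitespace or parentheses.

```shell
# Hypothetical wrapper (name chosen for illustration only): reads the
# output of `pstree -p` on stdin and prints the last PID on its first line.
last_pid_on_first_line() {
    awk 'BEGIN { RS="" }; { print $1 }' |   # paragraph mode: $1 is the first whitespace-delimited token, i.e. the first line of the tree
    awk -F "(" '{ print $NF }' |            # keep the text after the last "(", e.g. "1856)"
    awk -F ")" '{ print $1 }'               # keep the text before ")", e.g. "1856"
}
```

It would be used as `pstree -p $SGE_PID | last_pid_on_first_line`.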
    
    As I said, this works fine for serial applications.
    Now I want to checkpoint and restart OpenMP multithreaded applications.
    If I use pstree for such an application running from an SGE script with
    SGE_PID=2444, I get
    
    # pstree -p 2444
    438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
                                 |               |-{my_exec}(1858)
                                 |               |-{my_exec}(1859)
                                 |               `-{my_exec}(1860)
                                 `-{cr_restart}(2447)
    
    
    The SGE checkpointing script then gets the PID 1861 (in this case), which
    is the last PID on the first line.
    After this, the SGE script would run
    
    cr_checkpoint --run 1861
    
    
    Finally, my question is: is it fine to just give this PID to cr_checkpoint,
    or should I give 1856 in this case?
    I suspected that I had to give the (parent) PID 1856, but in the tests I
    have done so far it seems to work anyway when just giving 1861. Is this
    fortuitous? Should I change the script to get the "parent" PID?
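If the "parent" PID does turn out to be the right one to pass, one possible way to get it is to drop the thread entries before picking the last PID, since pstree prints threads in curly braces (for example `{my_exec}(1861)`). The following is a minimal sketch under that assumption. The function name is hypothetical, it relies on `grep -o` being available, and it has not been tested inside the SGE scripts.

```shell
# Hypothetical helper (not part of the SGE/BLCR scripts): reads `pstree -p`
# output on stdin and prints the PID of the last process on the first line,
# skipping thread entries, which pstree shows as {name}(id).
last_process_pid() {
    head -n 1 |                          # only the first line of the tree
    sed 's/{[^}]*}([0-9]*)//g' |         # remove thread entries such as {my_exec}(1861)
    grep -o '([0-9]*)' |                 # list the remaining "(PID)" groups, one per line
    tail -n 1 |                          # keep the last one
    tr -d '()'                           # strip the parentheses
}
```

For the tree shown above, `pstree -p 2444 | last_process_pid` would give 1856 instead of 1861, and for a serial job it would give the same PID as the current line in the script.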
    
    Thank you in advance for your help,
    
    best regards,
    
    adolfo
    
    -- 
    Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
    
