Re: checkpointing (OpenMP) multithreaded applications within SGE

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 31 2008 - 13:13:06 PST

  • Next message: Adolfo J. Banchio: "Re: checkpointing (OpenMP) multithreaded applications within SGE"
    Adolfo,
    
       I am not 100% certain how OpenMP spawns its threads, so your current
    approach may be correct or may be working only by luck.  I do believe,
    however, that checkpointing the parent (1856 in your example) should always
    be safe/correct.
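
A minimal sketch of how a script could map a thread ID (such as 1861) back to the PID of the process that owns it. It assumes a Linux /proc filesystem; the helper name tgid_of is hypothetical and is not part of BLCR or SGE.

```shell
#!/bin/sh
# Print the thread-group ID (the PID of the owning process) for a Linux
# task ID.  Assumption: /proc is mounted and the task is still alive.
# tgid_of is a hypothetical helper name used only for illustration.
tgid_of() {
    awk '/^Tgid:/ { print $2 }' "/proc/$1/status"
}
```

Given the pstree output quoted further down, `tgid_of 1861` would be expected to print 1856, since 1861 is shown there as a thread of my_exec(1856).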
    
       However, I am not sure that all of this is necessary anymore (See below). 
    Since 0.7.0, the cr_restart executable has not only been multithreaded, it has 
    also been smart enough to transparently exclude itself from the checkpoint. 
    So, ideally the PID you want to checkpoint in your example would be 2446 
    rather than 1861 or 1856.  That is certainly the easier PID to locate in the 
    pstree output.
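
A minimal sketch of picking the cr_restart PID out of `pstree -p` output. The function name cr_restart_pid is hypothetical, and `grep -o` is a common extension (GNU and BSD grep) rather than strict POSIX. Threads are printed by pstree as {cr_restart}(N), so they do not match the pattern.

```shell
#!/bin/sh
# Read `pstree -p` output on stdin and print the PID of the first
# cr_restart process found.  Thread entries such as {cr_restart}(2447)
# are skipped because the pattern requires "(" right after the name.
# Prints nothing if no cr_restart process appears in the input.
cr_restart_pid() {
    grep -o 'cr_restart([0-9][0-9]*)' | head -n 1 | tr -cd '0-9\n'
}

# Hypothetical use inside an SGE checkpoint script:
#   pid=$(pstree -p "$SGE_PID" | cr_restart_pid)
#   [ -n "$pid" ] && cr_checkpoint --run "$pid"
```

A job that has never been restarted has no cr_restart process in its tree, so a real script would still need a fallback for that case.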
    
    
    Prior to 0.7, checkpointing PID 123 in
         cr_restart(123)--a.out(456)
    
    would have resulted in a subsequent restart like
         cr_restart(789)--cr_restart(123)--a.out(456)
    
    This was a reason to parse pstree to find "456".
    
    However, with 0.7.0 and newer if you checkpoint 123 in the following
         cr_restart(123)-+-a.out(456)
                         +-cr_restart(124)
    
    the restarted result should be something like
         cr_restart(789)-+-a.out(456)
                         +-cr_restart(790)
    
    In fact, a.out could have nearly arbitrary children (not just multiple 
    threads) and BLCR 0.7.0 and newer should do the right thing when given PID 123 
    in this example.
    
    If you observe something different than I describe above, please let us know.
    
    It might even be possible to checkpoint $SGE_PID now, but I am not certain of 
    that.  I recommend that you try, because you may be pleasantly surprised.
    
    Please let us know the outcome of trying my suggestions.
    If possible, it would be nice if you could contribute your findings back to 
    the SGE community, perhaps resulting in an update to the integration document.
    
    -Paul
    
    
    Adolfo J. Banchio wrote:
    > We are using BLCR checkpoint with SGE, for migrating
    > and restarting jobs.
    > 
    > The script we use is a slightly modified version of
    > the ones suggested in the corresponding integration document
    > (especially after version 0.7).
    > The scripts used for migration and for restart basically
    > determine the PID of the running process to be checkpointed using
    > pstree. The modified script (for blcr >= 0.7) has a line like this
    > 
    > pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | \
    >     awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'
    > 
    > Here, $SGE_PID is the PID of the SGE execution shell. This line, as
    > is, works fine to get the PID of a "serial" running process which should
    > be checkpointed, killed and restarted/migrated (I had to modify the
    > original script because cr_restart is threaded in the new releases,
    > which the original script did not take into account).
    > 
    > As I said, this works fine, for serial applications. 
    > Now, I want to checkpoint and restart OpenMP multithreaded applications.
    > If I use pstree for such an application running from a SGE script with
    > SGE_PID=2444, I get 
    > 
    > # pstree -p 2444
    > 438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
    >                              |               |-{my_exec}(1858)
    >                              |               |-{my_exec}(1859)
    >                              |               `-{my_exec}(1860)
    >                              `-{cr_restart}(2447)
    > 
    > 
    > And the SGE-checkpointing script gets the PID 1861 (in this case), which
    > is the last PID in the first line. 
    > After this the SGE-script would run
    > 
    > cr_checkpoint --run 1861
    > 
    > 
    > Finally, my question is: is it fine to just give this PID to cr_checkpoint,
    > or should I give 1856 in this case?
    > I suspected that I had to give the (parent) PID 1856, but as far as I have
    > tested, it seems to work anyway when just giving 1861. Is this
    > fortuitous? Should I change the script to get the "parent" PID?
    > 
    > Thank you in advance for your help,
    > 
    > best regards,
    > 
    > adolfo
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
