Re: checkpointing (OpenMP) multithreaded applications within SGE

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Mon Nov 03 2008 - 05:52:26 PST

    Paul,
    
    thank you for your prompt reply, and help.
    
    I ran some tests and found that when the checkpointed threaded
    process is restarted on a different node, and the checkpoint was
    taken from a child, it ends with a core dump and a "pthreads
    error".
    
    But, following your suggestions, I have changed the scripts to
    take the PID of the cr_restart process or, if the SGE job has not
    been restarted yet, just the second PID in the pstree output.
    This seems to work fine.
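
    The selection rule just described can be sketched roughly as
    follows. This is only an illustration, not the actual modified
    script: the function name pick_checkpoint_pid is hypothetical, and
    it assumes a grep that supports -o and -E and `pstree -p` output in
    the usual name(pid) / {thread}(pid) form.

```shell
#!/bin/sh
# Hypothetical helper (not part of the BLCR or SGE scripts): reads
# `pstree -p` output on stdin and prints the PID to checkpoint.
pick_checkpoint_pid() {
    tree=$(cat)

    # Case 1: a cr_restart process is in the tree (restarted job).
    # Its threads are printed as {cr_restart}(pid), so they do not match.
    pid=$(printf '%s\n' "$tree" | grep -oE 'cr_restart\([0-9]+\)' |
          head -n 1 | sed 's/[^0-9]//g')

    # Case 2: no cr_restart yet; fall back to the second PID in the output.
    if [ -z "$pid" ]; then
        pid=$(printf '%s\n' "$tree" | grep -oE '\([0-9]+\)' |
              sed -n '2p' | tr -d '()')
    fi

    printf '%s\n' "$pid"
}
```

    It would be fed with something like
    pstree -p "$SGE_PID" | pick_checkpoint_pid
    where the first PID in the output is the SGE execution shell itself.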
    
    Checkpointing the SGE job directly does not work ("Checkpoint failed:
    support missing from application"). In any case, I think it is not
    desirable, since that way SGE would not be aware that the
    process has to be migrated/restarted.
    
    I will post my suggested changes to the blcr scripts to the SGE 
    community.
    
    best regards,
    
    adolfo
    
    
    
    
    On Fri, 2008-10-31 at 14:13 -0700, Paul H. Hargrove wrote:
    > Adolfo,
    > 
    >    I am not 100% certain how OpenMP is spawning threads.  So, it could be OK 
    > or just fortuitous that your current approach is working.  However, I can say 
    > that I believe that the parent (1856 in your example) should always be 
    > safe/correct.
    > 
    >    However, I am not sure that all of this is necessary anymore (See below). 
    > Since 0.7.0, the cr_restart executable has not only been multithreaded, it has 
    > also been smart enough to transparently exclude itself from the checkpoint. 
    > So, ideally the PID you want to checkpoint in your example would be 2446 
    > rather than 1861 or 1856.  That is certainly the easier PID to locate in the 
    > pstree output.
    > 
    > 
    > Prior to 0.7, use of PID 123 on
    >      cr_restart(123)--a.out(456)
    > 
    > would have resulted in a subsequent restart like
    >      cr_restart(789)--cr_restart(123)--a.out(456)
    > 
    > This was a reason to parse pstree to find "456".
    > 
    > However, with 0.7.0 and newer if you checkpoint 123 in the following
    >      cr_restart(123)-+-a.out(456)
    >                      +-cr_restart(124)
    > 
    > the restarted result should be something like
    >      cr_restart(789)-+-a.out(456)
    >                      |-cr_restart(790)
    > 
    > In fact, a.out could have nearly arbitrary children (not just multiple 
    > threads) and BLCR 0.7.0 and newer should do the right thing when given PID 123 
    > in this example.
    > 
    > If you observe something different than I describe above, please let us know.
    > 
    > It might even be possible to checkpoint $SGE_PID now, but I am not certain of 
    > that.  I recommend that you try, because you may be pleasantly surprised.
    > 
    > Please let us know of the outcome with my suggestions.
    > If possible, it would be nice if you could contribute your findings back to 
    > the SGE community, perhaps resulting in an update to the integration document.
    > 
    > -Paul
    > 
    > 
    > Adolfo J. Banchio wrote:
    > > We are using BLCR checkpoint with SGE, for migrating
    > > and restarting jobs.
    > > 
    > > The script we use is a slightly modified version of
    > > the ones suggested in the corresponding integration document
    > > (especially after version 0.7).
    > > The scripts used for migration and for restart, basically 
    > > determine the PID of the running process to be checkpointed using
    > > pstree. The modified script (for blcr >= 0.7) has a line like this
    > > 
    > > pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | \
    > >     awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'
    > > 
    > > Here, $SGE_PID is the PID of the SGE execution shell. This line, as
    > > is, works fine to get the PID of a "serial" running process that should
    > > be checkpointed, killed, and restarted/migrated (I had to modify it
    > > because cr_restart is threaded in the new releases, which the
    > > original script did not take into account).
    > > 
    > > As I said, this works fine, for serial applications. 
    > > Now, I want to checkpoint and restart OpenMP multithreaded applications.
    > > If I use pstree for such an application running from a SGE script with
    > > SGE_PID=2444, I get 
    > > 
    > > # pstree -p 2444
    > > 438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
    > >                              |               |-{my_exec}(1858)
    > >                              |               |-{my_exec}(1859)
    > >                              |               `-{my_exec}(1860)
    > >                              `-{cr_restart}(2447)
    > > 
    > > 
    > > And the SGE-checkpointing script gets the PID 1861 (in this case), which
    > > is the last PID in the first line. 
    > > After this the SGE-script would run
    > > 
    > > cr_checkpoint --run 1861
    > > 
    > > 
    > > Finally, my question is: is it fine to just give this PID to
    > > cr_checkpoint, or should I give 1856 in this case?
    > > I suspected that I had to give the (parent) PID 1856, but as far as I
    > > have tested, it seems to work anyway when just giving 1861. Is this
    > > fortuitous? Should I change the script to get the "parent" PID?
    > > 
    > > Thank you in advance for your help,
    > > 
    > > best regards,
    > > 
    > > adolfo
    > > 
    > > 
    > > 
    > > 
    > > 
    > > 
    > > 
    > > 
    > 
    > 
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    -- 
    Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
    
