From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Oct 31 2008 - 13:13:06 PST
Adolfo,

I am not 100% certain how OpenMP is spawning threads. So, it could be OK or just fortuitous that your current approach is working. However, I can say that I believe that the parent (1856 in your example) should always be safe/correct.

That said, I am not sure that all of this is necessary anymore (see below). Since 0.7.0, the cr_restart executable has not only been multithreaded, it has also been smart enough to transparently exclude itself from the checkpoint. So, ideally the PID you want to checkpoint in your example would be 2446 rather than 1861 or 1856. That is certainly the easier PID to locate in the pstree output.

Prior to 0.7, use of PID 123 on
    cr_restart(123)--a.out(456)
would have resulted in a subsequent restart like
    cr_restart(789)--cr_restart(123)--a.out(456)
This was a reason to parse pstree to find "456".

However, with 0.7.0 and newer, if you checkpoint 123 in the following
    cr_restart(123)-+-a.out(456)
                    +-cr_restart(124)
the restarted result should be something like
    cr_restart(789)-+-a.out(456)
                    |-cr_restart(790)
In fact, a.out could have nearly arbitrary children (not just multiple threads) and BLCR 0.7.0 and newer should do the right thing when given PID 123 in this example. If you observe something different from what I describe above, please let us know.

It might even be possible to checkpoint $SGE_PID now, but I am not certain of that. I recommend that you try, because you may be pleasantly surprised.

Please let us know the outcome of trying my suggestions. If possible, it would be nice if you could contribute your findings back to the SGE community, perhaps resulting in an update to the integration document.

-Paul

Adolfo J. Banchio wrote:
> We are using BLCR checkpoint with SGE, for migrating
> and restarting jobs.
>
> The script we use is a slightly modified version of
> the ones suggested in the corresponding integration document
> (especially after version 0.7).
> The scripts used for migration and for restart basically
> determine the PID of the running process to be checkpointed using
> pstree. The modified script (for blcr >= 0.7) has a line like this
>
> pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'
>
> Here, $SGE_PID is the PID of the SGE execution shell. This line, as
> is, works fine to get the PID of a "serial" running process which should
> be checkpointed, killed and restarted/migrated (the modification that I
> had to make was because in the new releases cr_restart is
> threaded, and this was not considered in the original script).
>
> As I said, this works fine for serial applications.
> Now, I want to checkpoint and restart OpenMP multithreaded applications.
> If I use pstree for such an application running from an SGE script with
> SGE_PID=2444, I get
>
> # pstree -p 2444
> 438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
>                              |               |-{my_exec}(1858)
>                              |               |-{my_exec}(1859)
>                              |               `-{my_exec}(1860)
>                              `-{cr_restart}(2447)
>
> And the SGE checkpointing script gets the PID 1861 (in this case), which
> is the last PID in the first line.
> After this the SGE script would run
>
> cr_checkpoint --run 1861
>
> Finally, my question is: Is it fine to just give this PID to cr_checkpoint,
> or should I give 1856 in this case?
> I suspected that I had to give the (parent) PID 1856, but as far as I have
> tested, it seems to work anyway when just giving 1861. Is this
> fortuitous? Should I change the script to get the "parent" PID?
>
> Thank you in advance for your help,
>
> best regards,
>
> adolfo

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
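
For reference, the PID selection discussed above (taking the cr_restart PID, 2446 in the example, instead of the last PID on the first pstree line) could be sketched roughly as follows. This is only an illustration, not part of the SGE integration scripts: the function name child_pid_from_pstree is hypothetical, and the sketch assumes the `pstree -p` layout shown in the quoted message, with no process name that itself contains a parenthesized number.

```shell
# Hypothetical helper: read `pstree -p` output on stdin and print the
# second PID on the first line, i.e. the direct child of the root
# process (cr_restart, when the root is the SGE execution shell).
# Assumes PIDs appear as "(digits)" exactly as in the example above.
child_pid_from_pstree() {
    awk 'NR == 1 {
        s = $0
        n = 0
        # Walk the "(digits)" groups from left to right.
        while (match(s, /\([0-9]+\)/)) {
            if (++n == 2) {
                # Strip the surrounding parentheses.
                print substr(s, RSTART + 1, RLENGTH - 2)
                break
            }
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

# Possible use inside a checkpoint script (not executed here):
#   pid=$(pstree -p "$SGE_PID" | child_pid_from_pstree)
#   cr_checkpoint --run "$pid"
```

With the pstree output from the quoted message this selects 2446. It prints nothing if the root process has no child, so a calling script should check for an empty result before invoking cr_checkpoint.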