From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Thu Oct 30 2008 - 06:46:37 PST
We are using BLCR checkpointing with SGE to migrate and restart jobs. The scripts we use are slightly modified versions of the ones suggested in the corresponding integration document (adapted in particular for BLCR 0.7 and later).

The scripts used for migration and for restart basically determine the PID of the running process to be checkpointed using pstree. The modified script (for blcr >= 0.7) has a line like this:

    pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'

Here, $SGE_PID is the PID of the SGE execution shell. This line, as is, works fine to get the PID of a "serial" running process which should be checkpointed, killed and restarted/migrated. (I had to make the modification because in recent releases cr_restart is itself threaded, which the original script did not take into account.)

As I said, this works fine for serial applications. Now I want to checkpoint and restart OpenMP multithreaded applications. If I use pstree for such an application running from an SGE script with SGE_PID=2444, I get

    # pstree -p 2444
    438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
                                 |               |-{my_exec}(1858)
                                 |               |-{my_exec}(1859)
                                 |               `-{my_exec}(1860)
                                 `-{cr_restart}(2447)

and the SGE checkpointing script gets the PID 1861 (in this case), which is the last PID on the first line. After this, the SGE script would run

    cr_checkpoint --run 1861

Finally, my question is: is it fine to just give this PID to cr_checkpoint, or should I give it 1856 in this case? I suspected that I had to give the (parent) PID 1856, but as far as I have tested, it seems to work anyway when given just 1861. Is this fortuitous? Should I change the script to get the "parent" PID?

Thank you in advance for your help,
best regards,
adolfo

-- 
Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
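
For reference, here is a minimal sketch of how the script could pick the process PID (1856 in the example above) instead of a thread ID. It relies on pstree printing threads in braces, as in {my_exec}(1861). The function name leaf_process_pid is hypothetical and not part of BLCR or SGE. The sketch assumes that process names contain no parentheses or braces, and that grep supports -o. It does not settle whether cr_checkpoint needs the process PID.

```shell
#!/bin/sh
# Hypothetical helper (not part of BLCR or SGE): read `pstree -p` output on
# stdin and print the last non-thread PID found on the first line.
leaf_process_pid() {
    head -n 1 |                          # only the first line of the tree
        sed 's/{[^}]*}([0-9]*)//g' |     # drop thread entries such as {my_exec}(1861)
        grep -o '([0-9][0-9]*)' |        # keep each remaining "(PID)" token
        tail -n 1 |                      # the last one is the deepest process
        tr -d '()'                       # strip the parentheses
}

# Possible use inside the checkpoint script (assumes SGE_PID is already set):
#   APP_PID=$(pstree -p "$SGE_PID" | leaf_process_pid)
```

For a serial job the first line contains no thread entries, so this should give the same PID as the awk pipeline in the script.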