checkpointing (OpenMP) multithreaded applications within SGE

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Thu Oct 30 2008 - 06:46:37 PST

Next message: Paul H. Hargrove: "Re: Checkpointing"

Previous message: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"
In reply to: Paul H. Hargrove: "Re: kernel oops with blcr-0.7.3"
Next in thread: Paul H. Hargrove: "Re: checkpointing (OpenMP) multithreaded applications within SGE"
Reply: Paul H. Hargrove: "Re: checkpointing (OpenMP) multithreaded applications within SGE"

We are using BLCR checkpointing with SGE to migrate and restart jobs.

The script we use is a slightly modified version of the ones suggested
in the corresponding integration document (modified mainly for BLCR
versions after 0.7).
The migration and restart scripts basically determine the PID of the
running process to be checkpointed using pstree. The modified script
(for blcr >= 0.7) has a line like this:

pstree -p $SGE_PID | awk 'BEGIN { RS="" }; { print $1 }' | \
    awk -F "(" '{ print $NF }' | awk -F ")" '{ print $1 }'

Here, $SGE_PID is the PID of the SGE execution shell. As is, this line
correctly finds the PID of a "serial" running process that should be
checkpointed, killed, and restarted/migrated. (I had to modify it
because cr_restart is threaded in the new releases, which the original
script did not account for.)
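
To illustrate what that line extracts, here is a minimal sketch that runs
the same awk pipeline on a captured sample instead of live pstree output.
The helper name last_pid_on_first_line and the sample tree are only
assumptions for illustration, they are not part of the integration scripts.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative only): reads `pstree -p`
# output on stdin and prints the last PID found on the first line.
last_pid_on_first_line() {
    # RS="" makes awk read the whole tree as one record, so $1 is the
    # first whitespace-delimited token, i.e. the first line of the tree.
    awk 'BEGIN { RS="" }; { print $1 }' |
        # Keep the text after the last "(" ...
        awk -F "(" '{ print $NF }' |
        # ... and drop the closing ")" to leave the bare PID.
        awk -F ")" '{ print $1 }'
}

# Sample tree for a serial job (layout copied by hand, not live output).
sample='438(2444)---cr_restart(2446)-+-my_exec(1856)
                             `-{cr_restart}(2447)'

# Prints the PID of my_exec from the sample above.
printf '%s\n' "$sample" | last_pid_on_first_line
```

In the live script the input would come from pstree -p $SGE_PID rather
than from the sample variable.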

As I said, this works fine for serial applications.
Now I want to checkpoint and restart OpenMP multithreaded applications.
If I run pstree on such an application started from an SGE script with
SGE_PID=2444, I get:

# pstree -p 2444
438(2444)---cr_restart(2446)-+-my_exec(1856)-+-{my_exec}(1861)
                             |               |-{my_exec}(1858)
                             |               |-{my_exec}(1859)
                             |               `-{my_exec}(1860)
                             `-{cr_restart}(2447)


The SGE checkpointing script then picks PID 1861 (in this case), which
is the last PID on the first line.
After this, the SGE script would run:

cr_checkpoint --run 1861


Finally, my question is: is it fine to just give this PID to
cr_checkpoint, or should I give it 1856 in this case?
I suspected that I had to give the (parent) PID 1856, but as far as I
have tested, it seems to work anyway when given just 1861. Is this
fortuitous? Should I change the script to get the "parent" PID?
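
To make the second option concrete, here is a minimal sketch of how the
script could map a thread ID such as 1861 back to the PID of the process
that owns it. It relies only on the Tgid field that Linux reports in
/proc/<id>/status. The helper name tgid_of is an assumption for
illustration, and the sketch says nothing about which of the two PIDs
cr_checkpoint actually expects.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative only): print the thread-group
# ID, i.e. the PID of the owning process, for a given PID or thread ID.
# Assumes a Linux /proc filesystem.
tgid_of() {
    awk '$1 == "Tgid:" { print $2 }' "/proc/$1/status"
}

# Self-check: a single-threaded shell is its own thread-group leader,
# so this prints the shell's own PID.
tgid_of "$$"
```

With this in place, the script could call cr_checkpoint --run "$(tgid_of 1861)"
instead of passing 1861 directly.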

Thank you in advance for your help,

best regards,

adolfo

-- 
Adolfo J. Banchio <banchio_at_famaf_dot_unc_dot_edu.ar>
