From: Alan Woodland (alan.woodland_at_gmail_dot_com)
Date: Wed Nov 04 2009 - 01:37:29 PST
2009/11/3 Sergio Díaz <[email protected]>
> I can do checkpointing of an easy program without SGE (just in one compute with 2 MPI processes
> for instance). Now, I'm trying to do the integration openmpi+sge but I have some problems... When
> I try to do checkpoint of the mpirun PID, I got an error similar to the error gotten when the PID
> doesn't exist. The example below.

That error looks like the one you get when the job wasn't started with "-am ft-enable-cr" passed to mpirun. Given that the output you pasted shows "-am ft-enable-cr" was present, I suspect that something went wrong during the startup of mpirun. Do you have logs of std{out,err} from this at all? IIRC, if checkpointing setup fails in Open MPI at startup for some reason, a few messages get printed and things just carry on regardless. Is there anything helpful in the verbose/debug output too?

> Is there a script to do it automatically with SGE? For instance, to do checkpointing each X seconds
> with BLCR and non-MPI jobs, there is a script that I adapted to my case. It is launched by SGE if
> you have configured the queue and the ckpt environment.

I've never used SGE, only Condor, and I've never done MPI+BLCR+Condor, so I can't really help there, I'm afraid. Is it possible SGE is making MPI use a transport other than sm, tcp or self? I'm not sure whether the checkpointing code works with other transports.

> Is it possible to choose the name of the ckpt folder when you do the ompi-checkpoint? I can't find the
> option to do it.

I think mpirun --tmpdir might help with this one?

[snip]

Alan
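
For reference, a minimal dry-run sketch of the kind of launch discussed above. It only prints the command lines, so it can be read or run without Open MPI installed. The application name, process count, log file names and the example PID are hypothetical placeholders, and pinning the transports with "-mca btl sm,tcp,self" is just one way to rule out the transport question, not a confirmed fix.

```shell
#!/bin/sh
# Dry run: prints the commands instead of executing them.
# APP, NP and the log file names are hypothetical placeholders.
APP=./my_mpi_app
NP=2

launch_cmd() {
    # -am ft-enable-cr     : start the job with checkpoint/restart enabled
    # -mca btl sm,tcp,self : restrict MPI to the transports named above
    # stdout/stderr go to files so any startup messages are kept
    echo "mpirun -am ft-enable-cr -mca btl sm,tcp,self -np $NP $APP > mpirun.out 2> mpirun.err"
}

checkpoint_cmd() {
    # $1 is the PID of mpirun itself, not of an individual MPI process
    echo "ompi-checkpoint $1"
}

launch_cmd
checkpoint_cmd 12345   # 12345 stands in for the real mpirun PID
```

If the checkpoint still fails with the transports pinned like this, mpirun.out and mpirun.err are the first place to look for the startup messages.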