Re: trying to integrate OpenMPI+BLCR+SGE

From: Alan Woodland (alan.woodland_at_gmail_dot_com)
Date: Wed Nov 04 2009 - 01:37:29 PST

Next message: Josh Hursey: "Re: trying to integrate OpenMPI+BLCR+SGE"

Previous message: colin hu: "dimmer"
In reply to: Sergio Díaz: "trying to integrate OpenMPI+BLCR+SGE"
Next in thread: Sergio Díaz: "Re: trying to integrate OpenMPI+BLCR+SGE"
Reply: Sergio Díaz: "Re: trying to integrate OpenMPI+BLCR+SGE"

2009/11/3 Sergio Díaz <[email protected]>
> I can checkpoint a simple program without SGE (for instance, on just one compute node with 2 MPI
> processes). Now I'm trying to integrate openmpi+sge, but I have some problems... When I try to
> checkpoint the mpirun PID, I get an error similar to the one I get when the PID doesn't exist.
> The example is below.

That looks like the error you get when the job wasn't started with
"-am ft-enable-cr" passed to mpirun. Given that the output you pasted
shows "-am ft-enable-cr" was present, I suspect that something went
wrong during the startup of mpirun. Do you have logs of std{out,err}
from this at all? IIRC, if checkpointing setup fails in OpenMPI at
startup for some reason, a few messages get printed and things just
carry on regardless. Is there anything helpful in the verbose/debug
output too?
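
Something along these lines would keep the two streams in separate
files so any startup messages are easy to find. This is only a sketch:
./my_mpi_app, the process count and the log file names are made-up
placeholders, and it only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust the application path and process count.
APP=${APP:-./my_mpi_app}
NP=${NP:-2}

# Build the launch command with checkpoint/restart support requested.
launch_cmd() {
    echo "mpirun -np $NP -am ft-enable-cr $APP"
}

# Dry run by default so the command can be inspected first.
# With RUN=1 the job is started, stdout goes to mpirun.out and
# stderr goes to mpirun.err.
if [ "${RUN:-0}" = "1" ]; then
    $(launch_cmd) > mpirun.out 2> mpirun.err
else
    echo "$(launch_cmd) > mpirun.out 2> mpirun.err"
fi
```

Anything printed into mpirun.err before the application's own output
starts would be the first place to look.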

> Is there a script to do it automatically with SGE? For instance, to checkpoint every X seconds
> with BLCR and non-MPI jobs, there is a script that I adapted to my case. It is launched by SGE if
> you have configured the queue and the ckpt environment.

I've never used SGE, only Condor, and I've never done MPI+BLCR+Condor,
so I can't really help there, I'm afraid. Is it possible SGE is making
MPI use a transport other than sm, tcp or self? I'm not sure if the
checkpointing code works with other transports.
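
One way to rule that out might be to pin the transports explicitly on
the mpirun command line via the btl MCA parameter. Again just a sketch
under assumptions: ./my_mpi_app and the process count are placeholders,
I'm assuming the command-line setting is honoured alongside
"-am ft-enable-cr", and it only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust the application path and process count.
APP=${APP:-./my_mpi_app}
NP=${NP:-2}
# Restrict Open MPI to the shared-memory, TCP and self transports.
BTLS=${BTLS:-sm,tcp,self}

# Build the launch command with the transport list pinned.
btl_cmd() {
    echo "mpirun -np $NP --mca btl $BTLS -am ft-enable-cr $APP"
}

# Dry run by default; set RUN=1 to actually start the job.
if [ "${RUN:-0}" = "1" ]; then
    $(btl_cmd)
else
    btl_cmd
fi
```

If checkpointing works with the list pinned like this but not without
it, that would point at the transport selection under SGE.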

> Is it possible to choose the name of the ckpt folder when you run ompi-checkpoint? I can't find the
> option to do it.

I think mpirun --tmpdir might help with this one?

[snip]

Alan
