From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 25 2010 - 21:07:11 PDT
fengguang,

I believe that most MPI implementations will TRY to "do the right thing" if signaled with SIGTERM or SIGINT (SIGTERM is the default for the kill command). However, mpirun cannot always do so if things are in a bad state, such as when processes are hung. Nor can it do so if you send SIGKILL, which gives mpirun no opportunity to kill the application processes.

In your case, you indicated in a previous email that ompi-checkpoint hangs for you. That is probably a good indication that the MPI job is in the sort of "bad state" I warned about above. So, you might need to manually kill the MPI application processes on all the nodes now. It is possible that Open MPI includes a command to assist with that, but if so I don't know what it is.

-Paul

fengguang tian wrote:
> I have killed the original MPI job manually on the master node using the
> kill command, and then I restarted the job, so it couldn't be that reason.
>
> Or do I need to kill the processes on both the master and slave nodes?
>
> cheers
> fengguang
>
> On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove
> <PHHargrove_at_lbl_dot_gov> wrote:
>
>     The message says that there are some pids (process IDs) in use
>     (allocated to running processes) that are needed for the restart.
>     This typically happens if one tries to restart when the original
>     run has not yet exited, for instance if there are portions of it hung.
>
>     With very large clusters it becomes a statistically significant
>     possibility that one could have a few random collisions with other
>     processes on the nodes.
>     However, given the number and grouping of the pids, I strongly
>     suspect the original MPI job is still running or is hung.
>
>     -Paul
>
>     fengguang tian wrote:
>
>         Hi
>
>         When I use ompi-restart to restart the checkpoint file on
>         clusters (using Open MPI), an error happens. It shows:
>
>         - found pid 4813 in use
>         - found pid 4824 in use
>         - found pid 4827 in use
>         Restart failed: Device or resource busy
>         - found pid 4812 in use
>         - found pid 4822 in use
>         - found pid 4823 in use
>         Restart failed: Device or resource busy
>         - found pid 4815 in use
>         - found pid 4828 in use
>         - found pid 4829 in use
>         Restart failed: Device or resource busy
>         - found pid 4818 in use
>         - found pid 4819 in use
>         Restart failed: Device or resource busy
>         - found pid 4814 in use
>         - found pid 4825 in use
>         - found pid 4826 in use
>         Restart failed: Device or resource busy
>
>         Why would this happen?
>
>         cheers
>         fengguang

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
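
Note: to see whether the pids named in the "found pid ... in use" messages are still allocated on a given node, something like the following Python sketch might help. It is only an illustration and not part of Open MPI. The function names parse_busy_pids and pid_in_use are hypothetical, and it assumes a POSIX system. Since pids are per-node, it would have to be run on each node against the messages that came from that node.

```python
import os
import re
from typing import List

# Matches lines such as "- found pid 4813 in use" from the restart output.
PID_LINE = re.compile(r"found pid (\d+) in use")


def parse_busy_pids(restart_output: str) -> List[int]:
    """Extract, in order, the pids that the restart reported as in use."""
    return [int(m.group(1)) for m in PID_LINE.finditer(restart_output)]


def pid_in_use(pid: int) -> bool:
    """Return True if a process with this pid exists on the local node.

    Signal 0 performs the existence and permission checks of kill()
    without delivering any signal (POSIX behavior; assumed here).
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the pid is free
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def still_running(restart_output: str) -> List[int]:
    """Return the reported pids that are still allocated on this node."""
    return [pid for pid in parse_busy_pids(restart_output) if pid_in_use(pid)]
```

Any pid that still_running returns is still held by some process on that node. If it belongs to the old MPI job, it would need to exit or be killed there before the restart is retried.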