From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 18:23:13 PDT
I killed the original MPI job manually on the master node with the kill command and then restarted the job, so that can't be the reason. Or do I need to kill the processes on both the master and the slave nodes?

cheers
fengguang

On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> The message says that there are some pids (process IDs) in use (allocated
> to running processes) that are needed for the restart.
> This typically happens if one tries to restart when the original run has
> not yet exited, for instance if there are portions of it hung.
>
> With very large clusters it becomes a statistically significant possibility
> that one could have a few random collisions with other processes on the
> nodes.
> However, given the number and grouping of the pids, I strongly suspect the
> original MPI job is still running or is hung.
>
> -Paul
>
>
> fengguang tian wrote:
>
>> Hi
>>
>> When I use ompi-restart to restart from the checkpoint file on a cluster
>> (using Open MPI), an error occurs. It shows:
>> - found pid 4813 in use
>> - found pid 4824 in use
>> - found pid 4827 in use
>> Restart failed: Device or resource busy
>> - found pid 4812 in use
>> - found pid 4822 in use
>> - found pid 4823 in use
>> Restart failed: Device or resource busy
>> - found pid 4815 in use
>> - found pid 4828 in use
>> - found pid 4829 in use
>> Restart failed: Device or resource busy
>> - found pid 4818 in use
>> - found pid 4819 in use
>> Restart failed: Device or resource busy
>> - found pid 4814 in use
>> - found pid 4825 in use
>> - found pid 4826 in use
>> Restart failed: Device or resource busy
>>
>>
>> Why would this happen?
>>
>> cheers
>> fengguang
>>
>
>
> --
> Paul H. Hargrove PHHargrove_at_lbl_dot_gov
> Future Technologies Group Tel: +1-510-495-2352
> HPC Research Department Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
>
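
One way to test whether leftover processes are the cause is to check, on each node that reported the error, whether the pids named in the restart output are still allocated. The following is only a minimal sketch and not part of Open MPI or BLCR. The helper names `parse_pids` and `pid_in_use` are made up for illustration, and it assumes a POSIX system where sending signal 0 checks for a process without affecting it.

```python
import os
import re

# Matches lines such as "- found pid 4813 in use" from the restart output.
PID_LINE = re.compile(r"found pid (\d+) in use")


def parse_pids(log_text):
    """Return the pids mentioned in 'found pid N in use' lines, in order."""
    return [int(pid) for pid in PID_LINE.findall(log_text)]


def pid_in_use(pid):
    """Return True if a process with this pid exists on the local node.

    Signal 0 delivers nothing; the kernel only performs the existence and
    permission checks (POSIX behaviour, assumed here).
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process: the pid is free
    except PermissionError:
        return True    # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Sample text copied from the error output quoted in this thread.
    sample = (
        "- found pid 4813 in use\n"
        "- found pid 4824 in use\n"
        "Restart failed: Device or resource busy\n"
    )
    for pid in parse_pids(sample):
        state = "still in use" if pid_in_use(pid) else "free"
        print(f"pid {pid}: {state}")
```

This only inspects the node it runs on, so it would have to be run on the master and on every slave node. A pid that is still in use there points to a leftover or hung process from the original run, or to an unrelated process that happens to hold that pid.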