From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 21:15:29 PDT
Hi Paul,

So is that a bug? What can I do to avoid this error, the "bad state", so that the checkpoint succeeds?

cheers
fengguang

On Fri, Mar 26, 2010 at 12:07 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> fengguang,
>
> I believe that most MPI implementations will TRY to "do the right thing"
> if signaled with a SIGTERM or SIGINT (SIGTERM is the default for the
> kill command). However, they cannot always do so if things are in a bad
> state, such as hung processes. They also cannot do so if you send
> SIGKILL, which gives mpirun no opportunity to kill the application
> processes.
>
> In your case you indicated in a previous email that ompi-checkpoint
> hangs for you. That is probably a good indication that the MPI job is in
> the sort of "bad state" I warned about above. So, you might need to
> manually kill the MPI application processes on all the nodes now. It is
> possible that Open MPI includes a command to assist with that, but if so
> I don't know what it is.
>
> -Paul
>
>
> fengguang tian wrote:
>
>> I have killed the original MPI job manually on the master node using
>> the kill command, and then I restarted the job, so it couldn't be that
>> reason.
>>
>> Or do I need to kill the processes on both the master and slave nodes?
>>
>> cheers
>> fengguang
>>
>> On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove
>> <PHHargrove_at_lbl_dot_gov> wrote:
>>
>> The message says that there are some pids (process IDs) in use
>> (allocated to running processes) that are needed for the restart.
>> This typically happens if one tries to restart while the original
>> run has not yet exited, for instance if portions of it are hung.
>>
>> With very large clusters it becomes a statistically significant
>> possibility that one could have a few random collisions with other
>> processes on the nodes.
>> However, given the number and grouping of the pids, I strongly suspect
>> the original MPI job is still running or is hung.
>>
>> -Paul
>>
>>
>> fengguang tian wrote:
>>
>> Hi,
>>
>> When I use ompi-restart to restart from the checkpoint file on the
>> cluster (using Open MPI), an error happens. It shows:
>>
>> - found pid 4813 in use
>> - found pid 4824 in use
>> - found pid 4827 in use
>> Restart failed: Device or resource busy
>> - found pid 4812 in use
>> - found pid 4822 in use
>> - found pid 4823 in use
>> Restart failed: Device or resource busy
>> - found pid 4815 in use
>> - found pid 4828 in use
>> - found pid 4829 in use
>> Restart failed: Device or resource busy
>> - found pid 4818 in use
>> - found pid 4819 in use
>> Restart failed: Device or resource busy
>> - found pid 4814 in use
>> - found pid 4825 in use
>> - found pid 4826 in use
>> Restart failed: Device or resource busy
>>
>> Why would this happen?
>>
>> cheers
>> fengguang
>>
>> --
>> Paul H. Hargrove          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group Tel: +1-510-495-2352
>> HPC Research Department   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
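The manual cleanup Paul describes (killing leftover MPI application processes by hand) can be sketched with standard tools. The following is a minimal, hypothetical illustration of the per-node step: a throwaway `sleep` stands in for a hung MPI rank, and the process name `sleep` is the placeholder you would replace with your application's binary name.

```shell
# Hypothetical sketch of the per-node cleanup: kill leftover processes by
# exact name with pkill (procps). A background "sleep" stands in for a
# hung MPI application process here.
sleep 300 &                 # stand-in for a hung MPI rank
VICTIM=$!
pkill -TERM -x sleep        # kill every process named exactly "sleep"
wait "$VICTIM" 2>/dev/null  # reap the killed process
if kill -0 "$VICTIM" 2>/dev/null; then
  echo "still running"
else
  echo "killed"
fi
```

On a real cluster you would run the `pkill` line on every node, for example `ssh $node pkill -TERM -x my_mpi_app` for each entry in your hostfile (where `my_mpi_app` is a placeholder for your binary's name); stray `orted` daemons may need the same treatment.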
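To confirm that the pids reported as "in use" really belong to leftover job processes, `ps` can show the owner of each pid on the node that printed the message. A sketch assuming nothing beyond standard `ps`; the `check_pid` helper is illustrative, not an Open MPI command.

```shell
# Hypothetical diagnostic sketch: report which command currently owns a
# pid that ompi-restart claims is "in use".
check_pid() {
  if ps -p "$1" > /dev/null 2>&1; then
    echo "pid $1 in use by: $(ps -o comm= -p "$1")"
  else
    echo "pid $1 is free"
  fi
}
check_pid "$$"   # demo on this shell's own pid, which is always in use
```

On the cluster you would run this for each pid from the error output (4813, 4824, ...) on the node that reported it; if the owners are your application processes, killing them frees the pids the restart needs.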