From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 25 2010 - 17:53:15 PDT
The message says that there are some pids (process IDs) in use (allocated to running processes) that are needed for the restart. This typically happens if one tries to restart when the original run has not yet exited, for instance if portions of it are hung.

With very large clusters it becomes a statistically significant possibility that one could have a few random collisions with other processes on the nodes. However, given the number and grouping of the pids, I strongly suspect the original MPI job is still running or is hung.

-Paul

fengguang tian wrote:
> Hi
>
> when I use ompi-restart to restart the checkpoint file in
> clusters(using open MPI), error happened,it shows:
> - found pid 4813 in use
> - found pid 4824 in use
> - found pid 4827 in use
> Restart failed: Device or resource busy
> - found pid 4812 in use
> - found pid 4822 in use
> - found pid 4823 in use
> Restart failed: Device or resource busy
> - found pid 4815 in use
> - found pid 4828 in use
> - found pid 4829 in use
> Restart failed: Device or resource busy
> - found pid 4818 in use
> - found pid 4819 in use
> Restart failed: Device or resource busy
> - found pid 4814 in use
> - found pid 4825 in use
> - found pid 4826 in use
> Restart failed: Device or resource busy
>
>
> why would this happen?
>
> cheers
> fengguang

--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
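
One way to confirm the diagnosis above is to check, on each node, whether the pids named in the error output still belong to live processes before retrying the restart. The following is a minimal sketch, assuming a POSIX system (Linux) and that it is run on the node that reported the pids. The helper names `pid_in_use` and `pids_still_in_use` are hypothetical and are not part of Open MPI or the checkpointer; the pid list is copied from the first group in the error output.

```python
import os


def pid_in_use(pid: int) -> bool:
    """Return True if a process with this pid currently exists on this node."""
    try:
        # Signal 0 sends nothing; the kernel only performs the existence
        # and permission checks (POSIX behaviour).
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process has this pid
    except PermissionError:
        return True  # a process exists but is owned by another user
    return True


def pids_still_in_use(pids):
    """Return the subset of pids that are still allocated to processes."""
    return [p for p in pids if pid_in_use(p)]


if __name__ == "__main__":
    # Hypothetical input: pids taken from one "found pid ... in use" group.
    reported = [4813, 4824, 4827]
    busy = pids_still_in_use(reported)
    if busy:
        print("still in use:", busy)
    else:
        print("all reported pids are free on this node")
```

If the pids are still in use and turn out to belong to the original MPI job, that job would need to exit (or be killed) before the restart can reclaim them.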