Re: restart failed:Device or resource busy,found pid 4818 in use


From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 18:23:13 PDT

Next message: TK: "Re: question about "cr_save_mmaps_data" function"

Previous message: Paul H. Hargrove: "Re: restart failed:Device or resource busy,found pid 4818 in use"
In reply to: Paul H. Hargrove: "Re: restart failed:Device or resource busy,found pid 4818 in use"
Next in thread: Paul H. Hargrove: "Re: restart failed:Device or resource busy,found pid 4818 in use"
Reply: Paul H. Hargrove: "Re: restart failed:Device or resource busy,found pid 4818 in use"

I killed the original MPI job manually on the master node using the kill
command and then restarted the job, so that cannot be the reason.

Or do I need to kill the processes on both the master and slave nodes?

cheers
fengguang

On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> The message says that there are some pids (process IDs) in use (allocated
> to running processes) that are needed for the restart.
> This typically happens if one tries to restart when the original run has
> not yet exited, for instance if there are portions of it hung.
>
> With very large clusters it becomes a statistically significant possibility
> that one could have a few random collisions with other processes on the
> nodes.
> However, given the number and grouping of the pids, I strongly suspect the
> original MPI job is still running or is hung.
>
> -Paul
>
>
> fengguang tian wrote:
>
>> Hi
>>
>> When I use ompi-restart to restart from a checkpoint file on a cluster
>> (using Open MPI), an error occurs. It shows:
>> - found pid 4813 in use
>> - found pid 4824 in use
>> - found pid 4827 in use
>> Restart failed: Device or resource busy
>> - found pid 4812 in use
>> - found pid 4822 in use
>> - found pid 4823 in use
>> Restart failed: Device or resource busy
>> - found pid 4815 in use
>> - found pid 4828 in use
>> - found pid 4829 in use
>> Restart failed: Device or resource busy
>> - found pid 4818 in use
>> - found pid 4819 in use
>> Restart failed: Device or resource busy
>> - found pid 4814 in use
>> - found pid 4825 in use
>> - found pid 4826 in use
>> Restart failed: Device or resource busy
>>
>>
>> Why would this happen?
>>
>> cheers
>> fengguang
>>
>
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group                 Tel: +1-510-495-2352
> HPC Research Department                   Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
>
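The explanation quoted above, that the listed pids are still allocated to running processes, can be checked directly. The following is a minimal sketch, not part of ompi-restart or any checkpoint tooling: the helper names `pids_from_log` and `pid_in_use` are hypothetical, and it assumes a POSIX system where sending signal 0 with `os.kill` only tests whether a pid exists. Pids are allocated per node, so the check reports only on the node where it runs and would need to be run on the master and on each slave node.

```python
import os
import re

# Matches lines such as "- found pid 4818 in use" from the restart output.
PID_LINE = re.compile(r"found pid (\d+) in use")


def pids_from_log(text):
    """Collect the distinct pids named in 'found pid N in use' lines, sorted."""
    return sorted({int(m.group(1)) for m in PID_LINE.finditer(text)})


def pid_in_use(pid):
    """Return True if a process with this pid currently exists on this node."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the pid
    except ProcessLookupError:
        return False  # no such process: the pid is free
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    sample = (
        "- found pid 4818 in use\n"
        "- found pid 4819 in use\n"
        "Restart failed: Device or resource busy\n"
    )
    for pid in pids_from_log(sample):
        print(pid, "in use" if pid_in_use(pid) else "free")
```

A pid reported as in use on a node after the original job was killed on the master would be consistent with leftover or hung processes from the original run on that node. A zombie process also still holds its pid, so it shows up as in use here as well.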
