From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 21:15:29 PDT
Hi Paul,

So is that a bug? What can I do to avoid this error, the "bad state", so that the checkpoint succeeds?

cheers
fengguang

On Fri, Mar 26, 2010 at 12:07 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:

> fengguang,
>
> I believe that most MPI implementations will TRY to "do the right thing"
> if signaled with a SIGTERM or SIGINT (SIGTERM is the default for the
> kill command). However, they cannot always do so if things are in a bad
> state, such as hung processes. They also cannot do so if you send
> SIGKILL, which gives mpirun no opportunity to kill the application
> processes.
>
> In your case you indicated in a previous email that ompi-checkpoint
> hangs for you. That is probably a good indication that the MPI job is in
> the sort of "bad state" I warned about above. So, you might need to
> manually kill the MPI application processes on all the nodes now. It is
> possible that Open MPI includes a command to assist with that, but if so
> I don't know what it is.
>
> -Paul
>
>
> fengguang tian wrote:
>
>> I have killed the original MPI job manually on the master node using
>> the kill command, and then I restarted the job, so it couldn't be that
>> reason.
>>
>> Or do I need to kill the processes on both the master and slave nodes?
>>
>> cheers
>> fengguang
>>
>> On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove
>> <PHHargrove_at_lbl_dot_gov> wrote:
>>
>> The message says that there are some pids (process IDs) in use
>> (allocated to running processes) that are needed for the restart.
>> This typically happens if one tries to restart while the original
>> run has not yet exited, for instance if portions of it are hung.
>>
>> With very large clusters it becomes a statistically significant
>> possibility that one could have a few random collisions with other
>> processes on the nodes.
>> However, given the number and grouping of the pids, I strongly suspect
>> the original MPI job is still running or is hung.
>>
>> -Paul
>>
>>
>> fengguang tian wrote:
>>
>> Hi,
>>
>> When I use ompi-restart to restart from the checkpoint file on the
>> cluster (using Open MPI), an error happens. It shows:
>>
>> - found pid 4813 in use
>> - found pid 4824 in use
>> - found pid 4827 in use
>> Restart failed: Device or resource busy
>> - found pid 4812 in use
>> - found pid 4822 in use
>> - found pid 4823 in use
>> Restart failed: Device or resource busy
>> - found pid 4815 in use
>> - found pid 4828 in use
>> - found pid 4829 in use
>> Restart failed: Device or resource busy
>> - found pid 4818 in use
>> - found pid 4819 in use
>> Restart failed: Device or resource busy
>> - found pid 4814 in use
>> - found pid 4825 in use
>> - found pid 4826 in use
>> Restart failed: Device or resource busy
>>
>> Why would this happen?
>>
>> cheers
>> fengguang
>>
>> --
>> Paul H. Hargrove          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group Tel: +1-510-495-2352
>> HPC Research Department   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
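The manual cleanup Paul describes (killing leftover MPI application processes by hand) can be sketched with standard tools. The following is a minimal, hypothetical illustration of the per-node step: a throwaway `sleep` stands in for a hung MPI rank, and the process name `sleep` is the placeholder you would replace with your application's binary name.

```shell
# Hypothetical sketch of the per-node cleanup: kill leftover processes by
# exact name with pkill (procps). A background "sleep" stands in for a
# hung MPI application process here.
sleep 300 &                 # stand-in for a hung MPI rank
VICTIM=$!
pkill -TERM -x sleep        # kill every process named exactly "sleep"
wait "$VICTIM" 2>/dev/null  # reap the killed process
if kill -0 "$VICTIM" 2>/dev/null; then
  echo "still running"
else
  echo "killed"
fi
```

On a real cluster you would run the `pkill` line on every node, for example `ssh $node pkill -TERM -x my_mpi_app` for each entry in your hostfile (where `my_mpi_app` is a placeholder for your binary's name); stray `orted` daemons may need the same treatment.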
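To confirm that the pids reported as "in use" really belong to leftover job processes, `ps` can show the owner of each pid on the node that printed the message. A sketch assuming nothing beyond standard `ps`; the `check_pid` helper is illustrative, not an Open MPI command.

```shell
# Hypothetical diagnostic sketch: report which command currently owns a
# pid that ompi-restart claims is "in use".
check_pid() {
  if ps -p "$1" > /dev/null 2>&1; then
    echo "pid $1 in use by: $(ps -o comm= -p "$1")"
  else
    echo "pid $1 is free"
  fi
}
check_pid "$$"   # demo on this shell's own pid, which is always in use
```

On the cluster you would run this for each pid from the error output (4813, 4824, ...) on the node that reported it; if the owners are your application processes, killing them frees the pids the restart needs.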