Re: restart failed:Device or resource busy,found pid 4818 in use

From: Leonardo Fialho (leonardofialho_at_gmail_dot_com)
Date: Fri Mar 26 2010 - 08:32:42 PDT

  • Next message: Josh Hursey: "Re: checkpoint hangs when using in clusters"
    fengguang,
    
    No, it is probably not a bug. As Paul said, Open MPI tries to send the termination signal to the remote processes, and normally they finish correctly. My suggestion is to checkpoint the application with the same ompi-checkpoint command, but with the termination option activated. Run ompi-checkpoint --help to see the correct syntax. Under these circumstances you can then try to restore the processes from the checkpoint, and everything will probably run well.
    
    Leonardo
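
The checkpoint-then-restart sequence suggested above could be scripted roughly as follows. This is only a sketch: it assumes the termination option is spelled `--term` (check `ompi-checkpoint --help`, as suggested), and the mpirun PID and snapshot handle shown are hypothetical placeholders. The helper function names are made up for illustration. The code only builds and prints the command lines, it does not run them.

```python
def checkpoint_and_terminate_cmd(mpirun_pid):
    """Build an ompi-checkpoint command that checkpoints the job and then stops it.

    Assumption: the termination option is spelled --term on your installation;
    confirm with `ompi-checkpoint --help`.
    """
    return ["ompi-checkpoint", "--term", str(mpirun_pid)]


def restart_cmd(snapshot_handle):
    """Build an ompi-restart command for a global snapshot handle."""
    return ["ompi-restart", snapshot_handle]


if __name__ == "__main__":
    # Hypothetical values: replace with the PID of your mpirun process and
    # the snapshot handle that ompi-checkpoint reports.
    print(" ".join(checkpoint_and_terminate_cmd(12345)))
    print(" ".join(restart_cmd("ompi_global_snapshot_12345.ckpt")))
```

Because the job is terminated as part of the checkpoint, no processes from the original run should be left holding the PIDs that the restart needs.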
    
    On Mar 26, 2010, at 1:15 AM, fengguang tian wrote:
    
    > Hi, Paul
    > So is that a bug? What can I do to avoid this error, the "bad state", so that the checkpoint completes successfully?
    > 
    > cheers
    > fengguang
    > 
    > On Fri, Mar 26, 2010 at 12:07 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > fengguang,
    > 
    > I believe that most MPI implementations will TRY to "do the right thing" if signaled with a SIGTERM or SIGINT (SIGTERM is the default for the kill command).  However, they cannot always do so if things are in a bad state such as hung processes.  They also cannot do so if you send SIGKILL, which does not allow mpirun any opportunity to kill the application processes.
    > 
    > In your case you indicated in a previous email that ompi-checkpoint hangs for you.  That is probably a good indication that the MPI job is in the sort of "bad state" I warned about above.  So, you might need to manually kill the MPI application processes on all the nodes now.  It is possible that Open MPI may include a command to assist in that, but if so I don't know what it is.
    > 
    > -Paul
    > 
    > 
    > fengguang tian wrote:
    > I have killed the original MPI job manually on the master node using the kill command, and then I restarted the job, so that couldn't be the reason.
    > 
    > Or do I need to kill the processes on both the master and slave nodes?
    > 
    > cheers
    > fengguang
    > 
    > On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > 
    >    The message says that there are some pids (process IDs) in use
    >    (allocated to running processes) that are needed for the restart.
    >    This typically happens if one tries to restart when the original
    >    run has not yet exited, for instance if there are portions of it hung.
    > 
    >    With very large clusters it becomes a statistically significant
    >    possibility that one could have a few random collisions with other
    >    processes on the nodes.
    >    However, given the number and grouping of the pids, I strongly suspect
    >    the original MPI job is still running or is hung.
    > 
    >    -Paul
    > 
    > 
    >    fengguang tian wrote:
    > 
    >        Hi
    > 
    >        When I use ompi-restart to restart from the checkpoint file on a
    >        cluster (using Open MPI), an error occurs. It shows:
    >        - found pid 4813 in use
    >        - found pid 4824 in use
    >        - found pid 4827 in use
    >        Restart failed: Device or resource busy
    >        - found pid 4812 in use
    >        - found pid 4822 in use
    >        - found pid 4823 in use
    >        Restart failed: Device or resource busy
    >        - found pid 4815 in use
    >        - found pid 4828 in use
    >        - found pid 4829 in use
    >        Restart failed: Device or resource busy
    >        - found pid 4818 in use
    >        - found pid 4819 in use
    >        Restart failed: Device or resource busy
    >        - found pid 4814 in use
    >        - found pid 4825 in use
    >        - found pid 4826 in use
    >        Restart failed: Device or resource busy
    > 
    > 
    >        Why would this happen?
    > 
    >        cheers
    >        fengguang
    > 
    > 
    > 
    >    -- 
    >    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >    Future Technologies Group                 Tel: +1-510-495-2352
    >    HPC Research Department                   Fax: +1-510-486-6900
    >    Lawrence Berkeley National Laboratory    
    > 
    > 
    > 
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory     
    > 
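
Paul's explanation of the "found pid N in use" lines can be checked by hand before retrying a restart. The sketch below pulls the PIDs out of output shaped like the log quoted above and asks the kernel whether each one is still allocated to a running process. It assumes a POSIX system. The function names are made up for illustration. PIDs are local to each node, so it would have to be run on every node that reported a failure.

```python
import os
import re

# Matches lines such as "- found pid 4813 in use" from the quoted restart output.
PID_IN_USE = re.compile(r"found pid (\d+) in use")


def pids_reported_in_use(restart_output):
    """Return the PIDs named in 'found pid N in use' lines, in order of appearance."""
    return [int(m.group(1)) for m in PID_IN_USE.finditer(restart_output)]


def pid_in_use(pid):
    """Return True if a process with this PID currently exists on this node.

    POSIX only: signal 0 performs the existence and permission checks
    without delivering any signal to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process, so the PID is free
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    sample = "- found pid 4813 in use\n- found pid 4824 in use\n"
    for pid in pids_reported_in_use(sample):
        state = "still in use" if pid_in_use(pid) else "free"
        print(pid, state)
```

If the PIDs are still in use and belong to the original job, those leftover processes need to be killed on that node before ompi-restart can reclaim the PIDs.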
    
