Re: How to solve this problem?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 14 2008 - 13:32:32 PST

  • Next message: Paul H. Hargrove: "Announcing the release of BLCR 0.6.2"
    王磊 wrote:
    > Dear Sir,
    > I have a problem when I try to restart my program.
    > I use the pipe mechanism you recommended: I fork a child that requests
    > the restart after the parent process exits.
    > In the child process, I call system("cr_restart filename") to restart
    > my program, but it reports "Restart failed: Device or resource busy".
    > In /var/log/messages or dmesg, it shows the following (I tried several
    > times):
    > [314896.808000] cr_rstrt_child [16060]:  PID conflict found by cr_reserve_ids()
    > [315044.720000] cr_rstrt_child [16136]:  PID conflict found by cr_reserve_ids()
    > [315771.344000] cr_rstrt_child [16320]:  PID conflict found by cr_reserve_ids()
    > [316017.984000] cr_rstrt_child [16469]:  PID conflict found by cr_reserve_ids()
    > I am sure that the parent process that made the checkpoint has exited.
    > So I think some other process may still be running, but I cannot tell
    > which one or why.
    > Thank you very much for your help.
    >
    > Regards
    >
    > Daniel
    
    Daniel,
    
      When I sent you the "restart_self()" code before, I mentioned that I 
    had not actually tested it.  When I went to test it today in response to 
    your e-mail, I encountered the same problem.  I apologize for not having 
    tested my suggestion earlier.
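      The reason that you are getting the PID conflict is that the PID of
    the original process is still in use as the PGID of its child: the
    child inherits its parent's process group, so that ID stays allocated
    even after the parent exits, and cr_restart cannot reclaim it.  The fix
    for this is quite simple: insert a call to "setpgid(0,0)" in the child
    process before invoking cr_restart, which moves the child into a new
    process group whose PGID is its own PID.  If that does not resolve your
    problem, let me know and I'll see what else I can do to help you.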
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
