From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jan 14 2008 - 13:32:32 PST
王磊 wrote:
> Dear Sir,
> I have a problem when I want to restart my program.
> I use the pipe mechanism you recommended.I fork a child to request the
> restart(after the parent process exits).
> In the child process, I call system("cr_restart filename") to restart
> my program,but it tells "Restart failed: Device or resource busy".
> In /var/log/messages or dmesg,it shows(I try several times):
> [314896.808000] cr_rstrt_child [16060]: PID conflict found by cr_reserve_ids()
> [315044.720000] cr_rstrt_child [16136]: PID conflict found by cr_reserve_ids()
> [315771.344000] cr_rstrt_child [16320]: PID conflict found by cr_reserve_ids()
> [316017.984000] cr_rstrt_child [16469]: PID conflict found by cr_reserve_ids()
> I can sure that the parent process which made the checkpoint is exited.
> So,I think some other processes may still run,but I can not tell why?
> Thank you very much for your help.
>
> Regards
>
> Daniel

Daniel,

When I sent you the "restart_self()" code before, I mentioned that I
had not actually tested it. When I went to test it today in response to
your e-mail, I encountered the same problem. I apologize for not having
tested my suggestion earlier.

The reason that you are getting the PID conflict is that the PID of the
original process is still in use as the PGID of its child. The fix for
this is quite simple: insert a call to "setpgid(0,0)" in the child
process before invoking cr_restart.

If that does not resolve your problem, let me know and I'll see what
else I can do to help you.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
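The fix described in the reply can be sketched roughly as follows. This is a minimal, untested-against-BLCR illustration, not the original "restart_self()" code, which is not shown in this message. The function names leave_parent_pgrp() and the shape of restart_self() here are hypothetical, and execlp() is swapped in for system() so that the file name needs no shell quoting. Only the setpgid(0,0) call and the cr_restart command come from the message itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: put the caller in its own process group so that it
 * stops holding the original process's PID as its PGID.
 * Returns 0 on success, -1 on failure. */
int leave_parent_pgrp(void)
{
    if (setpgid(0, 0) == -1) {
        perror("setpgid");
        return -1;
    }
    return 0;
}

/* Hypothetical reconstruction of a restart_self()-style helper.
 * On success the calling (parent) process exits and does not return;
 * returns -1 if pipe() or fork() fails. */
int restart_self(const char *context_file)
{
    int fds[2];
    pid_t pid;

    assert(context_file != NULL);

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        char c;
        ssize_t n;

        close(fds[1]);                 /* the child only reads */

        /* The fix: leave the parent's process group before restarting. */
        if (leave_parent_pgrp() == -1)
            _exit(1);

        /* Block until EOF: the write end closes when the parent exits. */
        do {
            n = read(fds[0], &c, 1);
        } while (n > 0 || (n == -1 && errno == EINTR));
        close(fds[0]);

        execlp("cr_restart", "cr_restart", context_file, (char *)NULL);
        perror("execlp cr_restart");   /* reached only if exec failed */
        _exit(127);
    }

    close(fds[0]);                     /* the parent keeps only the write end */
    exit(0);                           /* exiting closes it and wakes the child */
}
```

A program would call restart_self("filename") at the point where it wants to be replaced by its restarted checkpoint, with "filename" standing for the checkpoint file it wrote earlier. The setpgid(0,0) call is placed before the wait on the pipe so that the child has already left the old process group by the time the parent exits.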