From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Dec 11 2009 - 16:06:00 PST
Leonardo,

A call to cr_init() from any given thread remains valid for that thread "forever", including across restarts. Repeated calls should NOT each open a new connection to /proc/checkpoint/ctrl.

For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl. So for each such call you will need to ensure that you eventually make the matching cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()). Failing to call cr_reap_restart() leaks the internal connection. I believe this is why your application is accumulating hundreds of these connections.

I am not certain I entirely understood the "My question is" part of your email. If I have not addressed your concern, please ask again and we'll try to answer.

-Paul

Leonardo Fialho wrote:
> Hi,
>
> I really don't know if it is a bug or whatever, but I'll describe the problem in a few words.
>
> I wrote a small application which creates two threads, one for checkpointing and another to insert faults. The main code forks a matrix multiplication program which is the target of both threads.
>
> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs. ps shows cr_restart only as a defunct program.
>
> I changed my application to use the BLCR API. The problem persists. Using lsof, I saw that I had made a mistake during the recovery: before each cr_request_restart I called cr_init. It means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl. And after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
>
> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifecycle? If so, it is a big problem for long-running applications.
>
> Thanks,
> Leonardo Fialho
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
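
For anyone who wants to check for this kind of leak from inside the process rather than with lsof, here is a minimal sketch, assuming Linux with /proc mounted. It does not call BLCR at all: fake_request() and fake_reap() are hypothetical stand-ins that open and close /dev/null, playing the role of the internal /proc/checkpoint/ctrl connection that is created per restart request and released by the reap. The helper names count_open_fds() and fd_growth() are likewise made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Count the descriptors currently open in this process by listing
 * /proc/self/fd (Linux-specific). Returns -1 if it cannot be read. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int self = dirfd(dir);          /* descriptor used for the listing itself */
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;               /* skip "." and ".." */
        if (atoi(entry->d_name) == self)
            continue;               /* do not count our own listing */
        count++;
    }
    closedir(dir);
    return count;
}

/* Hypothetical stand-in for a restart request: holds one descriptor,
 * as the internal control connection would. Not a BLCR call. */
static int fake_request(void)
{
    return open("/dev/null", O_RDONLY);
}

/* Hypothetical stand-in for the reap step: releases that descriptor. */
static int fake_reap(int handle)
{
    return close(handle);
}

/* Run n request cycles, with or without the reap step, and return how
 * many more descriptors are open afterwards than before. */
static int fd_growth(int n, int reap)
{
    assert(n >= 0);
    int before = count_open_fds();
    for (int i = 0; i < n; i++) {
        int handle = fake_request();
        if (handle >= 0 && reap)
            fake_reap(handle);
    }
    return count_open_fds() - before;
}
```

With the reap step the descriptor count stays flat; without it the count grows by one per cycle until the per-process limit is reached (commonly 1024 by default on Linux, which would fit the number guessed in the original report). Logging count_open_fds() after every restart in the real application should show whether connections are still being leaked.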