Re: /proc/checkpoint/ctrl limit?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Dec 11 2009 - 16:06:00 PST

Next message: colin hu: "dimmer"

Previous message: Leonardo Fialho: "/proc/checkpoint/ctrl limit?"
In reply to: Leonardo Fialho: "/proc/checkpoint/ctrl limit?"
Next in thread: Leonardo Fialho: "Re: /proc/checkpoint/ctrl limit?"
Reply: Leonardo Fialho: "Re: /proc/checkpoint/ctrl limit?"

Leonardo,

  A call to cr_init() from any given thread is valid for that thread 
"forever" including across restarts, but should NOT open a new 
connection to /proc/checkpoint/ctrl for each one.
  For each cr_request_restart() call there is an internal connection to 
/proc/checkpoint/ctrl.  So for each such call you will need to ensure 
you eventually make the matching cr_reap_restart() call (perhaps 
indirectly via cr_poll_restart() or cr_poll_restart_msg()).  Failing to 
call cr_reap_restart() will leak the internal connection.  I believe 
this is why your application is accumulating hundreds of these 
connections.
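
  To illustrate the leak pattern in isolation, here is a minimal sketch 
in C.  It does NOT use the BLCR API: demo_request() and demo_reap() are 
hypothetical stand-ins for the request/reap pair, and /dev/null stands 
in for /proc/checkpoint/ctrl.  The only point is that every "request" 
that opens a descriptor needs a matching "reap" that closes it.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for a restart handle that owns one descriptor. */
typedef struct { int fd; } demo_handle_t;

/* Count this process's open descriptors by probing each slot. */
int count_open_fds(void)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0 || max > 4096)
        max = 4096;                 /* cap the scan; enough for this demo */
    int n = 0;
    for (int fd = 0; fd < max; fd++)
        if (fcntl(fd, F_GETFD) != -1)
            n++;
    return n;
}

/* Stand-in for the "request" step: opens one internal connection. */
int demo_request(demo_handle_t *h)
{
    h->fd = open("/dev/null", O_RDONLY);
    return h->fd < 0 ? -1 : 0;
}

/* Stand-in for the "reap" step: releases the internal connection. */
int demo_reap(demo_handle_t *h)
{
    if (h->fd < 0)
        return -1;
    int rc = close(h->fd);
    h->fd = -1;
    return rc;
}

/* Run `rounds` request cycles and return how many descriptors remain
 * open afterwards, relative to the count before the loop. */
int run_cycles(int rounds, int reap)
{
    int before = count_open_fds();
    for (int i = 0; i < rounds; i++) {
        demo_handle_t h;
        if (demo_request(&h) != 0)
            break;                  /* out of descriptors: the "hang" case */
        if (reap)
            demo_reap(&h);
    }
    return count_open_fds() - before;
}
```

  With reaping enabled, run_cycles() leaves the descriptor count 
unchanged.  With reaping disabled, each cycle leaves one more descriptor 
open until the per-process limit is reached, which matches the growth 
you saw with lsof.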

  I am not certain I entirely understood the "My question is" part of 
your email.  If I have not addressed your concern, please ask again and 
we'll try to answer.

-Paul

Leonardo Fialho wrote:
> Hi,
>
> I really don't know if it is a bug or whatever, but I'll describe the problem in a few words.
>
> I wrote a small application which creates two threads, one for checkpointing and another to insert faults. The main code forks a matrix multiplication program which is the target of both threads.
>
> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by threads). After *some faults* and restarts the application simply hangs, and ps shows only cr_restart as a defunct process.
>
> I changed my application to use the BLCR API. The problem persists. Using lsof, I saw that I had made a mistake during the recovery: I called cr_init before each cr_request_restart. It means that after 500 restarts I had 500 open /proc/checkpoint/ctrl connections, and after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
>
> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifetime? If so, it is a big problem for long-running applications.
>
> Thanks,
> Leonardo Fialho
>   

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
