Re: /proc/checkpoint/ctrl limit?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Dec 11 2009 - 16:06:00 PST

      A call to cr_init() from any given thread is valid for that thread 
    "forever", including across restarts.  Calling it again should NOT open 
    a new connection to /proc/checkpoint/ctrl for each call.
      For each cr_request_restart() call there is an internal connection to 
    /proc/checkpoint/ctrl.  So for each such call you will need to ensure 
    you eventually do the cr_reap_restart() call (perhaps indirectly via 
    cr_poll_restart() or cr_poll_restart_msg()).  Failure to 
    cr_reap_restart() will result in leaking the internal connection.  I 
    believe this is why your application is accumulating hundreds of these 
    connections.
      I am not certain I entirely understood the "My question is" part of 
    your email.  If I have not addressed your concern, please ask again and 
    we'll try to answer.
    Leonardo Fialho wrote:
    > Hi,
    > I really don't know if it is a bug or whatever, but I'll describe the problem in a few words.
    > I wrote a small application which creates two threads, one for checkpointing and another to insert faults. The main code forks a matrix multiplication program which is the target of both threads.
    > My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs, and ps shows only cr_restart as a defunct process.
    > I changed my application to use the BLCR API. The problem persists. Then, using lsof, I saw that I had made a mistake during the recovery: before each cr_request_restart I called cr_init. It means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl, and after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
    > My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifecycle? If so, it is a big problem for long-running applications.
    > Thanks,
    > Leonardo Fialho
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
