Re: /proc/checkpoint/ctrl limit?

From: Leonardo Fialho (leonardofialho_at_gmail_dot_com)
Date: Sat Dec 12 2009 - 12:01:44 PST

  • Next message: Alan Woodland: "Re: BLCR and kernel 2.6.32"
    Thanks Paul! The problem in my application was the missing cr_reap_restart() call. Now it is working... but not completely.
    
    During execution, the application sometimes terminates with signal 7 (Bus error). Analyzing the core file, I found this:
    
    Program terminated with signal 7, Bus error.
    #0  cr_wait_restart (handle=<value optimized out>, timeout=<value optimized out>) at cr_request.c:77
    77	    FD_SET(fd, &rfds);
    (gdb) 
    
    It appears to be an error while assigning or using r_args.cr_fd. My code is:
    
        cr_init();
        cr_request_restart(&r_args, &r_handle);
        close(r_args.cr_fd);
        cr_wait_restart(&r_handle, NULL);
        cr_reap_restart(&r_handle);
    
    I wrote this code using the cr_restart tool as a reference.
    
    This code normally works, but at some point during execution it produces the Bus error. How can I avoid it? When is it safe to close this fd?
    
    Thanks,
    Leonardo Fialho
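
A side note on the crash site shown in the backtrace: POSIX leaves FD_SET undefined when the descriptor is negative or not less than FD_SETSIZE. One possible cause of a fault at that line, not confirmed anywhere in this thread, is therefore a descriptor outside that range, which can happen once many descriptors have leaked. The sketch below is not BLCR code; checked_fd_set is a hypothetical helper that only illustrates the range check a caller of select() can perform.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper (not part of BLCR): add fd to *set only when
 * FD_SET is defined for it, i.e. 0 <= fd < FD_SETSIZE.
 * Returns 0 on success, -1 if fd is out of range. */
static int checked_fd_set(int fd, fd_set *set)
{
    assert(set != NULL);            /* caller must pass a valid set */
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;                  /* FD_SET would be undefined here */
    FD_SET(fd, set);
    return 0;
}
```

A caller would FD_ZERO the set, call checked_fd_set for each descriptor, and treat a -1 return as a reason to fall back to poll() or to report the descriptor number instead of calling select().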
    
    On Dec 12, 2009, at 1:06 AM, Paul H. Hargrove wrote:
    
    > Leonardo,
    > 
    > A call to cr_init() from any given thread is valid for that thread "forever" including across restarts, but should NOT open a new connection to /proc/checkpoint/ctrl for each one.
    > For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl.  So for each such call you will need to ensure you eventually do the cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()).  Failure to cr_reap_restart() will result in leaking the internal connection.  I believe this is why your application is accumulating hundreds of these connections.
    > 
    > I am not certain I entirely understood the "My question is" part of your email.  If I have not addressed your concern, please ask again and we'll try to answer.
    > 
    > -Paul
    > 
    > Leonardo Fialho wrote:
    >> Hi,
    >> 
    >> I really don't know if it is a bug or not, but I'll describe the problem in a few words.
    >> 
    >> I wrote a small application that creates two threads, one for checkpointing and another to inject faults. The main code forks a matrix multiplication program, which is the target of both threads.
    >> 
    >> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs, and ps shows cr_restart only as a defunct process.
    >> 
    >> I changed my application to use the BLCR API, but the problem persisted. Using lsof, I then saw that I had made a mistake during recovery: I called cr_init before each cr_request_restart. That means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl, and after some number of connections (1024?) the application hangs again. I changed my code and now it appears to run quite well.
    >> 
    >> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process remain open for the whole process lifetime? If so, it is a big problem for long-running applications.
    >> 
    >> Thanks,
    >> Leonardo Fialho
    >>  
    > 
    > 
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory     
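
The original report found the accumulating /proc/checkpoint/ctrl connections with lsof. A rough in-process equivalent is sketched below, under the assumption that counting open descriptors before and after each restart cycle is enough to spot a leak. Nothing here is part of the BLCR API: count_open_fds and assert_no_fd_leak are hypothetical names, and the cap of 65536 probed descriptors is an arbitrary choice for the sketch.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Count the descriptors currently open in this process by probing each
 * possible number with fcntl(F_GETFD). Returns -1 if the limit cannot
 * be read. Probing is capped at 65536 (arbitrary) to bound the loop. */
static long count_open_fds(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;

    long limit = rl.rlim_cur > 65536 ? 65536 : (long)rl.rlim_cur;
    long open_count = 0;
    for (long fd = 0; fd < limit; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            open_count++;           /* fd refers to an open descriptor */
    }
    return open_count;
}

/* Debug-time check: abort if the process now holds more descriptors
 * than it did when the baseline was taken. */
static void assert_no_fd_leak(long baseline)
{
    assert(count_open_fds() <= baseline);
}
```

One way to use it would be to record count_open_fds() once before the first restart request, then call assert_no_fd_leak with that baseline after each cr_reap_restart(); a count that grows by one per cycle would point at a connection that is never released.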
    
