Re: /proc/checkpoint/ctrl limit?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 26 2010 - 16:10:44 PST

  • Next message: Anton Starikov: "possible BLCR bug | Fwd: NFS bug with 2.6.18-164.11.1.el5 kernel"
    Leonardo,
    
    I am very sorry I have not followed up with you on this problem you were 
    having.
    Since I have not heard otherwise, I am going to assume you are still 
    having problems.
    
    Could you please enter a bug report at https://upc-bugs.lbl.gov/bugzilla 
    and I'll follow up there.
    Use of bugzilla will make tracking the problem easier than via email 
    (making it less likely that I'll forget about it as happened with your 
    email).
    
    If it is possible, please attach to the bug report the full source code 
    for the problem so I can try to reproduce it for myself.
    If you would rather not place the entire source code in a public place like 
    our bugzilla, then you can email it to me directly at PHHargrove_at_lbl_dot_gov 
    instead of replying to this email (which goes to a publicly archived list).
    
    Thanks (and again my apologies for not having responded any sooner),
    -Paul
    
    Leonardo Fialho wrote:
    > Thanks Paul! The problem in my application was the lack of the cr_reap_restart(). Now it is working... not at all.
    >
    > During the execution, sometimes the application finishes with signal 7, Bus error. Analyzing the core file, I found this:
    >
    > Program terminated with signal 7, Bus error.
    > #0  cr_wait_restart (handle=<value optimized out>, timeout=<value optimized out>) at cr_request.c:77
    > 77	    FD_SET(fd, &rfds);
    > (gdb) 
    >
    > It appears to be an error while trying to assign or use the args.cr_fd. My code is:
    >
    >     cr_init();
    >     cr_request_restart(&r_args, &r_handle);
    >     close(r_args.cr_fd);
    >     cr_wait_restart(&r_handle, NULL);
    >     cr_reap_restart(&r_handle);
    >
    > I wrote this code using the cr_restart tool as reference.
    >
    > This code normally works, but at some point during execution it generates the Bus error. How can I avoid it? When is it safe to close this fd?
    >
    > Thanks,
    > Leonardo Fialho
    >
    > On Dec 12, 2009, at 1:06 AM, Paul H. Hargrove wrote:
    >
    >   
    >> Leonardo,
    >>
    >> A call to cr_init() from any given thread is valid for that thread "forever" including across restarts, but should NOT open a new connection to /proc/checkpoint/ctrl for each one.
    >> For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl.  So for each such call you will need to ensure you eventually do the cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()).  Failure to call cr_reap_restart() will result in leaking the internal connection.  I believe this is why your application is accumulating hundreds of these connections.
    >>
    >> I am not certain I entirely understood the "My question is" part of your email.  If I have not addressed your concern, please ask again and we'll try to answer.
    >>
    >> -Paul
    >>
    >> Leonardo Fialho wrote:
    >>     
    >>> Hi,
    >>>
    >>> I really don't know if this is a bug or not, but I'll describe the problem in a few words.
    >>>
    >>> I wrote a small application which creates two threads, one for checkpointing and another to inject faults. The main code forks a matrix multiplication program which is the target of both threads.
    >>>
    >>> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs, and ps shows cr_restart only as a defunct process.
    >>>
    >>> I changed my application to use the BLCR API, but the problem persists. Using lsof, I saw that I had made a mistake during recovery: I called cr_init before each cr_request_restart. That means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl, and after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
    >>>
    >>> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifecycle? If so, it is a big problem for long-running applications.
    >>>
    >>> Thanks,
    >>> Leonardo Fialho
    >>>  
    >>>       
    >> -- 
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group                 Tel: +1-510-495-2352
    >> HPC Research Department                   Fax: +1-510-486-6900
    >> Lawrence Berkeley National Laboratory     
    >>     
    >
    >
    >   
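The quoted backtrace stops at FD_SET(fd, &rfds), and the thread does not diagnose why. One possibility worth ruling out, offered here only as an assumption, is a descriptor outside the range an fd_set can hold: POSIX leaves FD_SET undefined when fd is negative or not less than FD_SETSIZE, and a process that has leaked many connections could plausibly reach such a value. The helper below is a minimal sketch of a range check; the name fd_set_checked is hypothetical and is not part of BLCR.

```c
#include <sys/select.h>

/* Hypothetical helper (not part of BLCR): add fd to set only when it
 * lies in the range an fd_set can represent.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE. */
static int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;          /* FD_SET would be undefined behavior here */
    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back can report the bad descriptor instead of crashing, which at least separates an invalid or out-of-range fd from other causes of the fault.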
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
