Re: /proc/checkpoint/ctrl limit?


From: Leonardo Fialho (leonardofialho_at_gmail_dot_com)
Date: Sat Dec 12 2009 - 12:01:44 PST

Next message: Alan Woodland: "Re: BLCR and kernel 2.6.32"

Previous message: colin hu: "dimmer"
In reply to: Paul H. Hargrove: "Re: /proc/checkpoint/ctrl limit?"
Next in thread: Paul H. Hargrove: "Re: /proc/checkpoint/ctrl limit?"
Reply: Paul H. Hargrove: "Re: /proc/checkpoint/ctrl limit?"

Thanks Paul! The problem in my application was the missing cr_reap_restart() call. Now it is working... but not completely.

During execution, the application sometimes terminates with signal 7 (bus error). Analyzing the core file, I found this:

Program terminated with signal 7, Bus error.
#0  cr_wait_restart (handle=<value optimized out>, timeout=<value optimized out>) at cr_request.c:77
77	    FD_SET(fd, &rfds);
(gdb) 
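
Since the trace stops on the FD_SET line, one thing worth ruling out is the descriptor value itself: POSIX leaves FD_SET undefined when the descriptor is negative or not less than FD_SETSIZE (typically 1024 with glibc on Linux). Whether that is the cause here is not established in this thread. A minimal sketch of a range check, where fd_set_checked is a hypothetical helper and not part of BLCR:

```c
#include <stdio.h>
#include <sys/select.h>

/* Hypothetical helper (not part of BLCR): add fd to the set only if it is
 * inside the range that FD_SET can represent.
 * Returns 0 on success, -1 if fd is out of range. */
static int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE) {
        fprintf(stderr, "fd %d is outside [0, %d)\n", fd, FD_SETSIZE);
        return -1;
    }
    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back could log the descriptor value instead of crashing, which would show whether descriptor numbers are growing over many restarts.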

The error appears to come from assigning or using r_args.cr_fd. My code is:

    cr_init();
    cr_request_restart(&r_args, &r_handle);
    close(r_args.cr_fd);
    cr_wait_restart(&r_handle, NULL);
    cr_reap_restart(&r_handle);

I wrote this code using the cr_restart tool as a reference.

This code normally works, but at some point during execution it triggers the bus error. How can I avoid it? When is it safe to close this fd?

Thanks,
Leonardo Fialho

On Dec 12, 2009, at 1:06 AM, Paul H. Hargrove wrote:

> Leonardo,
> 
> A call to cr_init() from any given thread is valid for that thread "forever" including across restarts, but should NOT open a new connection to /proc/checkpoint/ctrl for each one.
> For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl.  So for each such call you will need to ensure you eventually do the cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()).  Failure to cr_reap_restart() will result in leaking the internal connection.  I believe this is why your application is accumulating hundreds of these connections.
> 
> I am not certain I entirely understood the "My question is" part of your email.  If I have not addressed your concern, please ask again and we'll try to answer.
> 
> -Paul
> 
> Leonardo Fialho wrote:
>> Hi,
>> 
>> I really don't know if it is a bug or whatever, but I'll describe the problem in a few words.
>> 
>> I wrote a small application that creates two threads, one for checkpointing and another to inject faults. The main code forks a matrix multiplication program, which is the target of both threads.
>> 
>> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs, and ps shows cr_restart only as a defunct process.
>> 
>> I changed my application to use the BLCR API. The problem persisted. Then, using lsof, I saw that I had made a mistake during recovery: I called cr_init before each cr_request_restart. That means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl, and after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
>> 
>> My question is: when cr_restart is forked by the main application, does the cr_init called by the forked process stay open for the lifetime of the process? If so, it is a big problem for long-running applications.
>> 
>> Thanks,
>> Leonardo Fialho
>>  
> 
> 
> -- 
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group                 Tel: +1-510-495-2352
> HPC Research Department                   Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
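
The leak described in the quoted messages was found with lsof. The same check can be done from inside the process, which may be handy in a loop of many restarts. A minimal sketch, assuming a POSIX system; count_open_fds is a hypothetical helper and not part of BLCR, and the 65536 cap is an arbitrary choice to keep the scan cheap:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>

/* Hypothetical helper (not part of BLCR): count the descriptors currently
 * open in this process by probing each slot below the soft RLIMIT_NOFILE
 * limit. Returns -1 if the limit cannot be read. */
static long count_open_fds(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    /* Cap the scan so an unlimited or very large soft limit stays cheap. */
    rlim_t top = lim.rlim_cur;
    if (top == RLIM_INFINITY || top > 65536)
        top = 65536;

    long n = 0;
    for (rlim_t fd = 0; fd < top; fd++) {
        /* F_GETFD fails with EBADF only when the slot is not open. */
        if (fcntl((int)fd, F_GETFD) != -1 || errno != EBADF)
            n++;
    }
    return n;
}
```

Comparing the value before and after each request/reap cycle would show whether a connection is being left open on every restart.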
