From: Leonardo Fialho (leonardofialho_at_gmail_dot_com)
Date: Sat Dec 12 2009 - 12:01:44 PST
Thanks Paul!

The problem in my application was the missing cr_reap_restart() call. Now it is working... but not completely. During execution, the application sometimes terminates with signal 7, Bus error. Analyzing the core file, I found this:

    Program terminated with signal 7, Bus error.
    #0  cr_wait_restart (handle=<value optimized out>, timeout=<value optimized out>) at cr_request.c:77
    77          FD_SET(fd, &rfds);
    (gdb)

It appears to be an error while trying to assign or use the args.cr_fd. My code is:

    cr_init();
    cr_request_restart(&r_args, &r_handle);
    close(r_args.cr_fd);
    cr_wait_restart(&r_handle, NULL);
    cr_reap_restart(&r_handle);

I wrote this code using the cr_restart tool as a reference. It normally works, but at some point during execution it generates the Bus error. How can I avoid it? When is it safe to close this fd?

Thanks,
Leonardo Fialho

On Dec 12, 2009, at 1:06 AM, Paul H. Hargrove wrote:

> Leonardo,
>
> A call to cr_init() from any given thread is valid for that thread "forever" including across restarts, but should NOT open a new connection to /proc/checkpoint/ctrl for each one.
> For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl. So for each such call you will need to ensure you eventually do the cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()). Failure to cr_reap_restart() will result in leaking the internal connection. I believe this is why your application is accumulating hundreds of these connections.
>
> I am not certain I entirely understood the "My question is" part of your email. If I have not addressed your concern, please ask again and we'll try to answer.
>
> -Paul
>
> Leonardo Fialho wrote:
>> Hi,
>>
>> I really don't know if it is a bug or not, but I'll describe the problem in a few words.
>>
>> I wrote a small application which creates two threads, one for checkpointing and another to insert faults. The main code forks a matrix multiplication program which is the target of both threads.
>>
>> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs. ps shows only cr_restart, as a defunct process.
>>
>> I changed my application to use the BLCR API. The problem persists. Using lsof, I saw that I had made a mistake during the recovery: before each cr_request_restart I called cr_init. This means that after 500 restarts I had 500 open connections to /proc/checkpoint/ctrl, and after some number of connections (1024?) the application hangs again. I changed my code and now it appears to run quite well.
>>
>> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifecycle? If so, it is a big problem for long-running applications.
>>
>> Thanks,
>> Leonardo Fialho
>>
>
>
> --
> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
> Future Technologies Group                 Tel: +1-510-495-2352
> HPC Research Department                   Fax: +1-510-486-6900
> Lawrence Berkeley National Laboratory
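
The descriptor leak discussed in this thread can be illustrated without BLCR. The following is a minimal sketch, assuming that each restart request holds one open descriptor until it is reaped, as the quoted reply describes. fake_request(), fake_reap(), fits_in_fd_set() and highest_fd() are hypothetical helpers, not BLCR functions, and /dev/null stands in for /proc/checkpoint/ctrl. It also encodes the general POSIX rule that FD_SET() is only defined for descriptors below FD_SETSIZE; whether that rule explains the Bus error reported here is not established by the thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

#define MAX_ROUNDS 64

/* Hypothetical stand-in for a restart request: opens one descriptor,
 * the way each cr_request_restart() is described as opening an internal
 * connection. /dev/null is used so the sketch runs without BLCR. */
static int fake_request(void)
{
    return open("/dev/null", O_RDONLY);
}

/* Hypothetical stand-in for cr_reap_restart(): releases the descriptor. */
static void fake_reap(int fd)
{
    if (fd >= 0)
        close(fd);
}

/* FD_SET() is only defined for 0 <= fd < FD_SETSIZE. */
static int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Run `rounds` request cycles and return the highest descriptor number
 * seen. When reap_each_time is 0 the descriptors are held until the
 * loop ends (the leak pattern), then closed so calls stay independent. */
static int highest_fd(int rounds, int reap_each_time)
{
    int pending[MAX_ROUNDS];
    int npending = 0;
    int highest = -1;

    assert(rounds >= 0 && rounds <= MAX_ROUNDS);

    for (int i = 0; i < rounds; i++) {
        int fd = fake_request();
        if (fd < 0)
            break;                  /* no descriptors left */
        if (fd > highest)
            highest = fd;
        if (reap_each_time)
            fake_reap(fd);          /* descriptor number gets reused */
        else
            pending[npending++] = fd;
    }

    for (int i = 0; i < npending; i++)
        fake_reap(pending[i]);

    return highest;
}
```

In a single-threaded test program, highest_fd(50, 1) keeps returning the same low descriptor number because each request is reaped before the next one, while highest_fd(50, 0) climbs by one per round. A caller that passes descriptors to select() can check fits_in_fd_set(fd) before calling FD_SET().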