From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jan 26 2010 - 16:10:44 PST
Leonardo,

I am very sorry I have not followed up with you on this problem you were having. Since I have not heard otherwise, I am going to assume you are still having problems. Could you please enter a bug report at https://upc-bugs.lbl.gov/bugzilla and I'll follow up there. Using bugzilla will make tracking the problem easier than via email (making it less likely that I'll forget about it, as happened with your email).

If possible, please attach to the bug report the full source code for the problem so I can try to reproduce it myself. If you would rather not place the entire source code in a public place like our bugzilla, then you can email it to me directly at PHHargrove_at_lbl_dot_gov instead of replying to this email (which goes to a publicly archived list).

Thanks (and again my apologies for not having responded any sooner),
-Paul

Leonardo Fialho wrote:
> Thanks Paul! The problem in my application was the lack of the cr_reap_restart(). Now it is working... not at all.
>
> During the execution, sometimes the application finishes with signal 7, Bus error. Analyzing the core file, I found this:
>
> Program terminated with signal 7, Bus error.
> #0  cr_wait_restart (handle=<value optimized out>, timeout=<value optimized out>) at cr_request.c:77
> 77          FD_SET(fd, &rfds);
> (gdb)
>
> It appears to be an error while trying to assign or use the args.cr_fd. My code is:
>
>     cr_init();
>     cr_request_restart(&r_args, &r_handle);
>     close(r_args.cr_fd);
>     cr_wait_restart(&r_handle, NULL);
>     cr_reap_restart(&r_handle);
>
> I wrote this code using the cr_restart tool as a reference.
>
> This code normally works, but at some point during execution it generates the Bus error. How can I avoid it? When is it safe to close this fd?
>
> Thanks,
> Leonardo Fialho
>
> On Dec 12, 2009, at 1:06 AM, Paul H. Hargrove wrote:
>
>> Leonardo,
>>
>> A call to cr_init() from any given thread is valid for that thread "forever", including across restarts, but should NOT open a new connection to /proc/checkpoint/ctrl for each one.
>> For each cr_request_restart() call there is an internal connection to /proc/checkpoint/ctrl. So for each such call you will need to ensure you eventually do the cr_reap_restart() call (perhaps indirectly via cr_poll_restart() or cr_poll_restart_msg()). Failure to cr_reap_restart() will result in leaking the internal connection. I believe this is why your application is accumulating hundreds of these connections.
>>
>> I am not certain I entirely understood the "My question is" part of your email. If I have not addressed your concern, please ask again and we'll try to answer.
>>
>> -Paul
>>
>> Leonardo Fialho wrote:
>>
>>> Hi,
>>>
>>> I really don't know if it is a bug or whatever, but I'll describe the problem in a few words.
>>>
>>> I wrote a small application which creates two threads, one for checkpointing and another to insert faults. The main code forks a matrix multiplication program which is the target of both threads.
>>>
>>> My first approach used the cr_run, cr_checkpoint and cr_restart utilities (forked by the threads). After *some faults* and restarts the application simply hangs. The ps output shows only cr_restart, as a defunct program.
>>>
>>> I changed my application to use the BLCR API. The problem persists. Then, using lsof, I saw that I had made a mistake during the recovery. Before each cr_request_restart I had called cr_init. It means that after 500 restarts I had 500 open /proc/checkpoint/ctrl connections. And after some number of connections (1024?) the application hangs again. I changed my code and it now appears to run quite well.
>>>
>>> My question is: when cr_restart is forked by the main application, does the connection opened by cr_init in the forked process stay open for the whole process lifecycle? If so, it is a big problem for long-running applications.
>>>
>>> Thanks,
>>> Leonardo Fialho
>>>
>>
>> --
>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
>> Future Technologies Group                 Tel: +1-510-495-2352
>> HPC Research Department                   Fax: +1-510-486-6900
>> Lawrence Berkeley National Laboratory
>>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group                 Tel: +1-510-495-2352
HPC Research Department                   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
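
The leak described in the quoted Dec 12 reply (each restart request holds an internal connection until it is reaped) can be illustrated with plain POSIX descriptors. The sketch below does not call BLCR at all: fake_request(), fake_reap() and fd_growth() are hypothetical names invented for this illustration, and /dev/null merely stands in for /proc/checkpoint/ctrl. It only shows why unpaired requests make descriptor numbers climb toward the per-process limit, while paired request/reap cycles keep reusing the same number. It is not a diagnosis of the Bus error reported above.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_CYCLES 64

/* Hypothetical stand-in for a restart request: opens one descriptor,
 * the way each request opens an internal connection. Not a BLCR call. */
int fake_request(void)
{
    return open("/dev/null", O_RDONLY);
}

/* Hypothetical stand-in for reaping: releases that descriptor. */
void fake_reap(int fd)
{
    close(fd);
}

/* Run `cycles` request cycles and return how far the descriptor numbers
 * grew (highest fd seen minus first fd seen). When `reap` is nonzero,
 * every request is paired with a reap before the next request. */
int fd_growth(int cycles, int reap)
{
    int leaked[MAX_CYCLES];
    int nleaked = 0, first = -1, highest = -1;

    assert(cycles >= 0 && cycles <= MAX_CYCLES);
    for (int i = 0; i < cycles; i++) {
        int fd = fake_request();
        if (fd < 0)
            break;                  /* descriptor table exhausted */
        if (first < 0)
            first = fd;
        if (fd > highest)
            highest = fd;
        if (reap)
            fake_reap(fd);          /* paired: the fd number is reused */
        else
            leaked[nleaked++] = fd; /* unpaired: the fd stays open */
    }
    /* Close what was left open so the demonstration itself does not leak. */
    for (int i = 0; i < nleaked; i++)
        close(leaked[i]);
    return highest - first;
}
```

In a single-threaded program, fd_growth(10, 1) returns 0 because open() always hands back the lowest free descriptor number, whereas fd_growth(10, 0) returns at least 9 because every cycle consumes a new one.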