From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 21 2008 - 09:53:20 PST
José,

If you only see problems with GlusterFS, then it might be a problem with
GlusterFS, but it might still be a problem with BLCR. I know almost nothing
about GlusterFS, but I did see on their wiki that it is a user-space
filesystem. It is possible that it could interact poorly or strangely with
BLCR, which initiates writes from kernel addresses.

If you are interested in debugging the problem, I will provide what
assistance I can by email.

-Paul

José M. Martín wrote:
> I have done some additional tests.
>
> It only fails on a volume mounted with GlusterFS, a distributed FS. On a
> local drive, it works. So, it must be an issue with this FS.
>
> There are no entries in /var/log/messages or dmesg about the error.
>
> Thanks,
>
> José
>
> On Wednesday 20 February 2008 18:07:30 Paul H. Hargrove wrote:
>> José,
>>
>> Sorry the error reporting isn't very clear. That is one of the weaker
>> parts of BLCR right now.
>> Since the testsuite passes, the most likely reason for the message you
>> see is an actual I/O failure when trying to write out the checkpoint
>> context file for your application. The BLCR code will map (nearly) all
>> failed write() calls to EIO, even if the actual cause was an
>> out-of-space or over-quota error.
>> You might find some useful information in /var/log/messages, or via
>> dmesg, about what BLCR was doing at the time of the error. If you can
>> send us those messages, we may be able to narrow down what the problem is.
>>
>> -Paul
>>
>> P.S.
>> I will ensure the next release of BLCR produces a less confusing error
>> message, such as "cr_checkpoint: checkpoint failed: Input/output
>> error". There really should be no reference to the internal ioctl() call.
>>
>> José M. Martín wrote:
>>> Hello,
>>>
>>> First, thanks for this project.
>>>
>>> I tried to set up blcr, but I have a problem. When I launch a program
>>> and take a checkpoint, I get the following error:
>>> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
>>>
>>> I have tried with kernels 2.6.20 (vanilla) and 2.6.18.8-0.8 (opensuse
>>> 10.2 default) on one node. On both, I get the same error.
>>> Nevertheless, on another node with opensuse 10.2 and kernel 2.6.23.1, it
>>> runs without problems.
>>>
>>> I have passed the testsuite:
>>> ======================
>>> All 34 tests passed
>>> (1 tests were not run)
>>> ======================
>>>
>>> No hugetlbfs mount point found (test skipped)
>>> SKIP: hugetlbfs.ct
>>>
>>> I can load the blcr modules without problems, execute binaries, link
>>> libraries,...
>>>
>>> I'm using version 0.6.4.
>>> Nodes are x86 (Pentium 4).
>>>
>>> Any help will be appreciated.
>>>
>>> Thanks in advance

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
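
Since BLCR maps (nearly) all failed write() calls to EIO, one rough way to
look for an ordinary cause such as out-of-space or over-quota is to attempt a
plain user-space write in the directory where the checkpoint file would go and
report the errno that comes back. The following is only a sketch under
assumptions: the function name probe_write and the scratch-file prefix are
hypothetical and not part of BLCR. A successful probe does not rule out a
BLCR/GlusterFS interaction, because BLCR initiates its writes from kernel
addresses and this test does not.

```python
import errno
import os
import tempfile


def probe_write(directory, nbytes=1 << 20):
    """Write and fsync nbytes to a scratch file in `directory`.

    Returns None on success, or (errno_name, message) for the first
    OSError raised, so the underlying cause (ENOSPC, EDQUOT, ...) is
    visible instead of a generic EIO.
    """
    try:
        # Hypothetical scratch-file prefix; the file is removed on close.
        with tempfile.NamedTemporaryFile(dir=directory,
                                         prefix="ckpt_probe_") as f:
            f.write(b"\0" * nbytes)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the filesystem
    except OSError as e:
        if e.errno is None:
            return ("UNKNOWN", str(e))
        name = errno.errorcode.get(e.errno, str(e.errno))
        return (name, os.strerror(e.errno))
    return None
```

Calling probe_write() on the GlusterFS mount point and on a local directory,
with nbytes set near the expected size of the context file, gives a quick
comparison; a result such as ("ENOSPC", ...) or ("EDQUOT", ...) would point at
the filesystem rather than at BLCR.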