Re: problems with cr_checkpoint: ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error

From: José M. Martín (jmartin_at_onsager.ugr.es)
Date: Fri Feb 22 2008 - 00:27:39 PST

    Yes, please. I would like to solve the problem, but I am not a guru in this
    area. What kind of tests can I run? There are no error messages in the
    GlusterFS log.
    
    
    On Thursday 21 February 2008 18:53:20 Paul H. Hargrove wrote:
    > José,
    >
    >   If you only see problems with GlusterFS, then it might be a problem w/
    > GlusterFS, but it might still be a problem with BLCR.  I know almost
    > nothing about GlusterFS, but did see at their wiki that it is a
    > user-space filesystem.  It is possible that could interact
    > poorly/strangely with BLCR, which initiates writes from kernel addresses.
    >   If you are interested in debugging the problem, I will provide what
    > assistance I can by email.
    >
    > -Paul
    >
    > José M. Martín wrote:
    > > I have done some additional tests.
    > >
    > > It only fails on a volume mounted with GlusterFS, a distributed FS. On a
    > > local drive, it works. So it must be an issue with this FS.
    > >
    > > There are no entries in /var/log/messages and dmesg about the error.
    > >
    > > Thanks,
    > >
    > > José
    > >
    > > On Wednesday 20 February 2008 18:07:30 Paul H. Hargrove wrote:
    > >> José,
    > >>
    > >>   Sorry the error reporting isn't very clear.  That is one of the weaker
    > >> parts of BLCR right now.
    > >>   Since the testsuite passes, the most likely reason for the message you
    > >> see is an actual I/O failure when trying to write out the checkpoint
    > >> context file for your application.  The BLCR code will map (nearly) all
    > >> failed write() calls to EIO, even if the actual cause was an
    > >> out-of-space or over-quota error.
    > >>   You might find some useful information in /var/log/messages, or via
    > >> dmesg, about what BLCR was doing at the time of the error.  If you can
    > >> send us those messages, we may be able to narrow down what the problem
    > >> is.
    > >>
    > >> -Paul
    > >>
    > >> P.S.
    > >> I will ensure the next release of BLCR produces a less confusing error
    > >> message, such as "cr_checkpoint: checkpoint failed: Input/output
    > >> error".  There really should be no reference to the internal ioctl()
    > >> call.
    > >>
    > >> José M. Martín wrote:
    > >>> Hello,
    > >>>
    > >>> first, thanks for this project.
    > >>>
    > >>> I tried to set up BLCR, but I have a problem. When I launch a program
    > >>> and checkpoint it, I get the following error:
    > >>> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
    > >>>
    > >>> I have tried with kernels 2.6.20 (vanilla) and 2.6.18.8-0.8 (opensuse
    > >>> 10.2 default) on a node. On both, I get the same error.
    > >>> However, on another node with opensuse 10.2 and kernel 2.6.23.1, it
    > >>> runs without problems.
    > >>>
    > >>> I have passed the testsuite:
    > >>> ======================
    > >>> All 34 tests passed
    > >>> (1 tests were not run)
    > >>> ======================
    > >>>
    > >>> No hugetlbfs mount point found (test skipped)
    > >>> SKIP: hugetlbfs.ct
    > >>>
    > >>> I can load the BLCR modules without problems, execute binaries, link
    > >>> libraries, etc.
    > >>>
    > >>> I'm using version 0.6.4
    > >>> Nodes are x86 (Pentium 4)
    > >>>
    > >>> Any help will be appreciated.
    > >>>
    > >>> Thanks in advance
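
Since the thread notes that BLCR maps (nearly) all failed write() calls to EIO, one simple first test is to write to the target directory from an ordinary user-space process and look at the real errno. The following is only a minimal sketch and not part of BLCR: the function name `probe_write` and its parameters are hypothetical, chosen for illustration. It exercises user-space writes only, whereas BLCR initiates writes from kernel addresses, so a clean result here does not rule out a BLCR/GlusterFS interaction. It can, however, expose an out-of-space or over-quota condition that BLCR would report as EIO.

```python
import errno
import os
import tempfile


def probe_write(directory, size_bytes=1 << 20, chunk=1 << 16):
    """Write size_bytes of zeros to a temporary file in directory, then fsync.

    Hypothetical helper for illustration only.
    Returns None on success, or the errno name (e.g. 'ENOSPC') on failure.
    """
    try:
        # Create the probe file inside the directory under test.
        fd, path = tempfile.mkstemp(dir=directory, prefix="write_probe_")
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    try:
        block = b"\0" * chunk
        remaining = size_bytes
        while remaining > 0:
            # os.write may write fewer bytes than requested; count what landed.
            written = os.write(fd, block[:min(chunk, remaining)])
            remaining -= written
        # Force the data out so delayed write errors surface here.
        os.fsync(fd)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        # Always clean up the probe file.
        os.close(fd)
        os.unlink(path)
```

Calling `probe_write` once on a directory of the GlusterFS volume and once on a local directory, with `size_bytes` set to roughly the size of the expected checkpoint context file, gives a quick comparison: a result such as 'ENOSPC' or 'EDQUOT' on the GlusterFS side would point to a storage limit rather than to BLCR itself.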
    
