Re: problems with cr_checkpoint: ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP):Input/output error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 21 2008 - 09:53:20 PST

  • Next message: José M. Martín: "Re: problems with cr_checkpoint: ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP):Input/outputerror"
    José,
    
      If you only see problems with GlusterFS, then it might be a problem w/
    GlusterFS, but it might still be a problem with BLCR.  I know almost
    nothing about GlusterFS, but I did see on their wiki that it is a
    user-space filesystem.  It is possible that this could interact
    poorly/strangely with BLCR, which initiates writes from kernel addresses.
      If you are interested in debugging the problem, I will provide what
    assistance I can by email.
    
    -Paul
    
    José M. Martín wrote:
    > I have done some additional tests.
    > 
    > It only fails on a volume mounted with GlusterFS, a distributed FS. On a local
    > drive, it works. So it must be an issue with this FS.
    > 
    > There are no entries in /var/log/messages and dmesg about the error.
    > 
    > Thanks,
    > 
    > José
    > 
    > 
    > 
    > 
    > El Wednesday 20 February 2008 18:07:30 Paul H. Hargrove escribió:
    >> José,
    >>
    >>   Sorry the error reporting isn't very clear.  That is one of the weaker
    >> parts of BLCR right now.
    >>   Since the testsuite passes, the most likely reason for the message you
    >> see is an actual I/O failure when trying to write out the checkpoint
    >> context file for your application.  The BLCR code will map (nearly) all
    >> failed write() calls to EIO, even if the actual cause was an
    >> out-of-space or over-quota error.
    >>   You might find some useful information in /var/log/messages, or via
    >> dmesg, about what BLCR was doing at the time of the error.  If you can
    >> send us those messages, we may be able to narrow down what the problem is.
    >>
    >> -Paul
    >>
    >> P.S.
    >> I will ensure the next release of BLCR produces a less confusing error
    >> message, such as "cr_checkpoint: checkpoint failed: Input/output
    >> error".  There really should be no reference to the internal ioctl() call.
    >>
    >> José M. Martín wrote:
    >>> Hello,
    >>>
    >>> first, thanks for this project.
    >>>
    >>> I tried to set up blcr, but I have a problem. When I launch a program and
    >>> checkpoint it, I get the following error:
    >>> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
    >>>
    >>> I have tried with kernels 2.6.20 (vanilla) and 2.6.18.8-0.8 (opensuse
    >>> 10.2 default) on a node. On both, I get the same error.
    >>> Nevertheless, on another node with opensuse 10.2 and kernel 2.6.23.1, it
    >>> runs without problem.
    >>>
    >>> I have passed the testsuite:
    >>> ======================
    >>> All 34 tests passed
    >>> (1 tests were not run)
    >>> ======================
    >>>
    >>> No hugetlbfs mount point found (test skipped)
    >>> SKIP: hugetlbfs.ct
    >>>
    >>> I can load the blcr modules without problem, execute binaries, link
    >>> libraries,...
    >>>
    >>> I'm using version 0.6.4
    >>> Nodes are x86 (Pentium 4)
    >>>
    >>> Any help will be appreciated.
    >>>
    >>> Thanks in advance
    > 
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
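
Editorial sketch (not part of the original message, and not part of BLCR):
since the thread notes that BLCR reports nearly all failed write() calls as
EIO, even when the real cause is out-of-space or over-quota, one way to look
for the underlying error is to write a test file into the checkpoint
directory from user space and print the errno that comes back. The helper
name probe_write and its parameters are hypothetical. A successful probe does
not rule out a BLCR/GlusterFS interaction, because BLCR issues its writes
from kernel addresses and this probe does not.

```python
import errno
import os
import tempfile


def probe_write(directory, size=1 << 20):
    """Write and fsync `size` bytes of zeros in `directory`.

    Returns (ok, errno_name, free_bytes). `free_bytes` is the space
    available to unprivileged users as reported by statvfs, or None if
    statvfs itself failed. Hypothetical helper, not a BLCR interface.
    """
    fd = None
    path = None
    free_bytes = None
    try:
        st = os.statvfs(directory)
        free_bytes = st.f_bavail * st.f_frsize

        fd, path = tempfile.mkstemp(prefix="probe-", dir=directory)
        chunk = b"\0" * 65536
        remaining = size
        while remaining > 0:
            # os.write may write fewer bytes than requested.
            remaining -= os.write(fd, chunk[:remaining])
        os.fsync(fd)   # force the data out so deferred errors surface here
        os.close(fd)   # close can also report a deferred write error
        fd = None
        return True, None, free_bytes
    except OSError as exc:
        # Report the symbolic name (e.g. ENOSPC, EDQUOT, EIO) when known.
        return False, errno.errorcode.get(exc.errno, str(exc.errno)), free_bytes
    finally:
        if fd is not None:
            try:
                os.close(fd)
            except OSError:
                pass
        if path is not None:
            try:
                os.unlink(path)
            except OSError:
                pass
```

Calling probe_write() on the directory where the checkpoint context file
would be written, once on the GlusterFS mount and once on a local drive,
gives a side-by-side comparison. A result such as ENOSPC or EDQUOT would
point to the out-of-space or over-quota cases mentioned above.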
    
