Re: Re [2]: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: Michael Klemm (michael.klemm_at_informatik.uni-erlangen.de)
Date: Fri Nov 04 2005 - 07:28:17 PST

  • Next message: Paul H. Hargrove: "Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument"
    Hi Paul!
    
    We did some investigating here and found the cause of the corrupted
    checkpoints. (Call me Sherlock from now on :-) ).
    
    Paul H. Hargrove wrote:
    >   The second, less likely, option is that BLCR is terribly confused.  If
    > you could 'ls -l /proc/<pid>/fds' and 'cat /proc/<pid>/maps' for the
    > running application, look for the /var/run/nscd file in either place and
    > let me know what you find.  If it is in either place, then BLCR is not
    > confused.
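
A minimal sketch of the check described above, assuming a Linux /proc
filesystem and permission to inspect the target process. On Linux the
per-process descriptor directory is /proc/<pid>/fd. The function names
find_in_maps and find_in_fds are hypothetical helpers written for this
illustration; they are not part of BLCR.

```python
import os

def find_in_maps(maps_text, needle="/var/run/nscd"):
    """Return the lines of a /proc/<pid>/maps dump that mention needle."""
    return [line for line in maps_text.splitlines() if needle in line]

def find_in_fds(pid, needle="/var/run/nscd"):
    """Return (fd, target) pairs from /proc/<pid>/fd whose link target mentions needle."""
    fd_dir = f"/proc/{pid}/fd"  # assumption: Linux procfs layout
    hits = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if needle in target:
            hits.append((name, target))
    return hits

def check(pid, needle="/var/run/nscd"):
    """Scan both the memory mappings and the open descriptors of pid."""
    with open(f"/proc/{pid}/maps") as f:
        maps_hits = find_in_maps(f.read(), needle)
    return maps_hits, find_in_fds(pid, needle)
```

Calling check(pid) on the running application returns two lists. If
only the first one is non-empty, the file is mapped into the address
space without an open descriptor remaining for it.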
    
    The file name "/var/run/nscd/xxxxxxxx" that Christian reported belongs
    to a cache file of NSCD, the Linux Name Service Cache Daemon.  Today I
    disabled the service on the machine, and the checkpoints can now be
    restarted.
    
    It looks like BLCR gets confused by the mmap of NSCD's cache file.
    For now, we're perfectly satisfied, as long as our local admin doesn't
    complain about the missing NSCD.  However, for his thesis, Christian
    will have to run tests on the cluster of our local computing
    center.  On these machines, getting NSCD disabled won't be that easy.
    
    Do you have any hints on how to solve the problem?
    
    Best regards,
    	-michael
    
    --
    Computer Science Department 2, University of Erlangen-Nuremberg
    Martensstrasse 3, D-91058 Erlangen, Germany
    phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
    web: http://www2.informatik.uni-erlangen.de/~klemm
    
    

