From: Michael Klemm (michael.klemm_at_informatik.uni-erlangen.de)
Date: Fri Nov 04 2005 - 07:28:17 PST
Hi Paul!

We did some investigation here and found the cause of the corrupted checkpoints. (Call me Sherlock from now on :-) )

Paul H. Hargrove wrote:
> The second, less likely, option is that BLCR is terribly confused. If
> you could 'ls -l /proc/<pid>/fds' and 'cat /proc/<pid>/maps' for the
> running application, look for the /var/run/nscd file in either place and
> let me know what you find. If it is in either place, then BLCR is not
> confused.

The file name "/var/run/nscd/xxxxxxxx" that Christian sketched is a cache file of NSCD (the Name Service Cache Daemon) on Linux. Today I disabled the service on the machine, and the checkpoints can now be restarted. It looks like BLCR gets confused by the mmap of NSCD's cache file.

For now, we're perfectly satisfied, as long as our local admin doesn't complain about the missing NSCD. However, for his thesis, Christian will have to run tests on the cluster of our local computing center. On those machines, getting NSCD disabled won't be that easy. Do you have any hints on how to solve the problem?

Best regards
-michael

--
Computer Science Department 2, University of Erlangen-Nuremberg
Martensstrasse 3, D-91058 Erlangen, Germany
phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
web: http://www2.informatik.uni-erlangen.de/~klemm
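
For anyone who wants to repeat the check Paul suggested on another machine, here is a rough Python sketch. It assumes a Linux /proc filesystem and enough permissions to read the target process's entries; the function names nscd_mappings and nscd_fds are made up for illustration and are not part of BLCR. Open descriptors are listed under /proc/<pid>/fd on Linux, so that is the directory the sketch reads.

```python
import os
import sys


def nscd_mappings(maps_text, prefix="/var/run/nscd"):
    """Return mapped file paths in maps_text that start with prefix."""
    hits = []
    for line in maps_text.splitlines():
        # Each line: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if path.startswith(prefix):
                hits.append(path)
    return hits


def nscd_fds(pid, prefix="/var/run/nscd"):
    """Return (fd, target) pairs for open descriptors pointing under prefix."""
    hits = []
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor went away while we were scanning
        if target.startswith(prefix):
            hits.append((fd, target))
    return hits


# Only runs when a pid is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    pid = sys.argv[1]
    with open(f"/proc/{pid}/maps") as f:
        print("mappings:", nscd_mappings(f.read()))
    print("descriptors:", nscd_fds(pid))
```

Run it with the pid of the application to be checkpointed. A non-empty "mappings" list would match what we saw here: the NSCD cache file is mmapped into the process.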