From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Nov 04 2005 - 09:48:35 PST
I may have a solution. The attached patch should cause BLCR to store the
actual contents of any deleted mmapped file, rather than storing just the
filename. This should solve the problem if the file is not still open
within NSCD (and thus potentially changing).

However, if NSCD is also attached to the file (via open() or mmap()) and
expects to communicate with the application through this file, then there
is no good way for BLCR to save and restore this "communication channel".
The best we could hope for in that case would be to "undelete" the file by
linking it back into the filesystem with its original name. That is likely
to create a "leak" of such files, so I'd not consider it a general-purpose
solution.

Let me know if this patch works or not so I can include it in the next
release (which I am hoping to put out next week).

-Paul

Michael Klemm wrote:
> Hi Paul!
>
> We did some investigation here, and found the cause of the corrupted
> checkpoints. (Call me Sherlock by now :-) ).
>
> Paul H. Hargrove wrote:
>> The second, less likely, option is that BLCR is terribly confused. If
>> you could 'ls -l /proc/<pid>/fds' and 'cat /proc/<pid>/maps' for the
>> running application, look for the /var/run/nscd file in either place and
>> let me know what you find. If it is in either place, then BLCR is not
>> confused.
>
> The file name "/var/run/nscd/xxxxxxxx" that was sketched by Christian is
> a cache file of the NSCD (Name Service Cache Daemon) of Linux. Today, I
> disabled the service on the machine and the checkpoints can be restarted
> now.
>
> It looks like, that BLCR gets confused by the mmap of NSCD's cache file.
> For now, we're perfectly satisfied as long as our local admin won't
> complain about the missing NSCD. However, for his thesis, Christian
> will be forced to make tests on the cluster of our local computing
> center. On these machines, getting the NSCD disabled won't be that easy.
>
> Do you have any hints how to get the problem solved?
>
> Best regards
> -michael
>
> --
> Computer Science Department 2, University of Erlangen-Nuremberg
> Martensstrasse 3, D-91058 Erlangen, Germany
> phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
> web: http://www2.informatik.uni-erlangen.de/~klemm

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

Index: vmadump/vmadump.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/vmadump/vmadump.c,v
retrieving revision 1.59
diff -u -u -r1.59 vmadump.c
--- vmadump/vmadump.c   26 Sep 2005 20:19:59 -0000      1.59
+++ vmadump/vmadump.c   2 Nov 2005 22:38:55 -0000
@@ -1210,7 +1210,10 @@
     filename = default_map_name(map->vm_file, buffer, PAGE_SIZE);
     head.namelen = strlen(filename);

-    if (map->vm_flags & VM_IO) {
+    if ((head.namelen > 10) && !strcmp(filename + head.namelen - 10, " (deleted)")) {
+        /* Region is a deleted file */
+        head.namelen = 0;
+    } else if (map->vm_flags & VM_IO) {
         /* Region is an IO map. */
         /* Never store the contents of a VM_IO region */
Index: vmadump4/vmadump_common.c
===================================================================
RCS file: /var/local/cvs/lbnl_cr/vmadump4/vmadump_common.c,v
retrieving revision 1.17
diff -u -u -r1.17 vmadump_common.c
--- vmadump4/vmadump_common.c   27 Oct 2005 18:29:32 -0000      1.17
+++ vmadump4/vmadump_common.c   2 Nov 2005 22:38:55 -0000
@@ -924,7 +924,10 @@
     filename = default_map_name(map->vm_file, buffer, PAGE_SIZE);
     head.namelen = strlen(filename);

-    if (map->vm_flags & VM_IO) {
+    if ((head.namelen > 10) && !strcmp(filename + head.namelen - 10, " (deleted)")) {
+        /* Region is a deleted file */
+        head.namelen = 0;
+    } else if (map->vm_flags & VM_IO) {
         /* Region is an IO map. */
         /* Never store the contents of a VM_IO region */
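
For anyone who wants to try the suffix test from the patch outside the kernel, here is a minimal user-space sketch. The function name is_deleted_mapping_name() and the DELETED_SUFFIX macro are hypothetical and not part of BLCR or vmadump. The sketch only mirrors the string comparison in the patch (a name longer than 10 characters that ends in " (deleted)"). It does not show what vmadump does after setting head.namelen to 0.

```c
#include <stddef.h>
#include <string.h>

/* Suffix matched by the patch for a mapped file that has been unlinked. */
#define DELETED_SUFFIX " (deleted)"

/*
 * Hypothetical helper (not a BLCR function): return 1 if `name` ends with
 * " (deleted)" and has at least one character before the suffix, else 0.
 * This is the same condition as
 *   (namelen > 10) && !strcmp(filename + namelen - 10, " (deleted)")
 * in the patch.
 */
static int is_deleted_mapping_name(const char *name)
{
    const size_t suffix_len = sizeof(DELETED_SUFFIX) - 1; /* 10 characters */
    size_t len;

    if (name == NULL)
        return 0;

    len = strlen(name);
    if (len <= suffix_len)
        return 0; /* too short to be "<path> (deleted)" */

    return strcmp(name + len - suffix_len, DELETED_SUFFIX) == 0;
}
```

With this helper, a name such as "/var/run/nscd/xxxxxxxx (deleted)" is reported as deleted, while "/var/run/nscd/xxxxxxxx" and the bare string " (deleted)" are not. One caveat of any name-based test is that a live file whose real name happens to end in " (deleted)" would also match.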