From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Fri Sep 02 2005 - 08:10:13 PDT
Paul,

you got it! If I do "ls -l /proc/PID/exe" as the user who owns the
program (in an NFS-mounted directory), I see the link to the file

lrwxrwxrwx 1 adolfo adolfo 0 Sep 2 11:45 exe -> /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x

but after a little while I get

lrwxrwxrwx 1 adolfo adolfo 0 Sep 2 11:45 exe -> /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)

with a blinking link. At the same time, and even before the "(deleted)"
message appeared, if I did "ls -l /proc/PID/exe" as root I got a
blinking link.

So the problem was that BLCR was getting the "(deleted)" from /proc,
even though the file wasn't deleted.

I changed my exports file (after searching the web for similar
problems) from

/export 10.0.0.0/255.0.0.0(rw)

to

/export 10.0.0.0/255.0.0.0(rw,no_root_squash)

and now the problem is solved. My apologies if this is how it had to be
configured. But I still do not understand well why this is so
(especially why the file appears as deleted in /proc).

Thanks a lot for your help,

adolfo

P.S.: I was also getting many NFS "stale file handle" errors when using
"su" to become root in the mounted home directory; these are now solved
by "no_root_squash" as well.

On Thu, 2005-09-01 at 18:27, Paul H. Hargrove wrote:
> See my reply at the end.
>
> Adolfo J. Banchio wrote:
> > Paul,
> >
> > thanks for your answer, I actually would like to comment
> > on some of your explanation. see below
> [snip]
> >>
> >>>but when restarting,
> >>>also sometimes happens that I get the following
> >>>messages in /var/log/messages
> >>>
> >>> kernel: vmadump: mmap failed:
> >>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
> >>>
> >>> kernel: thaw_threads returned error, aborting. -1
> [snip]
> >>What this tells me is that the application has created the file named
> >>above and mmaped it. However, at some point *before* the checkpoint was
> >>taken the file was deleted (so "--signal 9" won't help). This is a
> >>perfectly legal thing to do, and the kernel will remove the directory
> >>entry immediately and will delay removing the file contents until the
> >>file is no longer mmaped. Unfortunately, that means that by the time we
> >>go to restore, the file is gone.
> [snip]
> > actually the file which is supposed to be deleted is the executable, in
> > this case, and it is NOT deleted. What I suspect, is that this could
> > be related to the fact that this file is in a NFS mounted filesystem.
> > If I use the same executable, now placed in the local filesystem, I do
> > not get this kind of errors (at least not yet). Could this be possible?
>
> Ah, that is a bit different than I thought. We should not be failing.
>
> I typically do about 1/2 of my testing on NFS filesystems, so I doubt
> that there is a problem here that is a fundamental incompatibility w/
> NFS, but I am guessing that NFS may be a contributing factor.
>
> The string " (deleted)" that appeared in the "mmap failed" message is
> actually part of the filename that was saved at checkpoint time. We
> can't reopen it under this incorrect name. This string is the result of
> the checkpoint code querying the kernel for a name to go with the
> internal data structure. It is possible that for some reason the name
> information has been expired from some cache, which has been falsely
> tagged as deleted. That would also explain why it sometimes did work
> but often did not. I'll try to look into how NFS filename lookups might
> differ from other filesystems.
>
> I'd be curious what one gets for "ls -l /proc/<pid>/exe" for the running
> program. The proc filesystem should be using the same code as blcr to
> get the filename.
>
> > Thanks in advance,
> >
> > adolfo
>
>
> -Paul
>
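
For anyone who wants to run the same "ls -l /proc/PID/exe" check from a script, here is a minimal sketch in Python. It assumes a Linux /proc filesystem and the " (deleted)" suffix exactly as it appears in the output above. The function names (split_deleted_suffix, check_exe_link) are made up for this illustration and are not part of BLCR. It only reports what the kernel says and whether the path is still visible to the calling user; it does not explain or fix the NFS behaviour.

```python
import os

# Suffix shown in the /proc/PID/exe link target when the kernel
# considers the file deleted (taken from the ls output in this thread).
DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(target):
    """Split a /proc/<pid>/exe link target into (path, flagged_deleted)."""
    if target.endswith(DELETED_SUFFIX):
        return target[:-len(DELETED_SUFFIX)], True
    return target, False


def check_exe_link(pid):
    """Read /proc/<pid>/exe and compare the kernel's view with the disk.

    Hypothetical helper: returns a dict with the executable path, whether
    the kernel tagged it as deleted, and whether the path still exists
    for the user running this check.
    """
    target = os.readlink("/proc/{}/exe".format(pid))
    path, flagged = split_deleted_suffix(target)
    return {
        "path": path,
        "flagged_deleted": flagged,
        "exists_on_disk": os.path.exists(path),
    }
```

Calling check_exe_link(PID) once as the owning user and once as root would show the situation described above when "flagged_deleted" is true while "exists_on_disk" is also true.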