From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Sep 02 2005 - 10:26:37 PDT
Adolfo,

I am glad that it appears to be working for you, but I am just as
uncertain as you about WHY it was thinking the file was deleted. I
cannot guess how the root_squash option would be related. I am going to
consider this a non-BLCR issue, since /proc exhibits the same behavior.
However, if I can figure out why this happens I might be able to work
around it (perhaps by forcing NFS to re-fetch missing metadata from the
server).

-Paul

Adolfo J. Banchio wrote:

>Paul,
>
>you got it!
>
>If I do "ls -l /proc/PID/exe" as the user who owns the
>program (in an NFS mounted directory) I saw the link
>to the file
>
>lrwxrwxrwx 1 adolfo adolfo 0 Sep 2 11:45 exe ->
>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x
>
>but after a little while I got
>
>lrwxrwxrwx 1 adolfo adolfo 0 Sep 2 11:45 exe ->
>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
>
>with a blinking link. At the same time, and even before the
>deleted message, if I did "ls -l /proc/PID/exe" as root I got
>a blinking link.
>
>So, the problem was that BLCR was getting the "(deleted)"
>from /proc, even though the file wasn't deleted.
>
>I changed my exports file (after looking on the web for
>similar problems) from
>
>/export 10.0.0.0/255.0.0.0(rw)
>
>to
>
>/export 10.0.0.0/255.0.0.0(rw,no_root_squash)
>
>and now the problem is solved.
>
>My apologies, if this had to be so. But I still do not understand
>well why this is so (especially why it appears as deleted in /proc).
>
>Thanks a lot for your help,
>
>adolfo
>
>P.S.: I also had many NFS stale file handle errors when using "su"
>to become root in the mounted home directory, which are now solved
>with "no_root_squash".
>
>On Thu, 2005-09-01 at 18:27, Paul H. Hargrove wrote:
>
>>See my reply at the end.
>>
>>Adolfo J. Banchio wrote:
>>
>>>Paul,
>>>
>>>thanks for your answer, I actually would like to comment
>>>on some of your explanation.
>>>see below
>>>
>>[snip]
>>
>>>>>but when restarting,
>>>>>also sometimes happens that I get the following
>>>>>messages in /var/log/messages
>>>>>
>>>>>kernel: vmadump: mmap failed:
>>>>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
>>>>>
>>>>>kernel: thaw_threads returned error, aborting. -1
>>>>>
>>[snip]
>>
>>>>What this tells me is that the application has created the file named
>>>>above and mmaped it. However, at some point *before* the checkpoint was
>>>>taken the file was deleted (so "--signal 9" won't help). This is a
>>>>perfectly legal thing to do, and the kernel will remove the directory
>>>>entry immediately and will delay removing the file contents until the
>>>>file is no longer mmaped. Unfortunately, that means that by the time we
>>>>go to restore, the file is gone.
>>>>
>>[snip]
>>
>>>actually the file which is supposed to be deleted is the executable, in
>>>this case, and it is NOT deleted. What I suspect is that this could
>>>be related to the fact that this file is in an NFS mounted filesystem.
>>>If I use the same executable, now placed in the local filesystem, I do
>>>not get this kind of error (at least not yet). Could this be possible?
>>>
>>Ah, that is a bit different than I thought. We should not be failing.
>>
>>I typically do about 1/2 of my testing on NFS filesystems, so I doubt
>>that there is a problem here that is a fundamental incompatibility w/
>>NFS, but I am guessing that NFS may be a contributing factor.
>>
>>The string " (deleted)" that appeared in the "mmap failed" message is
>>actually part of the filename that was saved at checkpoint time. We
>>can't reopen it under this incorrect name. This string is the result of
>>the checkpoint code querying the kernel for a name to go with the
>>internal data structure. It is possible that for some reason the name
>>information has been expired from some cache, which has been falsely
>>tagged as deleted. That would also explain why it sometimes did work
>>but often did not. I'll try to look into how NFS filename lookups might
>>differ from other filesystems.
>>
>>I'd be curious what one gets for "ls -l /proc/<pid>/exe" for the running
>>program. The proc filesystem should be using the same code as blcr to
>>get the filename.
>>
>>>Thanks in advance,
>>>
>>>adolfo
>>>
>>-Paul
>>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
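
For anyone who wants to check for the same symptom before taking a checkpoint, the "ls -l /proc/PID/exe" test discussed in this thread could be scripted roughly as follows. This is only a sketch, not part of BLCR: the function names `split_deleted_suffix` and `check_exe` are made up for illustration, and it assumes a Linux /proc where the kernel appends " (deleted)" to the link target, as seen in the output quoted above. A path that still exists on disk while flagged as deleted is only a hint of the false tagging described here; the same result can also occur when the executable really was replaced by a new file of the same name.

```python
import os
import sys

# Suffix the kernel appends to the link target, as seen in this thread.
DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(link_target):
    """Return (path, flagged); flagged is True if the target ends in ' (deleted)'."""
    if link_target.endswith(DELETED_SUFFIX):
        return link_target[:-len(DELETED_SUFFIX)], True
    return link_target, False


def check_exe(pid):
    """Read /proc/<pid>/exe and report whether it looks falsely tagged as deleted."""
    target = os.readlink(f"/proc/{pid}/exe")
    path, flagged = split_deleted_suffix(target)
    exists = os.path.exists(path)
    return {
        "path": path,
        "flagged_deleted": flagged,
        "exists_on_disk": exists,
        # Flagged as deleted although the name still resolves: worth a closer look.
        "suspicious": flagged and exists,
    }


if __name__ == "__main__":
    # Usage: python check_exe.py PID   (defaults to the script's own process)
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    print(check_exe(pid))
```

Run it as the user who owns the process and again as root; with root_squash in effect the two results may differ, much as the two "ls -l" outputs did above.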