Re: thaw_threads returned error

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Fri Sep 02 2005 - 08:10:13 PDT

    Paul,
    
    you got it!
    
    If I run "ls -l /proc/PID/exe" as the user who owns the
    program (which lives in an NFS-mounted directory), I see the
    link to the file:
    
    
    lrwxrwxrwx    1 adolfo   adolfo          0 Sep  2 11:45 exe ->
    /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x
    
    
    but after a little while I get
    
    
    lrwxrwxrwx    1 adolfo   adolfo          0 Sep  2 11:45 exe ->
    /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    
    with a blinking link. At the same time, and even before the
    "(deleted)" marker appeared, running "ls -l /proc/PID/exe" as root
    also gave me a blinking link.
    
    So the problem was that BLCR was getting the "(deleted)" suffix
    from /proc even though the file had not been deleted.
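
For anyone who wants to repeat this check without reading "ls" output by eye, here is a minimal Python sketch. It assumes a Linux /proc and relies only on the " (deleted)" suffix seen in the listings above. The function names are made up for illustration and are not part of BLCR.

```python
import os

# Suffix that appears on the /proc/<pid>/exe link target when the kernel
# considers the file unlinked (as seen in the listings in this message).
DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(target):
    """Return (path, flagged) for a /proc/<pid>/exe link target string."""
    if target.endswith(DELETED_SUFFIX):
        return target[:-len(DELETED_SUFFIX)], True
    return target, False


def exe_status(pid):
    """Read /proc/<pid>/exe and compare the kernel's flag with the filesystem.

    A result with flagged_deleted=True and exists_on_disk=True matches the
    situation described here: the file is reported as deleted but is still
    present under the same path.
    """
    target = os.readlink(f"/proc/{pid}/exe")
    path, flagged = split_deleted_suffix(target)
    return {
        "path": path,
        "flagged_deleted": flagged,
        "exists_on_disk": os.path.exists(path),
    }
```

Calling exe_status(PID) once as the owning user and once as root should show the same difference I saw with "ls -l". Note that os.path.exists() is subject to the caller's permissions, so under root squashing it may report False for a file that is actually there.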
    
    
    
    I changed my exports file (after searching the web for
    similar problems) from
    
    /export 10.0.0.0/255.0.0.0(rw)
     
    to
    
    /export 10.0.0.0/255.0.0.0(rw,no_root_squash)
    
    
    and now the problem is solved.
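
To check which exports still squash root, something like the following Python sketch could be used. It handles only the simple "path client(options)" form shown above (no quoted paths or line continuations), and both function names are hypothetical. It relies on the documented default that root is squashed unless "no_root_squash" is given.

```python
def parse_exports_line(line):
    """Split one exports line into (path, {client: [options]}).

    Returns None for blank lines and comments. Simplified: does not
    handle quoted paths or backslash line continuations.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    path, *specs = line.split()
    clients = {}
    for spec in specs:
        # "host(opt1,opt2)" -> host, "opt1,opt2)"
        host, _, rest = spec.partition("(")
        clients[host] = [o for o in rest.rstrip(")").split(",") if o]
    return path, clients


def squashes_root(options):
    """Root squashing is the default; only no_root_squash turns it off."""
    return "no_root_squash" not in options


if __name__ == "__main__":
    for text in ("/export 10.0.0.0/255.0.0.0(rw)",
                 "/export 10.0.0.0/255.0.0.0(rw,no_root_squash)"):
        path, clients = parse_exports_line(text)
        for host, opts in clients.items():
            print(path, host, "root squashed:", squashes_root(opts))
```

For the two lines above, the first reports root as squashed and the second does not.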
    
    
    My apologies if this is simply how it had to be configured. I still
    do not quite understand why this happens, though (especially why the
    file appears as deleted in /proc).
    
    
    Thanks a lot for your help,
    
    adolfo
    
    
    P.S.:  I also used to get many "Stale NFS file handle" errors when
    using "su" to become root in the mounted home directory; these are
    now solved as well by "no_root_squash".
    
    
    
    
    On Thu, 2005-09-01 at 18:27, Paul H. Hargrove wrote:
    > See my reply at the end.
    > 
    > Adolfo J. Banchio wrote:
    > > Paul,
    > > 
    > > thanks for your answer, I actually would like to comment
    > > on some of your explanation. see below
    > [snip]
    > >>
    > >>>but when restarting,
    > >>>it also sometimes happens that I get the following
    > >>>messages in /var/log/messages 
    > >>>
    > >>> kernel: vmadump: mmap failed:
    > >>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    > >>>
    > >>> kernel: thaw_threads returned error, aborting. -1
    > [snip]
    > >>What this tells me is that the application has created the file named 
    > >>above and mmaped it.  However, at some point *before* the checkpoint was 
    > >>taken the file was deleted (so "--signal 9" won't help).  This is a 
    > >>perfectly legal thing to do, and the kernel will remove the directory 
    > >>entry immediately and will delay removing the file contents until the 
    > >>file is no longer mmaped.  Unfortunately, that means that by the time we 
    > >>go to restore, the file is gone.
    > [snip]
    > > actually the file which is supposed to be deleted is the executable,
    > > in this case, and it is NOT deleted. What I suspect is that this could
    > > be related to the fact that this file is in an NFS mounted filesystem.
    > > If I use the same executable, now placed in the local filesystem, I do 
    > > not get this kind of error (at least not yet). Could this be possible?
    > 
    > Ah, that is a bit different than I thought.  We should not be failing.
    > 
    > I typically do about 1/2 of my testing on NFS filesystems, so I doubt 
    > that there is a problem here that is a fundamental incompatibility with 
    > NFS, but I am guessing that NFS may be a contributing factor.
    > 
    > The string " (deleted)" that appeared in the "mmap failed" message is 
    > actually part of the filename that was saved at checkpoint time.  We 
    > can't reopen it under this incorrect name.  This string is the result of 
    > the checkpoint code querying the kernel for a name to go with the 
    > internal data structure.  It is possible that for some reason the name 
    > information has been expired from some cache, which has been falsely 
    > tagged as deleted.  That would also explain why it sometimes did work 
    > but often did not.  I'll try to look into how NFS filename lookups might 
    > differ from other filesystems.
    > 
    > I'd be curious what one gets for "ls -l /proc/<pid>/exe" for the running 
    > program.  The proc filesystem should be using the same code as blcr to 
    > get the filename.
    > 
    > > Thanks in advance,
    > > 
    > > adolfo
    > 
    > 
    > -Paul
    > 
    
