Re: thaw_threads returned error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Sep 02 2005 - 10:26:37 PDT

    Adolfo,
    
      I am glad that it appears to be working for you, but I am just as 
    uncertain as you about WHY it was thinking the file was deleted.  I 
    cannot guess how the root_squash option would be related.
      I am going to consider this a non-BLCR issue, since /proc exhibits the 
    same behavior.  However, if I can figure out why this happens I might be 
    able to work around it (perhaps by forcing NFS to re-fetch missing 
    metadata from the server).
    
    -Paul
    
    Adolfo J. Banchio wrote:
    
    >Paul,
    >
    >you got it!
    >
    >If I do "ls -l /proc/PID/exe" as the user who owns the 
    >program (in an NFS-mounted directory), I see the link 
    >to the file 
    >
    >
    >lrwxrwxrwx    1 adolfo   adolfo          0 Sep  2 11:45 exe ->
    >/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x
    >
    >
    >but after a little while I got
    >
    >
    >lrwxrwxrwx    1 adolfo   adolfo          0 Sep  2 11:45 exe ->
    >/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    >
    >with a blinking link. At the same time, and even before the
    >deleted message if I did as root  "ls -l /proc/PID/exe" I got
    >a blinking link.
    >
    >So, the problem was that BLCR was getting the "(deleted)"
    >from the /proc, even if the file wasn't deleted.
    >
    >
    >
    >I changed my export file (after looking in the web for
    >similar problems) from
    >
    >/export 10.0.0.0/255.0.0.0(rw)
    > 
    >to
    >
    >/export 10.0.0.0/255.0.0.0(rw,no_root_squash)
    >
    >
    >and now the problem is solved.
    >
    >
    >My apologies, if this had to be so. But I still do not understand
    >well why this is so (especially why it appears as deleted in /proc).
    >
    >
    >Thanks a lot for your help,
    >
    >adolfo
    >
    >
    >P.S.:  I also had many "NFS stale file handle" errors when using "su"
    >to become root in the mounted home directory, which are now solved 
    >with the "no_root_squash" option.
    >
    >
    >
    >
    >On Thu, 2005-09-01 at 18:27, Paul H. Hargrove wrote:
    >  
    >
    >>See my reply at the end.
    >>
    >>Adolfo J. Banchio wrote:
    >>    
    >>
    >>>Paul,
    >>>
    >>>thanks for your answer, I actually would like to comment
    >>>on some of your explanation. see below
    >>>      
    >>>
    >>[snip]
    >>    
    >>
    >>>>>but when restarting,
    >>>>>it also sometimes happens that I get the following
    >>>>>messages in /var/log/messages 
    >>>>>
    >>>>>kernel: vmadump: mmap failed:
    >>>>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    >>>>>
    >>>>>kernel: thaw_threads returned error, aborting. -1
    >>>>>          
    >>>>>
    >>[snip]
    >>    
    >>
    >>>>What this tells me is that the application has created the file named 
    >>>>above and mmaped it.  However, at some point *before* the checkpoint was 
    >>>>taken the file was deleted (so "--signal 9" won't help).  This is a 
    >>>>perfectly legal thing to do, and the kernel will remove the directory 
    >>>>entry immediately and will delay removing the file contents until the 
    >>>>file is no longer mmaped.  Unfortunately, that means that by the time we 
    >>>>go to restore, the file is gone.
    >>>>        
    >>>>
    >>[snip]
    >>    
    >>
    >>>actually the file which is supposed to be deleted is the executable, in
    >>>this case, and it is NOT deleted. What I suspect is that this could
    >>>be related to the fact that this file is in an NFS-mounted filesystem.
    >>>If I use the same executable, now placed in the local filesystem, I do 
    >>>not get this kind of error (at least not yet). Could this be possible?
    >>>      
    >>>
    >>Ah, that is a bit different than I thought.  We should not be failing.
    >>
    >>I typically do about 1/2 of my testing on NFS filesystems, so I doubt 
    >>that there is a problem here that is a fundamental incompatibility with 
    >>NFS, but I am guessing that NFS may be a contributing factor.
    >>
    >>The string " (deleted)" that appeared in the "mmap failed" message is 
    >>actually part of the filename that was saved at checkpoint time.  We 
    >>can't reopen it under this incorrect name.  This string is the result of 
    >>the checkpoint code querying the kernel for a name to go with the 
    >>internal data structure.  It is possible that for some reason the name 
    >>information has been expired from some cache, and the name has then been 
    >>falsely tagged as deleted.  That would also explain why it sometimes did work 
    >>but often did not.  I'll try to look into how NFS filename lookups might 
    >>differ from other filesystems.
    >>
    >>I'd be curious what one gets for "ls -l /proc/<pid>/exe" for the running 
    >>program.  The proc filesystem should be using the same code as blcr to 
    >>get the filename.
    >>
    >>    
    >>
    >>>Thanks in advance,
    >>>
    >>>adolfo
    >>>      
    >>>
    >>-Paul
    >>
    >>    
    >>
    >
    >  
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
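
For readers who want to reproduce the check discussed in this thread, here is a minimal sketch in Python (not part of BLCR, and not code from either author). It reads the /proc/<pid>/exe link, strips a trailing " (deleted)" tag if present, and compares the inode reached through /proc with the inode found at the named path. The function names and the returned dictionary keys are hypothetical, chosen only for illustration. It assumes a Linux /proc, and it cannot distinguish a file whose real name happens to end in " (deleted)".

```python
import os

# Suffix the kernel appends to a link target when it believes the file is unlinked.
DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(target):
    """Return (path, tagged); tagged is True if target ends with ' (deleted)'."""
    if target.endswith(DELETED_SUFFIX):
        return target[:-len(DELETED_SUFFIX)], True
    return target, False


def check_exe_link(pid):
    """Compare what /proc/<pid>/exe reports with what is found at that path."""
    link = "/proc/%d/exe" % pid
    path, tagged = split_deleted_suffix(os.readlink(link))
    try:
        # stat() through the /proc link reaches the file the process is running.
        mapped = os.stat(link)
        # stat() on the plain path reaches whatever is at that name now.
        on_disk = os.stat(path)
        same_file = (mapped.st_dev, mapped.st_ino) == (on_disk.st_dev, on_disk.st_ino)
    except OSError:
        # Missing file, stale handle, or permission problem: treat as not confirmed.
        same_file = False
    return {
        "path": path,
        "tagged_deleted": tagged,
        "still_present": same_file,
        # Tagged as deleted although the same file is still reachable by name.
        "false_deleted": tagged and same_file,
    }
```

Calling check_exe_link(PID) as the owning user, and again as root, on a process whose executable lives on the NFS mount should show whether the "(deleted)" tag appears while the file is in fact still present.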
    
