Re: thaw_threads returned error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 01 2005 - 14:27:34 PDT

    See my reply at the end.
    Adolfo J. Banchio wrote:
    > Paul,
    > thanks for your answer, I actually would like to comment
    > on some of your explanation. see below
    >>>but when restarting,
    >>>it also sometimes happens that I get the following
    >>>messages in /var/log/messages 
    >>> kernel: vmadump: mmap failed:
    >>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    >>> kernel: thaw_threads returned error, aborting. -1
    >>What this tells me is that the application has created the file named 
    >>above and mmaped it.  However, at some point *before* the checkpoint was 
    >>taken the file was deleted (so "--signal 9" won't help).  This is a 
    >>perfectly legal thing to do, and the kernel will remove the directory 
    >>entry immediately and will delay removing the file contents until the 
    >>file is no longer mmaped.  Unfortunately, that means that by the time we 
    >>go to restore, the file is gone.
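
    The unlink-while-mapped behavior described above can be reproduced with a
    minimal sketch, assuming a POSIX system. The helper name
    read_after_unlink and the scratch file are hypothetical and for
    illustration only; none of this is BLCR code.

```python
import mmap
import os
import tempfile


def read_after_unlink(payload):
    """Map a scratch file, delete its name, then read it through the mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # Map the file read-only; payload must be non-empty.
        mapping = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed
    os.unlink(path)   # the directory entry disappears immediately
    try:
        # The name is gone, but the contents are still reachable via the mapping.
        assert not os.path.exists(path)
        return mapping[:]
    finally:
        mapping.close()  # only now can the kernel release the file contents


if __name__ == "__main__":
    print(read_after_unlink(b"still here"))
```

    On Linux, such a mapping should be listed in /proc/<pid>/maps with
    " (deleted)" appended to its path for as long as it exists.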
    > actually the file which is supposed to be deleted is the executable, in
    > this case, and it is NOT deleted. What I suspect is that this could
    > be related to the fact that this file is on an NFS-mounted filesystem.
    > If I use the same executable, now placed on the local filesystem, I do 
    > not get this kind of error (at least not yet). Could this be possible?
    Ah, that is a bit different than I thought.  We should not be failing.
    I typically do about 1/2 of my testing on NFS filesystems, so I doubt 
    that there is a problem here that is a fundamental incompatibility with 
    NFS, but I am guessing that NFS may be a contributing factor.
    The string " (deleted)" that appeared in the "mmap failed" message is 
    actually part of the filename that was saved at checkpoint time.  We 
    can't reopen it under this incorrect name.  This string is the result of 
    the checkpoint code querying the kernel for a name to go with the 
    internal data structure.  It is possible that for some reason the name 
    information has expired from some cache, and the file has then been 
    falsely tagged as deleted.  That would also explain why it sometimes worked 
    but often did not.  I'll try to look into how NFS filename lookups might 
    differ from other filesystems.
    I'd be curious what one gets for "ls -l /proc/<pid>/exe" for the running 
    program.  The proc filesystem should be using the same code as blcr to 
    get the filename.
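
    The check suggested above can also be scripted. The following is a
    minimal sketch, assuming Linux with /proc mounted. The helper names
    split_deleted_suffix and exe_name are hypothetical and not part of blcr;
    the script only reads the same symlink that "ls -l /proc/<pid>/exe"
    displays and looks for the trailing " (deleted)" string.

```python
import os
import sys

DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(name):
    """Return (path, flagged), where flagged is True if name ends in ' (deleted)'.

    Caveat: a file whose real name ends in ' (deleted)' cannot be told apart
    from an unlinked file by the string alone.
    """
    if name.endswith(DELETED_SUFFIX):
        return name[: -len(DELETED_SUFFIX)], True
    return name, False


def exe_name(pid="self"):
    """Read the /proc/<pid>/exe symlink, as 'ls -l /proc/<pid>/exe' would show it."""
    return os.readlink("/proc/{}/exe".format(pid))


if __name__ == "__main__":
    # Pass a PID as the first argument, or inspect this process by default.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    path, flagged = split_deleted_suffix(exe_name(pid))
    print("{}: {}".format(path, "tagged as deleted" if flagged else "not tagged"))
    if flagged:
        # If the path still exists on disk, the tag does not match what is there now.
        print("path exists on disk:", os.path.exists(path))
```

    Running it against the PID of the program while it is executing from the
    NFS mount would show whether the kernel reports the executable as
    deleted even though the file is still present.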
    > Thanks in advance,
    > adolfo
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
