Re: thaw_threads returned error

From: Adolfo J. Banchio
Date: Thu Sep 01 2005 - 14:13:00 PDT

    Thanks for your answer. I would like to comment
    on part of your explanation; see below.
    > > I've got the modules loaded and I'm testing BLCR with
    > > F90 codes compiled with Intel Fortran. I can run the
    > > code and "cr_checkpoint --term" it (although sometimes
    > > this does not actually kill the job), 
    > I can be fairly certain that we do send SIGTERM to the process. 
    > However, that is all we do and the process is free to ignore the signal. 
    >   One could use '--signal 9' to send an unignorable kill signal, which 
    > would not allow the application to perform any cleanup (but sometimes we 
    > don't want the cleanup, which could delete files needed for the restart).
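    A minimal sketch of the difference described above, assuming a POSIX
    system and Python 3. It does not use BLCR at all: the child script and
    the helper name term_then_kill are hypothetical and only illustrate
    that a process may ignore SIGTERM while signal 9 (SIGKILL) cannot be
    ignored.

```python
import signal
import subprocess
import sys

# Hypothetical child program: it ignores SIGTERM, reports that it is
# ready, and then sleeps, standing in for a job that does not exit.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def term_then_kill(grace=1.0):
    """Send SIGTERM, then SIGKILL; return (survived_sigterm, returncode)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    try:
        # Wait until the child has installed its SIGTERM disposition.
        child.stdout.readline()
        child.send_signal(signal.SIGTERM)
        try:
            child.wait(timeout=grace)
            survived = False
        except subprocess.TimeoutExpired:
            survived = True  # SIGTERM was ignored
        child.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        child.wait()
        return survived, child.returncode
    finally:
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()


if __name__ == "__main__":
    print(term_then_kill())
```

    On POSIX, subprocess reports a child killed by signal N with a return
    code of -N, which is how the sketch tells the two outcomes apart.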
    > > but when restarting,
    > > it also sometimes happens that I get the following
    > > messages in /var/log/messages:
    > > 
    > >  kernel: vmadump: mmap failed:
    > > /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    > > 
    > >  kernel: thaw_threads returned error, aborting. -1
    > > 
    > What this tells me is that the application has created the file named 
    > above and mmaped it.  However, at some point *before* the checkpoint was 
    > taken the file was deleted (so "--signal 9" won't help).  This is a 
    > perfectly legal thing to do, and the kernel will remove the directory 
    > entry immediately and will delay removing the file contents until the 
    > file is no longer mmaped.  Unfortunately, that means that by the time we 
    > go to restore, the file is gone.
    > There is very little I can do about this immediately, except to move the 
    > error to checkpoint time to avoid "false hopes" of restarting.
    > In the longer term we do plan to explicitly deal with deleted files.
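    The unlink-while-mapped behaviour in the quoted explanation can be
    reproduced with a short sketch, assuming Linux and Python 3. The helper
    name map_then_unlink is hypothetical, and nothing here involves BLCR or
    vmadump. It maps a temporary file, removes its directory entry, and
    then shows that the contents remain readable while /proc/self/maps
    marks the mapping with a "(deleted)" suffix like the one in the kernel
    message.

```python
import mmap
import os
import tempfile


def map_then_unlink(payload=b"checkpoint me"):
    """Map a temp file, unlink it, and report what is still visible.

    Returns (path_exists, data, maps_lines), where maps_lines are the
    /proc/self/maps entries that mention the unlinked path (empty if
    /proc is not available on this system).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # mmap duplicates the descriptor, so fd can be closed afterwards.
        mapping = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
    finally:
        os.close(fd)

    os.unlink(path)  # the directory entry goes away immediately
    path_exists = os.path.exists(path)
    data = mapping[:]  # the contents are kept alive by the mapping

    maps_lines = []
    try:
        with open("/proc/self/maps") as maps:
            maps_lines = [line.rstrip() for line in maps if path in line]
    except OSError:
        pass  # no /proc here; skip the Linux-specific check

    mapping.close()  # only now can the kernel release the contents
    return path_exists, data, maps_lines


if __name__ == "__main__":
    exists, data, lines = map_then_unlink()
    print("path still exists:", exists)
    print("data still readable:", data)
    for line in lines:
        print(line)
```

    Once the mapping is closed nothing refers to the file any more, which
    is why a later restart has no file left to map again.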
    Actually, the file that is reported as deleted is the executable in
    this case, and it is NOT deleted. I suspect this could be related
    to the fact that the file is on an NFS-mounted filesystem.
    If I run the same executable from the local filesystem, I do
    not get this kind of error (at least not yet). Is that possible?
    Thanks in advance,

