Re: thaw_threads returned error

From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Thu Sep 01 2005 - 14:13:00 PDT

    Paul,
    
    Thanks for your answer. I would like to comment on part of
    your explanation; see below.
    
    
    > > I've got the modules loaded and I'm testing BLCR with
    > > F90 codes compiled with Intel Fortran. I can run
    > > and "cr_checkpoint --term" the code (sometimes it does
    > > not really kill the job), 
    > 
    > I can be fairly certain that we do send SIGTERM to the process. 
    > However, that is all we do and the process is free to ignore the signal. 
    >   One could use '--signal 9' to send an unignorable kill signal, which 
    > would not allow the application to perform any cleanup (but sometimes we 
    > don't want the cleanup, which could delete files needed for the restart).
    > 
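To make the SIGTERM point concrete, here is a minimal sketch. It does not call BLCR or cr_checkpoint; it only assumes a POSIX system with Python 3, and the names CHILD and term_then_kill are made up for illustration. A child process that ignores SIGTERM survives it, while signal 9 (SIGKILL) cannot be ignored:

```python
import signal
import subprocess
import sys

# Child program (illustrative): ignore SIGTERM, announce readiness, then idle.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def term_then_kill():
    """Send SIGTERM, check whether the child survived, then send SIGKILL."""
    child = subprocess.Popen([sys.executable, "-c", CHILD],
                             stdout=subprocess.PIPE)
    child.stdout.readline()            # wait until SIGTERM is being ignored
    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=1.0)
        survived_term = False
    except subprocess.TimeoutExpired:
        survived_term = True           # the signal was delivered but ignored
    child.kill()                       # SIGKILL: cannot be caught or ignored
    status = child.wait()              # negative value = killed by that signal
    child.stdout.close()
    return survived_term, status
```

The child gets no chance to clean up after SIGKILL, which matches the trade-off described above.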
    > > but when restarting,
    > > it also sometimes happens that I get the following
    > > messages in /var/log/messages:
    > > 
    > >  kernel: vmadump: mmap failed:
    > > /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    > > 
    > >  kernel: thaw_threads returned error, aborting. -1
    > > 
    > 
    > What this tells me is that the application has created the file named 
    > above and mmaped it.  However, at some point *before* the checkpoint was 
    > taken the file was deleted (so "--signal 9" won't help).  This is a 
    > perfectly legal thing to do, and the kernel will remove the directory 
    > entry immediately and will delay removing the file contents until the 
    > file is no longer mmaped.  Unfortunately, that means that by the time we 
    > go to restore, the file is gone.
    > 
    > There is very little I can do about this immediately, except to move the 
    > error to checkpoint time to avoid "false hopes" of restarting.
    > 
    > In the longer term we do plan to explicitly deal with deleted files.
    > 
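The delete-while-mapped behaviour described above can be reproduced outside BLCR. The following is only a sketch, assuming Linux and Python 3; the function name map_then_unlink and the scratch file are illustrative and have nothing to do with vmadump itself. It maps a temporary file, unlinks it, and then looks at how the kernel reports the mapping in /proc/self/maps:

```python
import mmap
import os
import tempfile


def map_then_unlink(payload=b"checkpoint me\n"):
    """Map a scratch file, unlink it, and report what is still visible."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        mapping = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
    finally:
        os.close(fd)                   # the mapping keeps the file contents alive
    os.unlink(path)                    # the directory entry disappears right away
    still_readable = mapping[:] == payload
    path_exists = os.path.exists(path)
    maps_lines = []
    if os.path.exists("/proc/self/maps"):      # Linux only
        with open("/proc/self/maps") as maps:
            maps_lines = [line.rstrip("\n") for line in maps if path in line]
    mapping.close()
    return still_readable, path_exists, maps_lines
```

On a local Linux filesystem the matching line in /proc/self/maps ends in "(deleted)", the same marker that appears in the vmadump message, even though the mapped data is still readable by the process.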
    
    Actually, the file that is supposedly deleted is the executable in
    this case, and it is NOT deleted. What I suspect is that this could
    be related to the fact that the file is on an NFS-mounted filesystem.
    If I use the same executable, now placed on the local filesystem, I do
    not get this kind of error (at least not yet). Could this be possible?
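To check which filesystem an executable really lives on, something like the sketch below could be used. It assumes the usual /proc/mounts layout (device, mount point, type, ...); the helper name fs_type_for is made up, and mount points containing spaces (escaped as \040 in /proc/mounts) are not handled:

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    path = os.path.abspath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue                   # skip malformed or empty lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same mount point win
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best
```

For example, fs_type_for("/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x", open("/proc/mounts").read()) would report "nfs" as the type if /home is NFS-mounted.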
    
    
    
    
    Thanks in advance,
    
    adolfo
    
