From: Adolfo J. Banchio (banchio_at_famaf_dot_unc_dot_edu.ar)
Date: Thu Sep 01 2005 - 14:13:00 PDT
Paul, thanks for your answer. I would actually like to comment on part
of your explanation; see below.

> > I've got modules loaded and I'm testing the BLCR for
> > f90 codes compiled by Intel Fortran F90. I can run
> > and "cr_checkpoint --term" the code (sometimes it does
> > not really kills the job),
>
> I can be fairly certain that we do send SIGTERM to the process.
> However, that is all we do and the process is free to ignore the signal.
> One could use '--signal 9' to send an unignorable kill signal, which
> would not allow the application to perform any cleanup (but sometimes we
> don't want the cleanup, which could delete files needed for the restart).
>
> > but when restarting,
> > also sometimes happens that I get the following
> > messages in /var/log/messages
> >
> > kernel: vmadump: mmap failed:
> > /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
> >
> > kernel: thaw_threads returned error, aborting. -1
> >
>
> What this tells me is that the application has created the file named
> above and mmaped it. However, at some point *before* the checkpoint was
> taken the file was deleted (so "--signal 9" won't help). This is a
> perfectly legal thing to do, and the kernel will remove the directory
> entry immediately and will delay removing the file contents until the
> file is no longer mmaped. Unfortunately, that means that by the time we
> go to restore, the file is gone.
>
> There is very little I can do about this immediately, except to move the
> error to checkpoint time to avoid "false hopes" of restarting.
>
> In the longer term we do plan to explicitly deal with deleted files.
>

Actually, the file that is reported as deleted is the executable itself
in this case, and it is NOT deleted. What I suspect is that this could be
related to the fact that the file is on an NFS-mounted filesystem. If I
use the same executable, now placed on the local filesystem, I do not get
this kind of error (at least not yet). Could this be possible?

Thanks in advance, adolfo
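
P.S. For anyone who wants to see the "deleted but still mmaped" behaviour Paul describes, here is a minimal Python sketch. It is not part of BLCR and does not reproduce the restart failure. The function name unlink_while_mapped and its directory parameter are made up for this illustration, and it assumes a POSIX system (the /proc/self/maps lookup only does anything on Linux).

```python
import mmap
import os
import tempfile


def unlink_while_mapped(directory=None, payload=b"checkpoint me"):
    """Map a scratch file, delete its name, and report what is still visible.

    Hypothetical helper for illustration only; `directory` selects where the
    scratch file is created (for example a local or an NFS-mounted directory).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        # Read-only mapping of the whole file; it stays valid after close(fd).
        mapping = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
    finally:
        os.close(fd)

    # Remove the directory entry while the mapping is still alive.
    os.unlink(path)

    result = {
        "path_exists": os.path.exists(path),
        "data_still_readable": mapping[:] == payload,
        "maps_entry": None,  # filled in only if the kernel lists the mapping
    }

    # On Linux, look for the mapping in this process's own map table.
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as maps:
            for line in maps:
                if path in line:
                    result["maps_entry"] = line.strip()
                    break

    mapping.close()
    return result


if __name__ == "__main__":
    for key, value in unlink_while_mapped().items():
        print(key, "=", value)
```

Passing directory= lets you run the same check once on a local directory and once on an NFS-mounted one and compare what is reported in each case.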