From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 01 2005 - 14:27:34 PDT
See my reply at the end.

Adolfo J. Banchio wrote:
> Paul,
>
> thanks for your answer, I actually would like to comment
> on some of your explanation. see below
[snip]
>>
>>>but when restarting,
>>>also sometimes happens that I get the following
>>>messages in /var/log/messages
>>>
>>> kernel: vmadump: mmap failed:
>>>/home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
>>>
>>> kernel: thaw_threads returned error, aborting. -1
[snip]
>>What this tells me is that the application has created the file named
>>above and mmaped it. However, at some point *before* the checkpoint was
>>taken the file was deleted (so "--signal 9" won't help). This is a
>>perfectly legal thing to do, and the kernel will remove the directory
>>entry immediately and will delay removing the file contents until the
>>file is no longer mmaped. Unfortunately, that means that by the time we
>>go to restore, the file is gone.
[snip]
> actually the file which is supposed to deleted is the executable, in
> this case, and it is NOT deleted. What I suspect, is that this could
> be related to the fact that this file is in a NFS mounted filesystem.
> If I use the same executable, now place in the local filesystem, I do
> not get this kind of errors (at least not yet). Could this be possible?

Ah, that is a bit different from what I thought. We should not be failing.
I typically do about half of my testing on NFS filesystems, so I doubt
that there is a fundamental incompatibility with NFS here, but I am
guessing that NFS may be a contributing factor.

The string " (deleted)" that appeared in the "mmap failed" message is
actually part of the filename that was saved at checkpoint time. We can't
reopen the file under this incorrect name. The string is the result of the
checkpoint code querying the kernel for a name to go with the internal
data structure. It is possible that for some reason the name information
has expired from some cache, and the file has then been falsely tagged as
deleted.
That would also explain why it sometimes did work but often did not.

I'll try to look into how NFS filename lookups might differ from other
filesystems. I'd be curious what one gets for "ls -l /proc/<pid>/exe" for
the running program. The proc filesystem should be using the same code as
blcr to get the filename.

> Thanks in advance,
>
> adolfo

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
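
P.S. In case it is easier to script than to run by hand, the
"ls -l /proc/<pid>/exe" check can be approximated with a few lines of
Python. This is only a rough sketch, not anything taken from blcr: the
helper names split_deleted_suffix and inspect_exe are made up for
illustration, and it assumes a Linux /proc filesystem. It reads the name
the kernel reports for the executable, checks for a trailing " (deleted)",
and then checks whether the untagged path is still present on disk.

```python
import os

# Suffix the kernel appends to the name of an unlinked file.
DELETED_SUFFIX = " (deleted)"


def split_deleted_suffix(name):
    """Return (path, tagged): the name without the suffix, and whether it was tagged."""
    if name.endswith(DELETED_SUFFIX):
        return name[:-len(DELETED_SUFFIX)], True
    return name, False


def inspect_exe(pid):
    """Report what /proc/<pid>/exe says about the executable (Linux only).

    Hypothetical helper: pid may be a process id or the string "self".
    """
    name = os.readlink("/proc/%s/exe" % pid)
    path, tagged = split_deleted_suffix(name)
    return {
        "name": name,                          # exactly as the kernel reports it
        "path": path,                          # name with the tag stripped
        "tagged_deleted": tagged,              # True if " (deleted)" was present
        "exists_on_disk": os.path.exists(path),
    }
```

Calling inspect_exe(pid) on the running program and getting
tagged_deleted and exists_on_disk both true would match the symptom
described above. Two caveats: a file whose real name ends in " (deleted)"
cannot be told apart by this check, and a path that still exists may
refer to a newer file if the executable was replaced (e.g. rebuilt) after
the program started.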