From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 01 2005 - 11:32:09 PDT
There are multiple things going on here. See below.

Adolfo J. Banchio wrote:
> Hi again,
>
> I've got modules loaded and I'm testing the BLCR for
> f90 codes compiled by Intel Fortran F90. I can run
> and "cr_checkpoint --term" the code (sometimes it does
> not really kill the job),

I can be fairly certain that we do send SIGTERM to the process.
However, that is all we do and the process is free to ignore the
signal. One could use '--signal 9' to send an unignorable kill signal,
which would not allow the application to perform any cleanup (but
sometimes we don't want the cleanup, which could delete files needed
for the restart).

> but when restarting,
> it also sometimes happens that I get the following
> messages in /var/log/messages
>
> kernel: vmadump: mmap failed:
> /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
>
> kernel: thaw_threads returned error, aborting. -1
>

What this tells me is that the application has created the file named
above and mmaped it. However, at some point *before* the checkpoint
was taken the file was deleted (so "--signal 9" won't help). This is a
perfectly legal thing to do: the kernel removes the directory entry
immediately but delays removing the file contents until the file is no
longer mmaped. Unfortunately, that means that by the time we go to
restore, the file is gone. There is very little I can do about this
immediately, except to move the error to checkpoint time to avoid
"false hopes" of restarting. In the longer term we do plan to
explicitly deal with deleted files.

>
> and the cr_restart stops with "killed".
> After this, if I try again to restart it would give
> "cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy",
> but the PID is free. And in /var/log/messages appears
>
> kernel: cr_rstrt_request_restart [14041]: PID conflict found by
> cr_reserve_ids()
>

This is a sign of a blcr bug.
We probably allocated the pid for the process and then failed to
de-allocate it when the restart failed due to the mmap problem.
Unfortunately, the particular pid is lost until the next reboot -
though this should not have any bad effect except preventing
restarting from this particular checkpoint. I have seen similar "lost
pids" when I've had more serious restart failures (such as a kernel
Oops). In those cases I was not able to track it down, but your bug
report gives me a way to reproduce this so I can figure out where we
lose track of the pid.

> This happens not on every checkpoint/restart, but very frequently.

"very frequently" probably means that the few times it did work you
got lucky that no mmaped-but-deleted files existed at the instant you
checkpointed.

> thanks in advance for any hint or help.

Thank you for the bug report. I am sorry that I don't currently have
any way to help you to checkpoint/restart your application.

-Paul

> adolfo
>
>

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
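
Illustration of the mmaped-but-deleted condition discussed in this
message. This is a minimal sketch and not part of BLCR: it assumes a
Linux system with /proc mounted, and the function names are
hypothetical. It maps a temporary file, unlinks it, and then shows that
the mapping stays usable while /proc/self/maps may mark the path as
"(deleted)", the same marker that appears in the vmadump log line.

```python
import mmap
import os
import tempfile


def mapped_deleted_entries():
    """Return /proc/self/maps lines whose backing file has been unlinked.

    Linux-specific: the kernel appends " (deleted)" to the path of a
    mapping whose directory entry no longer exists.
    """
    with open("/proc/self/maps") as maps:
        return [line.rstrip("\n") for line in maps
                if line.rstrip().endswith("(deleted)")]


def demo_mmap_then_unlink():
    """Map a temp file, delete it, and report what is still visible."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * mmap.PAGESIZE)
        mapping = mmap.mmap(fd, mmap.PAGESIZE)  # shared, read/write mapping
    finally:
        os.close(fd)  # the mapping survives closing the descriptor

    os.unlink(path)  # directory entry is removed immediately

    still_readable = mapping[:1] == b"x"   # contents live on via the mapping
    path_gone = not os.path.exists(path)   # but the path cannot be reopened
    entries = [e for e in mapped_deleted_entries() if path in e]
    mapping.close()
    return still_readable, path_gone, entries


if __name__ == "__main__":
    readable, gone, entries = demo_mmap_then_unlink()
    print("mapping still readable:", readable)
    print("path gone from filesystem:", gone)
    for entry in entries:
        print(entry)
```

A process in this state runs normally, but anything that later tries to
re-create the mapping by opening the path, as a restart has to, finds
nothing there. Running a similar check against /proc/<pid>/maps of the
target process before checkpointing would be one way to spot the
condition early.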