Re: thaw_threads returned error


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 01 2005 - 11:32:09 PDT

Next message: Adolfo J. Banchio: "Re: Unresolved simbols error when trying to install BLCR modules"

Previous message: Paul H. Hargrove: "Re: Unresolved simbols error when trying to install BLCR modules"
In reply to: Adolfo J. Banchio: "thaw_threads returned error"
Next in thread: Adolfo J. Banchio: "Re: thaw_threads returned error"
Reply: Adolfo J. Banchio: "Re: thaw_threads returned error"

There are multiple things going on here.  See below.

Adolfo J. Banchio wrote:
> Hi again,
> 
> I've got modules loaded and I'm testing the BLCR for
> f90 codes compiled by Intel Fortran F90. I can run
> and "cr_checkpoint --term" the code (sometimes it does
> not really kills the job), 

I am fairly certain that we do send SIGTERM to the process.  However,
that is all we do, and the process is free to ignore the signal.  One
could use '--signal 9' to send SIGKILL, which cannot be ignored.  That
would not allow the application to perform any cleanup (but sometimes we
don't want the cleanup, since it could delete files needed for the restart).
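
As a rough illustration of the difference between the two signals (this is not BLCR code), the sketch below starts a child process that ignores SIGTERM, shows that it survives the signal, and then ends it with SIGKILL.  It assumes a POSIX system; the function name `term_then_kill` and the one-second wait are arbitrary choices made for the example.

```python
import signal
import subprocess
import sys

# Child program: ignore SIGTERM, announce readiness, then idle.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def term_then_kill():
    """Return (survived_sigterm, exit_status) for a child that ignores SIGTERM."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    # Block until the child has installed its SIGTERM disposition.
    child.stdout.readline()

    child.send_signal(signal.SIGTERM)
    try:
        child.wait(timeout=1.0)
        survived_sigterm = False
    except subprocess.TimeoutExpired:
        survived_sigterm = True  # the signal was delivered and ignored

    # SIGKILL cannot be caught or ignored, so this always ends the child.
    child.kill()
    exit_status = child.wait()
    child.stdout.close()
    return survived_sigterm, exit_status


if __name__ == "__main__":
    print(term_then_kill())
```

A negative exit status from `subprocess` means the child was terminated by that signal number, so the second element reports which signal actually ended the process.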

> but when restarting,
> also sometimes happens that I get the following
> messages in /var/log/messages 
> 
>  kernel: vmadump: mmap failed:
> /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
> 
>  kernel: thaw_threads returned error, aborting. -1
> 

What this tells me is that the application has created the file named 
above and mmapped it.  However, at some point *before* the checkpoint was 
taken the file was deleted (so "--signal 9" won't help).  This is a 
perfectly legal thing to do: the kernel removes the directory entry 
immediately but delays removing the file contents until the file is no 
longer mmapped.  Unfortunately, that means that by the time we go to 
restore, the file is gone and the mmap at restart fails.
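
The delete-while-mapped behavior described above can be reproduced outside of BLCR.  The following is a minimal sketch, assuming a POSIX system; the function name `map_then_delete` and the file contents are made up for the illustration.

```python
import mmap
import os
import tempfile


def map_then_delete():
    """Map a temporary file, unlink it, and report what is still reachable."""
    payload = b"checkpoint me"
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # Length 0 maps the whole file; the mapping outlives the descriptor.
        mapping = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)

    # Remove the directory entry while the mapping is still active.
    os.unlink(path)

    result = {
        # The contents stay readable through the existing mapping...
        "contents_still_mapped": mapping[:] == payload,
        # ...but the name is gone, so nothing can open (or re-map) it by path.
        "path_exists": os.path.exists(path),
    }
    mapping.close()
    return result


if __name__ == "__main__":
    print(map_then_delete())
```

The running process keeps working because its mapping is still valid, while anything that later tries to map the file again by name, as a restart must, has nothing to open.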

There is very little I can do about this immediately, except to move the 
error to checkpoint time to avoid "false hopes" of restarting.

In the longer term we do plan to explicitly deal with deleted files.

> 
> and the cr_restart stop with "killed".
> After this, if I try again to restart it would give
> "cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy",
> but the PID is free. And in /var/log/messages appears
> 
> kernel: cr_rstrt_request_restart [14041]:  PID conflict found by
> cr_reserve_ids()
> 

This is a sign of a BLCR bug.  We probably allocated the pid for the 
process and then failed to de-allocate it when the restart failed due to 
the mmap problem.  Unfortunately, the particular pid is lost until the 
next reboot - though this should not have any bad effect except 
preventing restarting from this particular checkpoint.  I have seen 
similar "lost pids" when I've had more serious restart failures (such as 
a kernel Oops).  In those cases I was not able to track it down, but 
your bug report gives me a way to reproduce this so I can figure out 
where we lose track of the pid.

> This happens no on every checkpoint/restart, but very frequently.

"very frequently" probably means that the few times it did work you got 
lucky that no mmapped-but-deleted files existed at the instant you 
checkpointed.
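
One way to see in advance whether a process is in this state is to look at its memory map.  On Linux, /proc/<pid>/maps appends " (deleted)" to the pathname of a mapping whose file has been unlinked, which is the same suffix that appears in the kernel message quoted above.  The sketch below is not part of BLCR; the function names and the sample lines in the usage note are hypothetical.

```python
def deleted_mappings(maps_text):
    """Return pathnames marked '(deleted)' in /proc/<pid>/maps-style text."""
    suffix = " (deleted)"
    found = []
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        pathname = fields[5].rstrip()
        # Only real paths; skips pseudo-entries such as [heap] or [stack].
        if pathname.startswith("/") and pathname.endswith(suffix):
            name = pathname[: -len(suffix)]
            if name not in found:
                found.append(name)
    return found


def deleted_mappings_of(pid):
    """Read /proc/<pid>/maps (Linux only) and list its deleted mappings."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return deleted_mappings(maps_file.read())
```

For example, with made-up input:

```python
sample = (
    "08048000-08100000 r-xp 00000000 08:01 1234 /home/user/a.out (deleted)\n"
    "b7e00000-b7f00000 rw-p 00000000 00:00 0\n"
    "b7f00000-b7f20000 r-xp 00000000 08:01 5678 /lib/ld-2.3.2.so\n"
)
print(deleted_mappings(sample))
```

Running `deleted_mappings_of(pid)` just before a checkpoint would show whether the restart is likely to hit the mmap failure.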

> thanks in advance for any hint or help.

Thank you for the bug report.  I am sorry that I don't currently have 
any way to help you to checkpoint/restart your application.

-Paul

> adolfo
> 
> 

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
