From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 24 2005 - 11:29:15 PST
Michael,

Thanks for the bug report. Since you report that the machine requires a reset, I am certain the process is stuck in the kernel, and your example greatly narrows down where the problem lies.

When performing a checkpoint or a restart, we keep a count of the number of threads in the process and of how many of them have responded to the checkpoint request, to ensure they are all idle when we start writing the checkpoint file. Because one thread (the one running the aborting callback) has exited, the counts will never be equal. We deal with this possibility at checkpoint time by having a "watchdog" task that wakes once per minute to look for tasks that are part of a checkpoint but have exited. If any are found, we adjust the thread counts. We don't currently do this for a restart, but we probably should.

I am uncertain about why this worked in a 2.4 kernel but not a 2.6. This is the first thing I will look into.

The one thing that confuses me is the fact that some uninterruptible process was consuming 100% of CPU. There should be no spin waits in the kernel's checkpoint or restart code paths, so this may be an indication of a larger problem. There is a spin wait for thread synchronization in user space that you may have encountered. At that point the process probably has all signals blocked, making it *almost* uninterruptible.

Could you please determine whether sending SIGKILL (an unblockable signal) is capable of killing the CPU-consuming process? That would narrow down some things for me.

Michael Klemm wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> playing around I found the following bug:
>
> | DONE
> | FINISHED chkpt_callback
> | entering level 6 (return address 0x80486f7)
> | entering level 7 (return address 0x80486f7)
> | cr_core.c:467 cr_checkpoint: Callback 0 returned 1 - ABORTING
> |
> | entering level 8 (return address 0x80486f7)
>
> After printing the error message, the process freezes and continuously
> wastes 100% CPU and is uninterruptible.
> Also, I'm not able to shut down
> the machine. Instead, I'm forced to press the machine's reset button.
>
> The machine is a P4 3.06 HT DUAL running SuSE Linux, kernel version
> 2.6.5-7.145-smp. I checked the same program on my other Linux box
> running kernel 2.4.29. On this machine, BLCR works fine although it also
> reports an aborted restart process (that's the correct behavior).
>
> Regards
> -michael
>
> - --
> Computer Science Department 2, University of Erlangen-Nuremberg
> Martensstrasse 3, D-91058 Erlangen, Germany
> phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
> web: http://www2.informatik.uni-erlangen.de/~klemm
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.4 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFCHcy9WEu1syWqdn0RAq78AKCwZBky/zTtkJjjp4sFO1V3k6jNFQCffIiP
> SFsXyOKmAAQ0f0XTfB4O3hE=
> =i3zw
> -----END PGP SIGNATURE-----
>
>
> ------------------------------------------------------------------------
>
> #include <stdio.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include "libcr.h"
>
> int recursive(int level) {
>     int i;
>     int result;
>     for(i = 0; i < level * 2; i++) {
>         fprintf(stderr, " ");
>     }
>     fprintf(stderr, "entering level %d (return address 0x%x)\n",
>             level, __builtin_return_address(0));
>
>     if (level == 10) {
>         result = 1;
>     }
>     else {
>         if (level == 5) {
>             fprintf(stderr, "WAITING FOR USER TO CHECKPOINT...\n");
>             sleep(60);
>             fprintf(stderr, "DONE\n");
>         }
>         result = recursive(level+1) + level;
>     }
>
>     for(i = 0; i < level * 2; i++) {
>         fprintf(stderr, " ");
>     }
>     fprintf(stderr, "leaving level %d\n", level);
>
>     return result;
> }
>
> #if 0
> int chkpt_callback(void *arg) {
>     fprintf(stderr, "BLCR CALLED %s(0x%x)\n", __FUNCTION__, (unsigned int)arg);
>     int result = cr_checkpoint(CR_CHECKPOINT_READY);
>     fprintf(stderr, "FINISHED %s\n", __FUNCTION__);
>     if(result > 0)
>         return 0;
>     return result;
> }
> #endif
>
> int chkpt_callback(void *arg) {
>     fprintf(stderr, "BLCR CALLED %s(0x%x)\n", __FUNCTION__, (unsigned int)arg);
>     int result = cr_checkpoint(CR_CHECKPOINT_READY);
>     fprintf(stderr, "FINISHED %s\n", __FUNCTION__);
>     return result;
> }
>
> int main(int argc, char **argv) {
>     cr_init();
>     cr_register_callback(chkpt_callback, NULL, CR_THREAD_CONTEXT);
>
>     fprintf(stderr, "MY PID IS %d\n", getpid());
>
>     fprintf(stderr, "\n\nresult: %d\n", recursive(0));
>
>     return 0;
> }

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
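
Addendum: the thread-count bookkeeping described in the reply (an expected count, a responded count, and a watchdog that corrects the expected count for tasks that have exited) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name here (struct request, task_respond, task_exit, watchdog_scan, MAX_TASKS) is hypothetical and not taken from BLCR, and the real logic runs in the kernel module rather than in a single-threaded simulation like this one.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: all names are hypothetical, not BLCR's. */
#define MAX_TASKS 8

enum task_state {
    TASK_PENDING,    /* part of the request, has not responded yet */
    TASK_RESPONDED,  /* has responded and is idle */
    TASK_EXITED,     /* exited without responding, not yet noticed */
    TASK_REAPED      /* exit noticed by the watchdog, count adjusted */
};

struct request {
    int ntasks;                        /* tasks tracked by this request */
    int expected;                      /* responses we are waiting for */
    int responded;                     /* responses received so far */
    enum task_state state[MAX_TASKS];
};

void request_init(struct request *r, int ntasks)
{
    assert(ntasks >= 0 && ntasks <= MAX_TASKS);
    r->ntasks = ntasks;
    r->expected = ntasks;
    r->responded = 0;
    for (int i = 0; i < ntasks; i++)
        r->state[i] = TASK_PENDING;
}

/* A task reaches the synchronization point. */
void task_respond(struct request *r, int i)
{
    if (r->state[i] == TASK_PENDING) {
        r->state[i] = TASK_RESPONDED;
        r->responded++;
    }
}

/* A task exits without ever responding (e.g. after an aborting callback). */
void task_exit(struct request *r, int i)
{
    if (r->state[i] == TASK_PENDING)
        r->state[i] = TASK_EXITED;
}

/* The request may proceed only when both counts agree. */
bool request_ready(const struct request *r)
{
    return r->responded == r->expected;
}

/* Periodic watchdog pass: stop waiting for tasks that have exited.
 * Returns the number of tasks whose exit was newly accounted for. */
int watchdog_scan(struct request *r)
{
    int adjusted = 0;
    for (int i = 0; i < r->ntasks; i++) {
        if (r->state[i] == TASK_EXITED) {
            r->state[i] = TASK_REAPED;
            r->expected--;
            adjusted++;
        }
    }
    return adjusted;
}
```

In this model, with three tasks of which two respond and one exits, request_ready() stays false until watchdog_scan() lowers the expected count. Without such a scan, which is the restart case described in the reply, the two counts never become equal and the wait never completes.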