jcduell_at_lbl.gov
Date: Wed Feb 19 2003 - 13:43:45 PST
At long last, I've added kernel tracing to our checkpoint module. The most noticeable immediate effect is that our module no longer verbigerates incessantly into the system log every time you do a checkpoint or restart. Instead, you have to enable its verbigerations (via './configure --enable-kernel-tracing').

The system now has three ways of printing a message. Try to use the right one for new messages (I've changed all the existing printk's to use whichever seemed appropriate):

1) Messages that should always be printed. These are mainly messages that indicate that an internal logic error, or something equally horrible, has happened. There is now a set of CR_ERR/CR_WARN/CR_INFO macros for these.

2) Permanent statements in the code that print at some event that you might want to trace. There is now a batch of different tracing event types, each with its own macro. So I've got CR_KTRACE_FUNC_ENTRY/EXIT, which you can turn on if you want to see a tracing message every time a function is entered/exited (assuming you've added a tracing message to each function: not all functions have them right now). This event is not one of the ones that is on by default. The events that seemed to merit being on by default were "high-level" events ("phase 2 entered", etc.), bad parameter or system limit warnings, and "unexpected" events ("can't restore PID").

3) As a special case, I've created a CR_KTRACE_DEBUG() macro that is intended to be used only during debugging, i.e., you shouldn't check in code that still has the macro in it. Just use it to find your own immediate bug. This one is also on by default.

All the macros are like printf in that you can pass them a format string and parameters. You don't need to pass in a string, though, if there's no need to (as with the function entry/exit macros). The function name, file, line number, and pid are all printed as part of each macro.
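For anyone who wants a picture of how event-masked tracing like this can be put together, here is a minimal user-space sketch. It is not the module's code. The macro names CR_KTRACE_FUNC_ENTRY, CR_KTRACE_FUNC_EXIT, and CR_KTRACE_DEBUG come from the description above, but everything else (the CR_EV_* bits, cr_trace_mask, cr_trace_last, cr_trace_emit(), and the output layout) is hypothetical and made up for illustration. It also uses stderr and getpid() where kernel code would use printk and the current task's pid.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical event bits; the module's real event list is not shown here. */
enum {
    CR_EV_FUNC_ENTRY = 1u << 0,
    CR_EV_FUNC_EXIT  = 1u << 1,
    CR_EV_HIGH_LVL   = 1u << 2,
    CR_EV_DEBUG      = 1u << 3
};

/* Assumed defaults: high-level and debug events on, entry/exit off. */
static unsigned int cr_trace_mask = CR_EV_HIGH_LVL | CR_EV_DEBUG;

/* Copy of the last line emitted, kept only so the sketch is easy to check. */
static char cr_trace_last[256];

/* Returns 1 if the event was enabled and a line was printed, else 0. */
static int cr_trace_emit(unsigned int event, const char *func,
                         const char *file, int line, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(fmt != NULL);
    if (!(cr_trace_mask & event))
        return 0;

    /* Prefix every message with function, file, line, and pid. */
    n = snprintf(cr_trace_last, sizeof cr_trace_last, "%s <%s:%d>, pid %d: ",
                 func, file, line, (int)getpid());
    if (n >= 0 && (size_t)n < sizeof cr_trace_last) {
        va_start(ap, fmt);
        vsnprintf(cr_trace_last + n, sizeof cr_trace_last - (size_t)n, fmt, ap);
        va_end(ap);
    }
    fprintf(stderr, "%s\n", cr_trace_last);
    return 1;
}

/*
 * The "" before __VA_ARGS__ lets a macro be called with no arguments at all:
 * it concatenates with a literal format string when one is given, and stands
 * in as an empty format when none is. The format must be a string literal.
 */
#define CR_KTRACE_FUNC_ENTRY(...) \
    cr_trace_emit(CR_EV_FUNC_ENTRY, __func__, __FILE__, __LINE__, "" __VA_ARGS__)
#define CR_KTRACE_FUNC_EXIT(...) \
    cr_trace_emit(CR_EV_FUNC_EXIT, __func__, __FILE__, __LINE__, "" __VA_ARGS__)
#define CR_KTRACE_DEBUG(...) \
    cr_trace_emit(CR_EV_DEBUG, __func__, __FILE__, __LINE__, "" __VA_ARGS__)
```

With the assumed defaults, a call such as CR_KTRACE_DEBUG("x=%d", 5) prints a line, while CR_KTRACE_FUNC_ENTRY() prints nothing until CR_EV_FUNC_ENTRY is OR'ed into cr_trace_mask. Having each macro return whether it printed is purely a convenience of this sketch.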
I may have gone into overkill mode in the number of different events I came up with, but what the hell. It shouldn't do any harm.

I'm trying to make sure I actually spend 1/2 my time this semester on checkpoint/restart. I figure the next things I ought to work on are

1) Documentation: we should talk about what we want done, and what format to use.

2) Getting UML to work with VMADump. I think it would save us a lot of development time going forward if we could use gdb...

3) Getting VMADump to not buffer pages as they are written. This isn't as big a priority as getting file handles to work (which I assume Eric is going to do), but it's close. People will want jobs that take up more than 1/2 the RAM on a machine to checkpoint in a reasonable amount of time...

Perhaps we ought to have one of our famous "checkpoint club" meetings to set up a timeline for our development and the 1st release...

-- 
Jason Duell                Future Technologies Group
<jcduell_at_lbl_dot_gov>   High Performance Computing Research Dept.
Tel: +1-510-495-2354       Lawrence Berkeley National Laboratory