jcduell_at_lbl.gov
Date: Thu Jan 02 2003 - 12:38:16 PST
Here is the paper on our checkpoint/restart system that I wrote for my class last semester. It's mainly just a high-level overview of the architecture, but there are a couple things worth noting:

1) I gave us an acronym: BLCR (Berkeley Lab Linux Checkpoint/Restart). If we hate it, we can change it.

2) I refer in the paper to 'signal-based' and 'thread-based' checkpoint 'callbacks' rather than our current, utterly confusing nomenclature of 'synchronous' and 'asynchronous' 'handlers'. Calling a function that gets run in signal handler context 'synchronous' is just confusing, and calling user functions 'handlers' makes them sound too much like signal handlers. I propose we change our docs and APIs, too.

3) I have some performance numbers, which are more interesting than expected (we suck badly at checkpointing jobs whose VM size is half or more of the physical memory size, and we'll need to fix it).

4) At least one of the ideas I discuss in the optimizations section is new (or at least I don't recall ever discussing it with anyone).

There are some typos and unfilled-in references in the paper, which I'm going to fix at some point. But I want to get this out to you all now, before the real work of the new year diverts your attention.

Cheers,

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  High Performance Computing Research Dept.
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory