From: jcduell_at_lbl_dot_gov
Date: Fri Jul 23 2004 - 13:07:02 PDT

On Fri, Jul 23, 2004 at 02:51:39PM -0400, Grigory Bronevetsky wrote:
> I'm trying to evaluate BLCR by checkpoint some codes from the SPLASH-2
> benchmark suite but I am getting very strange results. In particular, when
> I checkpoint radix, which should have at least 800 MB of state, BLCR
> produces a checkpoint that is between 237MB and 369MB in size. This
> doesn't make sense to me. Are you implementing some kinds of optimizations
> like checkpoint compression or page-touch detection? I can't find any
> mention of this in the BLCR papers but given the small checkpoint sizes, I
> can't find another explanation.

We do several optimizations: first, we do not save program text, nor
that of shared libraries. This by itself may account for the difference.

We also do not save "zero pages", i.e. those that have never been touched
and logically contain all 0s (calloc calls, large untouched static arrays,
etc.).

We don't do any compression.

In the future we'll support saving the program text, so that programs can
be migrated onto nodes where the program/libraries may not be present.

Are your checkpoints restarting?

Cheers,

-- 
Jason Duell               Future Technologies Group
<jcduell_at_lbl_dot_gov>  Computational Research Division
Tel: +1-510-495-2354      Lawrence Berkeley National Laboratory
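
To get a rough feel for how much skipping zero pages can shrink a checkpoint, here is a minimal sketch in Python. It is not BLCR code and does not reflect how BLCR detects untouched pages; it simply scans a memory image and counts the pages that contain at least one non-zero byte. The 4096-byte page size and the names `count_pages` and `estimated_checkpoint_bytes` are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; the real value is platform dependent


def count_pages(memory: bytes, page_size: int = PAGE_SIZE):
    """Return (total_pages, nonzero_pages) for a flat memory image.

    Hypothetical helper: scans the content of each page, which is only an
    illustration and not the mechanism a kernel-level checkpointer uses.
    """
    total = 0
    nonzero = 0
    for offset in range(0, len(memory), page_size):
        page = memory[offset:offset + page_size]
        total += 1
        # A page is a "zero page" only if every byte in it is 0.
        if page.count(0) != len(page):
            nonzero += 1
    return total, nonzero


def estimated_checkpoint_bytes(memory: bytes, page_size: int = PAGE_SIZE) -> int:
    """Estimate the bytes saved to disk if only non-zero pages are written."""
    _, nonzero = count_pages(memory, page_size)
    return nonzero * page_size
```

For example, an image of 8 pages in which only 3 pages were ever written to would need about 3 pages of storage under this scheme, regardless of the 8 pages the process has mapped. A program that allocates large arrays but touches only part of them can therefore produce a checkpoint well below its nominal memory footprint.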