From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri May 02 2008 - 16:30:31 PDT
Gijsbert Wiesenekker wrote:
> Paul,
>
> I have a question that I could not find easily in the documentation.
> When I checkpoint a task that uses almost all physical memory (8GB),
> taking a checkpoint takes about three hours (and the system becomes
> almost unusable). Obviously this is because taking the checkpoint pushes
> the system beyond its memory limits.
> How much memory is needed to make a checkpoint?
>
> Regards,
> Gijsbert
[snip]

Gijsbert,

In testing we've done in the past, we've measured checkpoint time as a
function of application memory size. What we found is that time was
roughly linear with memory until the memory exceeded about 5/8 of
physical memory (so 5GB on your 8GB machine). Beyond that level of
memory usage, the time grew faster than linearly (though we never ran
any 3 hr test cases). So, I'd recommend keeping the app's usage below
3/4 as a rule of thumb. However, the actual behavior seemed to vary with
kernel version, presumably due to changes in memory management policy.

The very simple code we used to run these tests is in the directory
examples/io_bench of the BLCR sources. The executable takes 1 argument,
the memory size in MB, and reports the time to checkpoint.

We hope in the future to be able to take some advantage of O_DIRECT to
avoid the buffer management that pushes things so hard when using a
large fraction of physical memory.

-Paul

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
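
For anyone wanting to apply the fractions quoted in the message to their own machine, the arithmetic can be sketched as below. This is only an illustration: the function and constant names are made up for this note and are not part of BLCR, and the 5/8 and 3/4 figures are the rough, kernel-dependent observations from the message, not guarantees.

```python
from fractions import Fraction

# Fractions taken from the message above. The names are hypothetical and
# exist only for this sketch; they are not BLCR identifiers.
LINEAR_LIMIT = Fraction(5, 8)   # checkpoint time was roughly linear up to here
RULE_OF_THUMB = Fraction(3, 4)  # suggested ceiling for application memory


def checkpoint_memory_thresholds(physical_mb):
    """Return (linear_limit_mb, rule_of_thumb_mb) for physical_mb of RAM."""
    return (
        float(physical_mb * LINEAR_LIMIT),
        float(physical_mb * RULE_OF_THUMB),
    )


# Example: the 8GB machine from the question, expressed in MB.
linear_mb, ceiling_mb = checkpoint_memory_thresholds(8 * 1024)
print(linear_mb, ceiling_mb)  # prints: 5120.0 6144.0
```

The results are in MB, the same unit the message says the examples/io_bench executable takes as its argument, so they could serve as starting points for measuring the behavior on a particular kernel.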