From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Apr 18 2007 - 14:26:07 PDT
Thomas Zeiser wrote:
> Dear All,
>
> is there a 2 GB process limit for checkpointing on x86_64??

There is no intentional limit or technical limitation. It is possible
that you've encountered a BLCR bug.

>
> On your system with
> - SuSE SLES9sp3 x86_64 (kernel contains in addition Voltaire
>   Infiniband and Intel VTune modules)
> - blcr-0.5.3 built from source rpm
> - socket nodes with Intel Xeon 5100 ("Woodcrest") CPUs
> - I'm doing the tests from /tmp (formatted with reiserfs) using
>   cr_run
>
> I observe the following:
> - checkpointing and restarting a process with <2GB total size works
>   fine ("simple" sequential Fortran code compiled with Intel 9.1 EM64T
>   compilers, no sockets etc. open, just a few plain files)
>   => no problems at all.
>
> however, if I increase the working set to >2GB memory footprint
> (i.e. same executable as memory is allocated dynamically)
> - when calling "cr_checkpoint --term PID" the system often starts
>   to swap (e.g. for 5 GB working set on a system with 8 GB RAM)

The swapping is "normal" for application working sets larger than about
1/2 of physical memory, as the dump process will end up creating I/O
buffers of equal volume. We hope to work around that in the future.

> - it takes quite long time and suddenly cr_checkpoint disappears
>   (with exit code 5 if I've seen it correctly) but no context.###
>   file has been written

The long time is probably the swapping. No file is written because
cr_checkpoint is writing to a temporary file that is renamed on success
but unlinked on error. There is currently no way to keep the file on
error. The exit code 5 corresponds to errno=EIO, consistent with the
message on STDERR.

> - on STDERR I see
>   ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
> - there are no further messages in dmesg or syslog
> - and the application continues to run (despite --term, but that
>   might be fine as no context file is written)
>   => no restart for >2GB although OS and application are 64-bit !?

The lack of a context file *is* why the app continues to run.

> Any ideas? Did I miss something?

The first thing that comes to mind is to check for rlimit problems. Run
"ulimit -a" in a Bourne-type shell, or "limit" in a C shell, and check
whether the "filesize" limit is anything other than "unlimited".

>
>
> Regards,
>
> thomas

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
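
For anyone who would rather script the "filesize" check suggested above than read the shell output by eye, here is a minimal sketch using Python's standard resource module. The helper names are made up for illustration, and the 5 GB figure simply reuses the working-set size from the report on the assumption that the context file would be roughly that large. Resource limits are inherited by child processes, so it should be run from the same shell or batch environment that launches the job.

```python
import resource


def describe_limit(value):
    """Render an rlimit value as a byte count or 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def fsize_limit_allows(nbytes, soft_limit):
    """Return True if a file of nbytes fits under the given soft RLIMIT_FSIZE."""
    return soft_limit == resource.RLIM_INFINITY or nbytes <= soft_limit


if __name__ == "__main__":
    # RLIMIT_FSIZE is the per-process maximum file size ("filesize" in csh).
    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)
    print("filesize soft limit:", describe_limit(soft))
    print("filesize hard limit:", describe_limit(hard))

    # Assumption: the context file is about as large as the 5 GB working set.
    wanted = 5 * 1024**3
    print("5 GB context file fits:", fsize_limit_allows(wanted, soft))
```

If the last line prints False, the limit is worth raising before retrying the checkpoint. A True result only rules out this one cause.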