From: Thomas Zeiser (thomas.zeiser_at_rrze.uni-erlangen.de)
Date: Wed Apr 18 2007 - 11:13:24 PDT
Dear All,

is there a 2 GB process limit for checkpointing on x86_64?

On our system with
- SuSE SLES9sp3 x86_64 (the kernel additionally contains Voltaire Infiniband and Intel VTune modules)
- blcr-0.5.3 built from the source rpm
- socket nodes with Intel Xeon 5100 ("Woodcrest") CPUs
- tests run from /tmp (formatted with reiserfs) using cr_run

I observe the following:

- checkpointing and restarting a process with <2 GB total size works fine ("simple" sequential Fortran code compiled with the Intel 9.1 EM64T compilers, no sockets etc. open, just a few plain files)
  => no problems at all.

However, if I increase the working set to a >2 GB memory footprint (i.e. the same executable, as memory is allocated dynamically):

- when calling "cr_checkpoint --term PID" the system often starts to swap (e.g. for a 5 GB working set on a system with 8 GB RAM)
- it takes quite a long time, and then cr_checkpoint suddenly disappears (with exit code 5, if I've seen it correctly), but no context.### file has been written
- on STDERR I see
    ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
- there are no further messages in dmesg or syslog
- and the application continues to run (despite --term, but that might be fine as no context file is written)
  => no restart for >2 GB although OS and application are 64-bit!?

Any ideas? Did I miss something?

Regards,

thomas

--
Thomas ZEISER
Regionales Rechenzentrum Erlangen
University of Erlangen-Nuremberg, Germany