checkpointing processes with >2GB on x86_64

From: Thomas Zeiser (thomas.zeiser_at_rrze.uni-erlangen.de)
Date: Wed Apr 18 2007 - 11:13:24 PDT

  • Next message: Paul H. Hargrove: "Re: checkpointing processes with >2GB on x86_64"
    Dear All,
    
    Is there a 2 GB process-size limit for checkpointing on x86_64?
    
    On our system with
    - SuSE SLES9sp3 x86_64 (the kernel additionally contains Voltaire
      InfiniBand and Intel VTune modules)
    - blcr-0.5.3 built from source rpm
    - socket nodes with Intel Xeon 5100 ("Woodcrest") CPUs
    - tests run from /tmp (formatted with reiserfs) using
      cr_run
    
    I observe the following:
    - checkpointing and restarting a process with <2GB total size works
      fine ("simple" sequential Fortran code compiled with Intel 9.1 EM64T
      compilers, no sockets etc. open, just a few plain files)
      => no problems at all.
    
    However, if I increase the working set to a >2GB memory footprint
    (i.e. the same executable, as memory is allocated dynamically):
    - when calling "cr_checkpoint --term PID" the system often starts
      to swap (e.g. for a 5 GB working set on a system with 8 GB RAM)
    - it takes quite a long time and then cr_checkpoint suddenly exits
      (with exit code 5, if I've seen it correctly), but no context.###
      file has been written
    - on STDERR I see
    ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
    - there are no further messages in dmesg or syslog
    - and the application continues to run (despite --term, but that
      might be fine as no context file is written)
      => no restart for >2GB although OS and application are 64-bit !?
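
    For anyone who wants to try reproducing the size dependence without
    the Fortran code, here is a minimal sketch of a stand-in process. It
    is not the original test program: it is a Python substitute, and the
    names allocate_and_touch, hold and PAGE_SIZE are hypothetical, as is
    the assumed 4 KiB page size. It allocates a buffer of a chosen size,
    writes one byte per page so the memory is actually resident, and then
    sleeps so that it can be checkpointed from another shell.

```python
import os
import time

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual default on x86_64 Linux


def allocate_and_touch(n_bytes, page_size=PAGE_SIZE):
    """Allocate n_bytes and write one byte per page so every page is backed by memory."""
    buf = bytearray(n_bytes)
    for offset in range(0, n_bytes, page_size):
        buf[offset] = 1
    return buf


def hold(n_bytes, seconds, page_size=PAGE_SIZE):
    """Keep n_bytes of touched memory alive for `seconds`, then return the buffer size."""
    buf = allocate_and_touch(n_bytes, page_size)
    print(f"pid {os.getpid()} holds {len(buf)} bytes", flush=True)
    time.sleep(seconds)  # window in which the process can be checkpointed
    return len(buf)
```

    Calling e.g. hold(5 * 2**30, 3600) in a script started under cr_run,
    and then "cr_checkpoint --term PID" with the printed pid, would mirror
    the steps above. Whether a Python interpreter checkpoints cleanly
    under blcr-0.5.3 is an assumption that has not been verified here.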
    
    Any ideas? Did I miss something?
    
    
    Regards,
    
    thomas
    -- 
    Thomas ZEISER
    Regionales Rechenzentrum Erlangen
    University of Erlangen-Nuremberg, Germany
    
