Re: cr_checkpoint error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Feb 29 2008 - 11:29:26 PST

    Yuan,
      The status value 1 is EPERM on Linux; therefore the message "Operation 
    not permitted".
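
As a quick sanity check of that mapping, here is a minimal Python sketch (not part of BLCR, and it assumes a Linux host) that looks up errno 1 with the standard `errno` and `os` modules:

```python
import errno
import os

# On Linux, errno value 1 is EPERM.
print(errno.EPERM)          # numeric value of EPERM
print(errno.errorcode[1])   # symbolic name for errno 1
print(os.strerror(1))       # the C library's message text for errno 1
```

The last line prints the same text that cr_checkpoint appends to its ioctl error message.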
      For security reasons, the rules for when you can checkpoint a process 
    are the same as the rules that govern getting a coredump by sending SIGABRT:
    1) You must have the same effective uid as the process to checkpoint, or 
    be root
    2) The process must not be running a setuid/setgid executable
    3) The process must be a "real" process (not a kernel thread)
    
    The default "scope" of a checkpoint is the process you specify plus all 
    its descendants (a process "tree").  If your checkpoint request included 
    any processes that didn't meet the requirements listed above, then you 
    will see EPERM.
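
To hunt for the offending process, something like the following Python sketch may help. It is only an approximation of the three rules above, built on /proc; it is not how BLCR itself decides. The function names are made up for illustration, the setuid/setgid test is a heuristic (it compares the real, effective and saved ids), and an empty /proc/<pid>/cmdline is used as a stand-in for "kernel thread", which also matches a process that is in the middle of exiting.

```python
import os

def read_status(pid):
    """Parse /proc/<pid>/status into a dict of field name -> value string."""
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key] = value.strip()
    return fields

def is_kernel_thread(pid):
    # Heuristic: kernel threads (and exiting processes) have an empty cmdline.
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return f.read() == b""

def descendants(root_pid):
    """Return root_pid plus every process whose ancestry leads back to it."""
    children = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            ppid = int(read_status(entry)["PPid"])
        except (OSError, KeyError, ValueError):
            continue  # the process went away while we were scanning
        children.setdefault(ppid, []).append(int(entry))
    tree, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        tree.append(pid)
        stack.extend(children.get(pid, []))
    return tree

def likely_blockers(root_pid):
    """List (pid, reason) pairs for processes that may trigger EPERM."""
    my_euid = os.geteuid()
    problems = []
    for pid in descendants(root_pid):
        try:
            status = read_status(pid)
            uids = status["Uid"].split()  # real, effective, saved, filesystem
            gids = status["Gid"].split()
            if is_kernel_thread(pid):
                problems.append((pid, "empty cmdline: kernel thread or exiting"))
            elif my_euid != 0 and int(uids[1]) != my_euid:
                problems.append((pid, "effective uid differs from ours"))
            elif len(set(uids[:3])) > 1 or len(set(gids[:3])) > 1:
                problems.append((pid, "ids differ: possibly setuid/setgid"))
        except (OSError, KeyError, ValueError):
            continue  # raced with process exit
    return problems
```

Calling `likely_blockers(10983)` on the node, as the user who issues the checkpoint, would list any processes in that tree that look suspicious under these assumptions. An empty list does not guarantee that the checkpoint will succeed.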
    
    There is also a possibility that you may have processes doing a lot of 
    fork/exit.  If that is the case, then in some instances you may see 
    EPERM when a target process is exiting (because it temporarily looks to 
    BLCR like a kernel thread).  However, that should be very rare.
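
If you suspect this transient case, one way to tolerate it is to retry the request a small number of times. The Python sketch below is an assumption-laden illustration, not a BLCR feature: `request` is a hypothetical callable that issues the checkpoint and raises `OSError` with `errno` set to EPERM on failure, and the attempt count and delay are arbitrary.

```python
import errno
import time

def retry_on_eperm(request, attempts=3, delay=0.5):
    """Call request(); retry only when it fails with EPERM."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except OSError as exc:
            # Re-raise anything that is not EPERM, and EPERM on the last try.
            if exc.errno != errno.EPERM or attempt == attempts:
                raise
            time.sleep(delay)
```

Keep the attempt count small: if the EPERM comes from one of the three rules above rather than from an exiting process, retrying will not help and only delays the error.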
    
    If this doesn't help you resolve or explain your problem, let us know 
    and we'll see what we can do.
    -Paul
    
    Yuan Wan wrote:
    >
    > Hi all,
    >
    > I get the following error message during my cross-node 
    > checkpoint/restart test:
    > --------------------------------------------------------------------------- 
    >
    > Checkpoint command: cr_checkpoint -f context_4644030.2 --run 10983
    > ioctl(/proc/checkpoint/ctrl,  CR_OP_CHKPT_REQ): Operation not permitted
    > --------------------------------------------------------------------------- 
    >
    >
    > the status value returned by this operation is 1 rather than 0
    >
    > This error appears randomly on some nodes for some jobs, but the same 
    > checkpoint operation on other jobs running exactly the same code 
    > works fine.
    >
    > Can anyone explain this error?
    >
    > --Yuan
    >
    > Yuan Wan
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
