From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri Feb 29 2008 - 11:29:26 PST
Yuan,

The status value 1 is EPERM on Linux, hence the message "Operation not permitted".

For security reasons, the rules for when you can checkpoint a process are the same as the rules that govern getting a core dump by sending SIGABRT:
1) You must have the same effective uid as the process to checkpoint, or be root
2) The process must not be running a setuid/setgid executable
3) The process must be a "real" process (not a kernel thread)

The default "scope" of a checkpoint is the process you specify plus all its descendants (a process "tree"). If your checkpoint request included any processes that didn't meet the requirements listed above, then you will see EPERM.

There is also a possibility that you have processes doing a lot of fork/exit. If that is the case, then in some instances you may see EPERM when a target process is exiting (because it temporarily looks to BLCR like a kernel thread). However, that should be very rare.

If this doesn't help you resolve or explain your problem, let us know and we'll see what we can do.

-Paul

Yuan Wan wrote:
>
> Hi all,
>
> I get the following error message during my cross-node
> checkpoint/restart test:
> ---------------------------------------------------------------------------
>
> Checkpoint command: cr_checkpoint -f context_4644030.2 --run 10983
> ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REQ): Operation not permitted
> ---------------------------------------------------------------------------
>
>
> the status value returned by this operation is 1 rather than 0
>
> This error appears randomly on some nodes for some jobs, but the same
> checkpoint operation of other jobs which are exactly of same codes
> works fine.
>
> Can anyone explain this error?
>
> --Yuan
>
> Yuan Wan

--
Paul H. Hargrove PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
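
As an illustration of the three rules in the reply above, here is a minimal Python sketch that transcribes them into a plain function. It is not BLCR's kernel logic: the function names `checkpoint_blockers` and `describe_status`, their parameters, and the reason strings are all hypothetical, and the caller is assumed to have already gathered the uid, setuid/setgid, and kernel-thread facts for each process in the tree.

```python
import errno
import os


def checkpoint_blockers(caller_euid, target_euid, runs_setid, is_kernel_thread):
    """Return the reasons (possibly none) a target fails the three listed rules.

    Hypothetical helper: a direct transcription of the rules in the reply,
    not the check that BLCR performs in the kernel.
    """
    blockers = []
    # Rule 1: same effective uid as the target, or root (euid 0).
    if caller_euid != 0 and caller_euid != target_euid:
        blockers.append("effective uid mismatch")
    # Rule 2: the target must not be running a setuid/setgid executable.
    if runs_setid:
        blockers.append("setuid/setgid executable")
    # Rule 3: the target must be a "real" process, not a kernel thread.
    if is_kernel_thread:
        blockers.append("kernel thread")
    return blockers


def describe_status(status):
    """Map a numeric status value to its errno name and system message."""
    return errno.errorcode.get(status, "UNKNOWN"), os.strerror(status)
```

Under these assumptions, a non-empty result for any process in the requested tree corresponds to the EPERM case described in the reply, and `describe_status(1)` shows how the status value 1 maps to the errno name and message on the local system.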