From: Ted Cabeen (cabeen_at_chem_dot_ucsb_dot_edu)
Date: Thu Feb 19 2009 - 10:47:33 PST
In this case, I am running BLCR with torque, so I don't have direct
access to exactly what file handles torque has open. Looking in the
/proc filesystem while the job is running (not checkpointed), there are
three processes, all of which have an fd 16 pointing at the same pipe:

29301/fd/16 -> pipe:[249689]
29302/fd/16 -> pipe:[249689]
29304/fd/16 -> pipe:[249689]

Those three processes map to the user's shell, the copy of sh running
the user's job script, and the active process of the job (in this case,
just a sleep for testing).

Is that helpful?

--Ted

Paul H. Hargrove wrote:
> Ted,
> Thanks for your patience. The restore-ids code itself seems to be
> doing just what it should: dropping the root privilege before performing
> any fs permission checks, preventing use of a maliciously modified
> checkpoint context file as a way to access otherwise inaccessible
> files. However, there seem to be problems with files (nscd is just one
> example) that were originally opened *with* some privilege. Anything
> set up by the batch system is a candidate for such problems.
>
> For the new problem, I can see what is going wrong, but don't know
> enough to suggest a solution.
> The term "external pipe" means that at the time the checkpoint was
> taken there was a pipe that had only one endpoint within the scope of
> the checkpoint, while the other was not. In a batch scheduled
> environment this is often the case for the std{in,out,err} descriptors,
> but in your case the error says fd=16, so it must be something else.
> When an "external pipe" is encountered in the context file at restart
> time, BLCR's behavior is to try to connect this fd to the same file as
> the stdin or stdout (depending on which end of the pipe is external) of
> the cr_restart process. In your case the user has insufficient
> permission to do so, most likely because the cr_restart was launched as
> root and root owns the file (or device) that is used for std{in,out}.
>
> Since I don't know what fd 16 was being used for, I can't be certain
> that connecting it to stdin or stdout is even the right thing to do. If
> it is the right thing, then I will need to go back to looking at the
> BLCR source code and determine if the permission checks being performed
> (the ones that yield the first error) are required for
> correctness/security. My initial thought is that if the stdin/out of
> cr_restart have not been marked close-on-exec, then any child it creates
> potentially has access to those fds as its own stdin/out and bypassing
> fs permissions sounds like the right thing to do.
>
> So, it is possible that BLCR needs to be doing something different with
> respect to the permissions when reopening external pipes. I will look
> into that and get back to you. However, I'd appreciate it if you could
> also look into figuring out what fd 16 is being used for. It is
> entirely possible that it is something that will require a different
> approach.
>
> -Paul
>
> Ted Cabeen wrote:
>> All right. I've disabled nscd, and we're on to the next problem.
>> (Sorry I missed that note in the FAQ.) When I restart a job with
>> enable-restore-ids, I get the following error:
>> - Error -13 from cr_filp_reopen() while restoring external pipe
>> - cr_restore_all_files [3766]: Unable to restore fd 16 (type=4,err=-13)
>> - cr_rstrt_child [3766]: Unable to restore files! (err=-13)
>> Restart failed: Permission denied
>>
>> My suspended jobs start fine when compiled without enable-restore-ids.
>>
>> Thoughts?
>>
>> --Ted
>>
>> Paul H. Hargrove wrote:
>>> Ted,
>>>
>>> You are right about the nscd cache file being opened as root (or
>>> other "system" id). The program acquires the file descriptor via fd
>>> passing from a privileged daemon process.
>>> Since we can't safely reopen this file as the user and we are equally
>>> unable to reproduce the descriptor passing from the daemon, BLCR is
>>> incompatible with nscd (see FAQ:
>>> http://mantis.lbl.gov/blcr/doc/html/FAQ.html#nscd ).
>>> If you were to perform the restart as the original user you would
>>> encounter this problem regardless of --enable-restore-ids or not. I
>>> am afraid the only known solution is to disable nscd.
>>>
>>> -Paul
>>>
>>> Ted Cabeen wrote:
>>>> I'm having problems with 0.8.0 with --enable-restore-ids. When I
>>>> try to restart a checkpointed job, I get the following error:
>>>> - open('/var/cache/nscd/passwd', 0x0) failed: -13
>>>> - mmap failed: /var/cache/nscd/passwd
>>>> - thaw_threads returned error, aborting. -13
>>>> Restart failed: Permission denied
>>>>
>>>> If I recompile 0.8.0 without restore-ids, it doesn't have this
>>>> error. I think that the problem may be that the nscd cache is
>>>> opened on behalf of the program by libc as root, but when BLCR tries
>>>> to restart the checkpointed program as the original user, it can't
>>>> open the nscd cache. Is there a way to fix this?
>>>>
>>>> --Ted
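
The open question in this thread is which process holds the other end of the pipe behind fd 16. A minimal sketch of one way to look, assuming a Linux /proc layout like the one quoted at the top of this message: scan every /proc/<pid>/fd directory and group descriptors by pipe inode. The function name `pipe_holders` and its `proc_root` parameter are made up for this illustration; nothing here is part of BLCR or torque.

```python
import os
import re
from collections import defaultdict

# Symlink targets for pipes in /proc/<pid>/fd look like "pipe:[249689]".
PIPE_RE = re.compile(r"^pipe:\[(\d+)\]$")


def pipe_holders(proc_root="/proc"):
    """Return {pipe_inode: sorted [(pid, fd), ...]} for all visible processes.

    proc_root is a hypothetical parameter, added only so the scan can be
    pointed at a different directory tree when experimenting.
    """
    holders = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed while we were scanning
            match = PIPE_RE.match(target)
            if match:
                holders[int(match.group(1))].append((int(entry), int(fd)))
    return {inode: sorted(users) for inode, users in holders.items()}
```

Calling `pipe_holders().get(249689)` while the job is running would list every (pid, fd) pair attached to that pipe. Any pid in the list that is outside the checkpointed process tree (a torque daemon, for instance) would be a candidate for the external endpoint. Seeing descriptors owned by other users generally requires running the scan as root.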