From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Feb 18 2009 - 15:16:12 PST
Ted,

Thanks for your patience. The restore-ids code itself seems to be doing just what it should: dropping the root privilege before performing any fs permission checks, which prevents use of a maliciously modified checkpoint context file as a way to access otherwise inaccessible files. However, there seem to be problems with files (nscd is just one example) that were originally opened *with* some privilege. Anything set up by the batch system is a candidate for such problems.

For the new problem, I can see what is going wrong but don't know enough to suggest a solution. The term "external pipe" means that at the time the checkpoint was taken, a pipe had one endpoint within the scope of the checkpoint while the other endpoint was outside it. In a batch-scheduled environment this is often the case for the std{in,out,err} descriptors, but in your case the error says fd=16, so it must be something else.

When an "external pipe" is encountered in the context file at restart time, BLCR's behavior is to try to connect this fd to the same file as the stdin or stdout (depending on which end of the pipe is external) of the cr_restart process. In your case the user has insufficient permission to do so, most likely because cr_restart was launched as root and root owns the file (or device) that is used for std{in,out}.

Since I don't know what fd 16 was being used for, I can't be certain that connecting it to stdin or stdout is even the right thing to do. If it is the right thing, then I will need to go back to the BLCR source code and determine whether the permission checks being performed (the ones that yield the first error) are required for correctness/security. My initial thought is that if the stdin/out of cr_restart have not been marked close-on-exec, then any child it creates potentially has access to those fds as its own stdin/out, so bypassing fs permissions sounds like the right thing to do.
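As a rough illustration of the two questions raised above (what kind of object a descriptor refers to, and whether a child would inherit it across exec), here is a minimal C sketch using only standard POSIX calls. It is not BLCR code and says nothing about how BLCR performs its own checks; the helper names fd_kind and fd_is_cloexec are hypothetical and introduced only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: classify what an open descriptor refers to.
 * Returns a static string; "invalid" means fstat() failed (e.g. EBADF). */
const char *fd_kind(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return "invalid";
    if (S_ISFIFO(st.st_mode))
        return "pipe";
    if (S_ISCHR(st.st_mode))
        return "chardev";   /* terminals and other character devices */
    if (S_ISREG(st.st_mode))
        return "file";
    if (S_ISSOCK(st.st_mode))
        return "socket";
    return "other";
}

/* Hypothetical helper: report whether a descriptor is marked close-on-exec.
 * Returns 1 if FD_CLOEXEC is set, 0 if it is not (so the descriptor would
 * stay open in an exec'd child), and -1 if fcntl() failed. */
int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Called on descriptors 0 and 1 of a process like cr_restart, fd_is_cloexec would indicate whether a child could already reach them by inheritance. On Linux, the target of a descriptor in another process can also be inspected from outside by reading the symbolic link /proc/<pid>/fd/<n>.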

So, it is possible that BLCR needs to do something different with respect to permissions when reopening external pipes. I will look into that and get back to you. However, I'd appreciate it if you could also look into what fd 16 is being used for. It is entirely possible that it is something that will require a different approach.

-Paul

Ted Cabeen wrote:
> All right. I've disabled nscd, and we're on to the next problem.
> (Sorry I missed that note in the FAQ) When I restart a job with
> enable-restore-ids, I get the following error:
> - Error -13 from cr_filp_reopen() while restoring external pipe
> - cr_restore_all_files [3766]: Unable to restore fd 16 (type=4,err=-13)
> - cr_rstrt_child [3766]: Unable to restore files! (err=-13)
> Restart failed: Permission denied
>
> My suspended jobs start fine when compiled without enable-restore-ids.
>
> Thoughts?
>
> --Ted
>
> Paul H. Hargrove wrote:
>> Ted,
>>
>> You are right about the nscd cache file being opened as root (or
>> other "system" id). The program acquires the file descriptor via fd
>> passing from a privileged daemon process. Since we can't safely
>> reopen this file as the user and we are equally unable to reproduce
>> the descriptor passing from the daemon, BLCR is incompatible with
>> nscd (see FAQ: http://mantis.lbl.gov/blcr/doc/html/FAQ.html#nscd ).
>> If you were to perform the restart as the original user you would
>> encounter this problem regardless of --enable-restore-ids or not. I
>> am afraid the only known solution is to disable nscd.
>>
>> -Paul
>>
>> Ted Cabeen wrote:
>>> I'm having problems with 0.8.0 with --enable-restore-ids. When I
>>> try to restart a checkpointed job, I get the following error:
>>> - open('/var/cache/nscd/passwd', 0x0) failed: -13
>>> - mmap failed: /var/cache/nscd/passwd
>>> - thaw_threads returned error, aborting. -13
>>> Restart failed: Permission denied
>>>
>>> If I recompile 0.8.0 without restore-ids, it doesn't have this
>>> error. I think that the problem may be that the nscd cache is
>>> opened on behalf of the program by libc as root, but when BLCR tries
>>> to restart the checkpointed program as the original user, it can't
>>> open the nscd cache. Is there a way to fix this?
>>>
>>> --Ted
>>
>> --
>> Paul H. Hargrove PHHargrove_at_lbl_dot_gov
>> Future Technologies Group
>> HPC Research Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900