From: Ted Cabeen (cabeen_at_chem_dot_ucsb_dot_edu)
Date: Wed Feb 18 2009 - 14:08:54 PST
All right. I've disabled nscd, and we're on to the next problem. (Sorry I
missed that note in the FAQ.)

When I restart a job with --enable-restore-ids, I get the following error:

- Error -13 from cr_filp_reopen() while restoring external pipe
- cr_restore_all_files [3766]: Unable to restore fd 16 (type=4,err=-13)
- cr_rstrt_child [3766]: Unable to restore files! (err=-13)
Restart failed: Permission denied

My suspended jobs restart fine when BLCR is compiled without
--enable-restore-ids. Thoughts?

--Ted

Paul H. Hargrove wrote:
> Ted,
>
> You are right about the nscd cache file being opened as root (or other
> "system" id). The program acquires the file descriptor via fd passing
> from a privileged daemon process. Since we can't safely reopen this file
> as the user and we are equally unable to reproduce the descriptor
> passing from the daemon, BLCR is incompatible with nscd (see FAQ:
> http://mantis.lbl.gov/blcr/doc/html/FAQ.html#nscd ). If you were to
> perform the restart as the original user you would encounter this
> problem regardless of --enable-restore-ids or not. I am afraid the only
> known solution is to disable nscd.
>
> -Paul
>
> Ted Cabeen wrote:
>> I'm having problems with 0.8.0 with --enable-restore-ids. When I try
>> to restart a checkpointed job, I get the following error:
>> - open('/var/cache/nscd/passwd', 0x0) failed: -13
>> - mmap failed: /var/cache/nscd/passwd
>> - thaw_threads returned error, aborting. -13
>> Restart failed: Permission denied
>>
>> If I recompile 0.8.0 without restore-ids, it doesn't have this error.
>> I think that the problem may be that the nscd cache is opened on
>> behalf of the program by libc as root, but when BLCR tries to restart
>> the checkpointed program as the original user, it can't open the nscd
>> cache. Is there a way to fix this?
>>
>> --Ted
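
For readers unfamiliar with the "fd passing" Paul mentions, here is a minimal sketch of the general mechanism (SCM_RIGHTS over a Unix-domain socket). It is not nscd's or BLCR's actual code. The names send_fd, recv_fd and demo are hypothetical, a temporary file stands in for the cache file, and a single process plays both the daemon and the client. The point it illustrates is that the receiver ends up holding a working descriptor for a file it never opened itself, which is why a later open() by path as an unprivileged user can fail with error 13 (EACCES, "Permission denied").

```python
import array
import os
import socket
import tempfile


def send_fd(sock, fd):
    """Send one open file descriptor over a Unix-domain socket (SCM_RIGHTS)."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent by send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial item before decoding.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no descriptor received")


def demo():
    # One process stands in for both sides; a real daemon would be separate.
    daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        with tempfile.TemporaryFile() as cache:  # stand-in for the cache file
            cache.write(b"cached passwd data")
            cache.flush()
            send_fd(daemon_end, cache.fileno())
            received = recv_fd(client_end)
        # The sender's handle is closed now; the received descriptor still works.
        try:
            return os.pread(received, 100, 0)
        finally:
            os.close(received)
    finally:
        daemon_end.close()
        client_end.close()
```

Calling demo() returns the bytes written by the "daemon" side, read through the descriptor the "client" side received rather than opened. A checkpoint/restart tool sees only the resulting descriptor and the file's path, so it has no way to replay the original hand-off.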