Re: Problems with --enable-restore-ids

From: Ted Cabeen (cabeen_at_chem_dot_ucsb_dot_edu)
Date: Wed Feb 18 2009 - 14:08:54 PST

    All right.  I've disabled nscd, and we're on to the next problem. 
    (Sorry I missed that note in the FAQ)  When I restart a job with 
    enable-restore-ids, I get the following error:
    - Error -13 from cr_filp_reopen() while restoring external pipe
    - cr_restore_all_files [3766]:  Unable to restore fd 16 (type=4,err=-13)
    - cr_rstrt_child [3766]:  Unable to restore files!  (err=-13)
    Restart failed: Permission denied
    My suspended jobs start fine when compiled without enable-restore-ids.
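
For readers decoding the log lines above: the `-13` is a negated errno value, which matches the "Permission denied" in the final message. A minimal sketch for looking up such codes, assuming a Linux host (the helper name `describe_kernel_error` is made up for illustration and is not part of BLCR):

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel-style return code (e.g. -13) to its errno name and message."""
    number = abs(code)
    # errno.errorcode maps numeric values to symbolic names such as "EACCES".
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    print(describe_kernel_error(-13))
```

On Linux, errno 13 is EACCES, so all three log lines report the same access failure on the reopen.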
    Paul H. Hargrove wrote:
    > Ted,
    >  You are right about the nscd cache file being opened as root (or other 
    > "system" id).  The program acquires the file descriptor via fd passing 
    > from a privileged daemon process. Since we can't safely reopen this file 
    > as the user and we are equally unable to reproduce the descriptor 
    > passing from the daemon, BLCR is incompatible with nscd (see the FAQ).
    > If you were to 
    > perform the restart as the original user you would encounter this 
    > problem regardless of --enable-restore-ids or not.  I am afraid the only 
    > known solution is to disable nscd.
    > -Paul
    > Ted Cabeen wrote:
    >> I'm having problems with 0.8.0 with --enable-restore-ids.  When I try 
    >> to restart a checkpointed job, I get the following error:
    >> - open('/var/cache/nscd/passwd', 0x0) failed: -13
    >> - mmap failed: /var/cache/nscd/passwd
    >> - thaw_threads returned error, aborting. -13
    >> Restart failed: Permission denied
    >> If I recompile 0.8.0 without restore-ids, it doesn't have this error.  
    >> I think that the problem may be that the nscd cache is opened on 
    >> behalf of the program by libc as root, but when BLCR tries to restart 
    >> the checkpointed program as the original user, it can't open the nscd 
    >> cache.  Is there a way to fix this?
    >> --Ted
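
The descriptor passing Paul describes can be illustrated with a small sketch. This is not nscd's or BLCR's code. It only shows the general Unix mechanism (SCM_RIGHTS over a Unix-domain socket), using a temporary file as a stand-in for the cache and a socket pair in one process as a stand-in for the daemon and client. It assumes Python 3.9 or later on a Unix system, and the function name `pass_descriptor_demo` is hypothetical.

```python
import os
import socket
import tempfile


def pass_descriptor_demo() -> bytes:
    # Stand-in for the daemon's cache file (hypothetical; not nscd's real file).
    with tempfile.NamedTemporaryFile(delete=False) as cache:
        cache.write(b"cached passwd entry")
        path = cache.name

    daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # "Daemon" side: open the file and send the descriptor via SCM_RIGHTS.
        daemon_fd = os.open(path, os.O_RDONLY)
        socket.send_fds(daemon_end, [b"fd"], [daemon_fd])
        os.close(daemon_fd)

        # "Client" side: obtain a usable descriptor without ever calling
        # open() on the path, so no permission check is made as the client.
        _msg, fds, _flags, _addr = socket.recv_fds(client_end, 16, 1)
        client_fd = fds[0]
        try:
            return os.read(client_fd, 64)
        finally:
            os.close(client_fd)
    finally:
        daemon_end.close()
        client_end.close()
        os.unlink(path)


if __name__ == "__main__":
    print(pass_descriptor_demo())
```

The receiving side holds an open descriptor it could not necessarily have opened itself. A restart that tries to recreate that descriptor by calling open() on the path as the unprivileged user can therefore fail with EACCES, which is consistent with the errors quoted in this thread.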

