Re: Problems with --enable-restore-ids

From: Ted Cabeen (cabeen_at_chem_dot_ucsb_dot_edu)
Date: Thu Feb 19 2009 - 10:47:33 PST

  • Next message: Eric Roman: "Re: Problems with --enable-restore-ids"
    In this case, I am running BLCR with torque, so I don't have direct 
    access to exactly what filehandles torque has open.  Looking in the 
    /proc filesystem while the job is running (not checkpointed), there are 
    three processes, all of which have fd 16 pointing at the same pipe:
    29301/fd/16 -> pipe:[249689]
    29302/fd/16 -> pipe:[249689]
    29304/fd/16 -> pipe:[249689]
    
    Those three processes map to the user's shell, the copy of sh running 
    the user's job script, and the active process of the job (in this case, 
    just a sleep for testing).  Is that helpful?
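
For anyone repeating this check: the grouping above can be done by reading the /proc/<pid>/fd/<n> symlinks and comparing pipe inode numbers. A minimal Python sketch, assuming a Linux /proc filesystem; the function names are made up for illustration and are not part of BLCR or torque:

```python
import os
import re
from collections import defaultdict

# Matches the symlink target the kernel reports for a pipe, e.g. "pipe:[249689]".
_PIPE_RE = re.compile(r"^pipe:\[(\d+)\]$")


def parse_pipe_inode(target):
    """Return the pipe inode number from a /proc fd link target, or None."""
    match = _PIPE_RE.match(target)
    return int(match.group(1)) if match else None


def group_by_pipe(targets):
    """Group (pid, fd) pairs by pipe inode.

    `targets` maps (pid, fd) -> link target string; non-pipe targets are skipped.
    """
    groups = defaultdict(list)
    for key, target in sorted(targets.items()):
        inode = parse_pipe_inode(target)
        if inode is not None:
            groups[inode].append(key)
    return dict(groups)


def read_fd_targets(pids, fd):
    """Read /proc/<pid>/fd/<fd> for each pid (Linux only; needs permission)."""
    targets = {}
    for pid in pids:
        try:
            targets[(pid, fd)] = os.readlink(f"/proc/{pid}/fd/{fd}")
        except OSError:
            pass  # process gone, fd not open, or access denied
    return targets


if __name__ == "__main__":
    # Values copied from the /proc listing in this message.
    listing = {
        (29301, 16): "pipe:[249689]",
        (29302, 16): "pipe:[249689]",
        (29304, 16): "pipe:[249689]",
    }
    print(group_by_pipe(listing))
```

The other end of the pipe would show up as the same `pipe:[...]` target under some other process's fd directory; reading that directory for processes owned by another user typically requires root.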
    
    --Ted
    
    
    Paul H. Hargrove wrote:
    > Ted,
    >  Thanks for your patience.  The restore-ids code itself seems to be 
    > doing just what it should: dropping the root privilege before performing 
    > any fs permission checks, preventing use of a maliciously modified 
    > checkpoint context file as a way to access otherwise inaccessible 
    > files.  However, there seem to be problems with files (nscd is just one 
    > example) that were originally opened *with* some privilege.  Anything 
    > set up by the batch system is a candidate for such problems.
    > 
    >  For the new error, I can see the problem but don't know enough to 
    > suggest a solution.
    >  The term "external pipe" means that at the time the checkpoint was 
    > taken there was a pipe that had only one endpoint within the scope of 
    > the checkpoint, while the other was not.  In a batch scheduled 
    > environment this is often the case for the std{in,out,err} descriptors, 
    > but in your case the error says fd=16, so it must be something else.
    >  When an "external pipe" is encountered in the context file at restart 
    > time, BLCR's behavior is to try to connect this fd to the same file as the 
    > stdin or stdout (depending on which end of the pipe is external) of the 
    > cr_restart process.  In your case the user has insufficient permission 
    > to do so, most likely because the cr_restart was launched as root and 
    > root owns the file (or device) that is used for std{in,out}.
    > 
    >  Since I don't know what fd 16 was being used for,  I can't be certain 
    > that connecting it to stdin or stdout is even the right thing to do.  If 
    > it is the right thing, then I will need to go back to looking at the 
    > BLCR source code and determine if the permission checks being performed 
    > (the ones that yield the first error) are required for 
    > correctness/security.  My initial thought is that if the stdin/out of 
    > cr_restart have not been marked close-on-exec, then any child it creates 
    > potentially has access to those fds as its own stdin/out and bypassing 
    > fs permissions sounds like the right thing to do.
    > 
    >  So, it is possible that BLCR needs to be doing something different with 
    > respect to the permissions when reopening external pipes.  I will look 
    > into that and get back to you.  However, I'd appreciate it if you could 
    > also look into what fd 16 is being used for.  It is entirely possible 
    > that it is something that will require a different approach.
    > 
    > -Paul
    > 
    > Ted Cabeen wrote:
    >> All right.  I've disabled nscd, and we're on to the next problem. 
    >> (Sorry I missed that note in the FAQ)  When I restart a job with 
    >> enable-restore-ids, I get the following error:
    >> - Error -13 from cr_filp_reopen() while restoring external pipe
    >> - cr_restore_all_files [3766]:  Unable to restore fd 16 (type=4,err=-13)
    >> - cr_rstrt_child [3766]:  Unable to restore files!  (err=-13)
    >> Restart failed: Permission denied
    >>
    >> My suspended jobs start fine when compiled without enable-restore-ids.
    >>
    >> Thoughts?
    >>
    >> --Ted
    >>
    >> Paul H. Hargrove wrote:
    >>> Ted,
    >>>
    >>>  You are right about the nscd cache file being opened as root (or 
    >>> other "system" id).  The program acquires the file descriptor via fd 
    >>> passing from a privileged daemon process. Since we can't safely 
    >>> reopen this file as the user and we are equally unable to reproduce 
    >>> the descriptor passing from the daemon, BLCR is incompatible with 
    >>> nscd (see FAQ:  http://mantis.lbl.gov/blcr/doc/html/FAQ.html#nscd ).  
    >>> If you were to perform the restart as the original user you would 
    >>> encounter this problem with or without --enable-restore-ids.  I 
    >>> am afraid the only known solution is to disable nscd.
    >>>
    >>> -Paul
    >>>
    >>> Ted Cabeen wrote:
    >>>> I'm having problems with 0.8.0 with --enable-restore-ids.  When I 
    >>>> try to restart a checkpointed job, I get the following error:
    >>>> - open('/var/cache/nscd/passwd', 0x0) failed: -13
    >>>> - mmap failed: /var/cache/nscd/passwd
    >>>> - thaw_threads returned error, aborting. -13
    >>>> Restart failed: Permission denied
    >>>>
    >>>> If I recompile 0.8.0 without restore-ids, it doesn't have this 
    >>>> error.  I think that the problem may be that the nscd cache is 
    >>>> opened on behalf of the program by libc as root, but when BLCR tries 
    >>>> to restart the checkpointed program as the original user, it can't 
    >>>> open the nscd cache.  Is there a way to fix this?
    >>>>
    >>>> --Ted
    >>>
    >>>
    > 
    > 
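
On Paul's point in the quoted message about close-on-exec: whether a descriptor such as stdin or stdout would survive an exec, and so be inherited by a child, can be checked with fcntl. A small Python sketch of that check, not BLCR code; `is_close_on_exec` is an illustrative helper name:

```python
import fcntl
import os


def is_close_on_exec(fd):
    """Return True if FD_CLOEXEC is set on fd, i.e. exec() would close it."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    return bool(flags & fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    # Report the flag for this process's own stdin and stdout.
    for name, fd in (("stdin", 0), ("stdout", 1)):
        try:
            print(name, "close-on-exec:", is_close_on_exec(fd))
        except OSError as exc:
            print(name, "not open:", exc)
```

If the flag is clear, a child started with fork and exec already receives the descriptor without any filesystem permission check, which is the situation Paul describes for cr_restart's stdin/stdout.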
    
