RE: Restart Failed: permission denied

From: Tom Spyrou (tspyrou_at_cadence_dot_com)
Date: Wed Mar 28 2007 - 16:19:25 PST

  • Next message: Paul H. Hargrove: "Re: Restart Failed: permission denied"
    Hi Paul,
    
    Thanks for answering!
    
    I tried running the counter example. The context is created, but when I
    do a restore I get a bad address error. I ran make check and noticed that
    the first several tests pass and then many of the restores fail. I am
    running Red Hat Enterprise 4, 2.6.2 kernel and gcc 3.6.4 and didn't see
    any compile issues. At first I used gcc 4.1.1, but then I got an error
    from dmesg during insmod and switched to 3.6.4.
    
    Any suggestions?
    
    Tom
    
    -----Original Message-----
    From: Paul H. Hargrove [mailto:PHHargrove_at_lbl_dot_gov] 
    Sent: Wednesday, March 28, 2007 5:12 PM
    To: Tom Spyrou
    Cc: checkpoint_at_lbl_dot_gov
    Subject: Re: Restart Failed: permission denied
    
    Tom,
    
      Thanks for your interest in BLCR.
    
      The first thing I note is that if you are talking about the normal
    xclock found as /usr/X11R6/bin/xclock or /usr/bin/X11/xclock on most
    systems, then it *does* open a socket used to talk to the X server.
    That alone is enough reason for the restart to fail.
    
      However, the error you are seeing is not related to that socket.
    Rather, you are encountering the evil of the Name Service Cache Daemon
    (nscd).  The issue with nscd is that it passes a file descriptor (see
    man sendmsg and/or recvmsg) from the daemon (which has permission to
    open() the file) to a client (which cannot open() the file itself).
    The client can/will then mmap() the file.  When BLCR tries to
    reestablish the mmap()s that a process had before the checkpoint, it
    fails because the nscd-passed file descriptor can't be recreated
    without help from nscd, help which nscd cannot provide during the
    restart.
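
To make the descriptor-passing mechanism concrete, here is a minimal sketch of the general technique (SCM_RIGHTS over a Unix socket, followed by mmap() on the receiving end). It is not nscd's or BLCR's actual code: the function name pass_fd_and_map is hypothetical, both "daemon" and "client" ends live in one process for brevity, and it assumes Python 3.9+ on a Unix system.

```python
import mmap
import os
import socket
import tempfile


def pass_fd_and_map(payload: bytes) -> bytes:
    """Pass an open file descriptor over a Unix socket pair, then mmap() it.

    Hypothetical illustration only: `payload` must be non-empty, and the
    temporary file stands in for a file only the "daemon" could open.
    """
    daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with daemon_end, client_end, tempfile.TemporaryFile() as f:
        f.write(payload)
        f.flush()

        # "Daemon" side: send the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(daemon_end, [b"fd"], [f.fileno()])

        # "Client" side: receive a new descriptor for the same open file.
        # The client never calls open() on the file itself.
        _msg, fds, _flags, _addr = socket.recv_fds(client_end, 16, 1)
        try:
            with mmap.mmap(fds[0], len(payload), access=mmap.ACCESS_READ) as m:
                return m[:]
        finally:
            for fd in fds:
                os.close(fd)
```

The point of the sketch is that the mapping exists only because another process handed over the descriptor; a restart that tries to recreate the mapping by opening the path again would need permissions the client may not have.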
    
      Since nscd is mostly only useful to programs that do name service
    lookups, which in turn usually implies sockets, it doesn't often create
    problems for BLCR.  However, since it can also handle/cache the getpw*()
    family of functions for glibc, one can see problems even when not using
    sockets.  You might consider removing nscd from your system to help
    avoid some future BLCR problems.
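
One way to check whether a process has picked up nscd mappings before checkpointing it is to look through its memory map. The sketch below is an assumption-laden illustration, not part of BLCR: the function name nscd_mappings is hypothetical, the default prefix /var/db/nscd/ is taken from the dmesg output quoted in this thread, and the input is expected to be text in the Linux /proc/<pid>/maps format.

```python
def nscd_mappings(maps_text, prefix="/var/db/nscd/"):
    """Return the distinct mapped paths under `prefix` found in maps text.

    `maps_text` is assumed to follow the /proc/<pid>/maps layout:
    address perms offset dev inode [pathname]
    """
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname column
        path = fields[5].strip()
        if path.startswith(prefix) and path not in found:
            found.append(path)
    return found
```

On a Linux system this could be applied to a running process by reading its maps file, for example nscd_mappings(open("/proc/self/maps").read()). An empty list suggests the process holds no mappings under that directory.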
    
      If you "make examples" in your BLCR build directory, you'll get some
    very simple/silly programs you can try checkpointing and restarting.
    
    -Paul
    
    Tom Spyrou wrote:
    > Hi,
    >  
    > I am a new user and have installed and successfully created a 
    > checkpoint of an xclock application run as a trial.
    > When I try to restart the application, I get the error in the subject 
    > and when I type dmesg I see the following errors.
    >  
    > I don't think xclock opens a socket or uses shared memory.
    >  
    > I was wondering if anyone had an idea or a better sample application 
    > that would be supported.
    >  
    > Thanks,
    >  
    > Tom
    > 
    > Skipping a socket.
    > vmadump: mmap failed: /var/db/nscd/hosts
    > thaw_threads returned error, aborting. -13
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
