Re: Restart Failed: permission denied

From: Paul H. Hargrove (hargrove_at_hpcrd_dot_lbl_dot_gov)
Date: Wed Mar 28 2007 - 16:24:48 PST

    The insmod error w/ gcc 4.1.1 is important - the kernel is enforcing some
    ABI consistency between itself and its modules.  Your switch to gcc 3.6.4
    was the right fix.
    If you are getting "bad address" errors with BLCR 0.5.2, you may wish to
    try 0.5.1 instead.  One of the changes in 0.5.1->0.5.2 apparently traded
    one bug for another, producing "bad address" errors on some systems with
    certain libraries.  When I get around to pushing out 0.5.3 that issue
    should be resolved.
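
For reference when reading these messages: "bad address" is the usual text for errno EFAULT, and the -13 in the dmesg output quoted below is a negated EACCES, i.e. "permission denied". A minimal Python sketch for decoding such values follows. The helper name `describe_errno` is made up for illustration and is not part of BLCR, and the message text comes from the local C library, so it may differ between systems.

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style return code (e.g. -13) to (symbolic name, message).

    Kernel messages usually print the negated errno, so the sign is dropped
    before the lookup.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # -13 appears in the dmesg output; 14 is the errno behind "bad address".
    for code in (-13, -14):
        name, message = describe_errno(code)
        print(f"{code}: {name} ({message})")
```
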
    Tom Spyrou wrote:
    > Hi Paul,
    > Thanks for answering!
    > I tried running the counter example. The context is created but when I
    > do a restore I get a bad address error. I ran make check and noticed the
    > first several pass and then many of the restores fail. I am running
    > redhat enterprise 4, 2.6.2 kernel and gcc 3.6.4 and didn't see any
    > compile issues. At first I used gcc 4.1.1 but then I got an error from
    > dmesg during insmod and switched to 3.6.4.
    > Any suggestions?
    > Tom
    > -----Original Message-----
    > From: Paul H. Hargrove [mailto:PHHargrove_at_lbl_dot_gov] 
    > Sent: Wednesday, March 28, 2007 5:12 PM
    > To: Tom Spyrou
    > Cc: checkpoint_at_lbl_dot_gov
    > Subject: Re: Restart Failed: permission denied
    > Tom,
    >   Thanks for your interest in BLCR.
    >   The first thing I note is that if you are talking about the normal
    > xclock found as /usr/X11R6/bin/xclock or /usr/bin/X11/xclock on most
    > systems, then it *does* open a socket used to talk to the X server.
    > That alone is enough reason for the restart to fail.
    >   However, the error you are seeing is not related to that socket.
    > Rather you are encountering the evil of the Name Service Cache Daemon
    > (nscd).  The issue with nscd is that it passes a file descriptor (man
    > sendmsg and/or recvmsg) from the daemon (which has permissions to open()
    > the file) to a client (which cannot open() the file itself).  The client
    > can/will then mmap() the file.  When BLCR is trying to reestablish the
    > mmap()s that a process had before the checkpoint, it fails because the
    > nscd-passed file descriptor can't be recreated w/o help from nscd - help
    > which it cannot provide during the restart.
    >   Since nscd is mostly only useful to programs that do name service
    > lookups, which in turn usually implies sockets, it doesn't often create
    > problems for BLCR.  However, since it can also handle/cache the getpw*()
    > family of functions for glibc, one can see problems even when not using
    > sockets.  You might consider removing nscd from your system to help
    > avoid some future BLCR problems.
    >   If you "make examples" in your BLCR build directory, you'll get some
    > very simple/silly programs you can try checkpointing and restarting.
    > -Paul
    > Tom Spyrou wrote:
    >> Hi,
    >> I am a new user and have installed and successfully created a 
    >> checkpoint of an xclock application run as a trial.
    >> When I try to restart the application, I get the error in the subject 
    >> and when I type dmesg I see the following errors.
    >> I don't think xclock opens a socket or uses shared memory.
    >> I was wondering if anyone had an idea or a better sample application 
    >> that would be supported.
    >> Thanks,
    >> Tom
    >> Skipping a socket.
    >> vmadump: mmap failed: /var/db/nscd/hosts
    >> thaw_threads returned error, aborting. -13
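
The descriptor passing described in the quoted message (sendmsg/recvmsg carrying an open file descriptor, which the receiver then mmap()s) can be sketched in a few lines of Python. This is only an illustration of the mechanism, not nscd's or BLCR's actual code: the function name `pass_fd_and_map` is hypothetical, both "daemon" and "client" live in one process, and a temporary file stands in for the nscd cache. It assumes Python 3.9+ on a Unix system, where `socket.send_fds` and `socket.recv_fds` are available.

```python
import mmap
import os
import socket
import tempfile


def pass_fd_and_map(payload):
    """Pass an open descriptor over a Unix socket, then mmap() it.

    payload must be non-empty, since a zero-length file cannot be mapped.
    Returns the bytes read back through the mapping.
    """
    daemon_sock, client_sock = socket.socketpair(socket.AF_UNIX,
                                                 socket.SOCK_STREAM)
    with daemon_sock, client_sock, tempfile.TemporaryFile() as cache:
        # "Daemon" side: it is the one that open()ed the file.
        cache.write(payload)
        cache.flush()
        # Send the descriptor as SCM_RIGHTS ancillary data with one data byte.
        socket.send_fds(daemon_sock, [b"x"], [cache.fileno()])

        # "Client" side: receives a new descriptor for the same open file.
        # It never calls open() on a path itself.
        _data, fds, _flags, _addr = socket.recv_fds(client_sock, 1024, 1)
        received_fd = fds[0]
        try:
            with mmap.mmap(received_fd, len(payload),
                           access=mmap.ACCESS_READ) as view:
                return view[:]
        finally:
            os.close(received_fd)


if __name__ == "__main__":
    print(pass_fd_and_map(b"stand-in for a cached hosts table"))
```

In the real case the client lacks permission to open() the cache file by path, so a mapping obtained this way cannot be rebuilt at restart without the daemon handing over a descriptor again. That fits the -13 (EACCES) reported for /var/db/nscd/hosts above.
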
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
