From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 28 2007 - 16:11:54 PST
Tom,

Thanks for your interest in BLCR.

The first thing I note is that if you are talking about the normal xclock found as /usr/X11R6/bin/xclock or /usr/bin/X11/xclock on most systems, then it *does* open a socket used to talk to the X server. That alone is enough reason for the restart to fail. However, the error you are seeing is not related to that socket. Rather you are encountering the evil of the Name Service Cache Daemon (nscd).

The issue with nscd is that it passes a file descriptor (man sendmsg and/or recvmsg) from the daemon (which has permissions to open() the file) to a client (which cannot open() the file itself). The client can/will then mmap() the file. When BLCR is trying to reestablish the mmap()s that a process had before the checkpoint, it fails because the nscd-passed file descriptor can't be recreated w/o help from nscd - help which it cannot provide during the restart.

Since nscd is mostly only useful to programs that do name service lookups, which in turn usually implies sockets, it doesn't often create problems for BLCR. However, since it can also handle/cache the getpw*() family of functions for glibc, one can see problems even when not using sockets. You might consider removing nscd from your system to help avoid some future BLCR problems.

If you "make examples" in your BLCR build directory, you'll get some very simple/silly programs you can try checkpointing and restarting.

-Paul

Tom Spyrou wrote:
> Hi,
>
> I am a new user and have installed and successfully created a checkpoint
> of an xclock application run as a trial.
> When I try to restart the application, I get the error in the subject
> and when I type dmesg I see the following errors.
>
> I don't think xclock opens a socket or uses shared memory.
>
> I was wondering if anyone had an idea or a better sample application
> that would be supported.
>
> Thanks,
>
> Tom
>
> Skipping a socket.
> vmadump: mmap failed: /var/db/nscd/hosts
> thaw_threads returned error, aborting. -13

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
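
For readers unfamiliar with the descriptor passing mentioned above (sendmsg/recvmsg), here is a minimal sketch of the general technique in Python. It is not nscd's actual protocol and not BLCR code. The function name pass_and_map is hypothetical, both ends of the socket live in one process for brevity, and unlinking a temporary file stands in (as an assumption) for "a file the client cannot open() itself". It needs Python 3.9 or later on a Unix system.

```python
import mmap
import os
import socket
import tempfile


def pass_and_map(payload: bytes) -> bytes:
    """Pass an open descriptor over a Unix socket pair, then mmap() it
    on the receiving end. payload must be non-empty."""
    daemon_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with daemon_end, client_end:
        # "Daemon" side: open a file and write the data to be shared.
        fd, path = tempfile.mkstemp()
        os.write(fd, payload)
        # Remove the name, so the receiver has no path it could open() itself.
        # (Illustrative stand-in for a file protected by permissions.)
        os.unlink(path)
        # Send the descriptor as SCM_RIGHTS ancillary data (sendmsg underneath).
        socket.send_fds(daemon_end, [b"x"], [fd])
        os.close(fd)

        # "Client" side: receive the descriptor (recvmsg underneath) and map it.
        _msg, fds, _flags, _addr = socket.recv_fds(client_end, 1, 1)
        received_fd = fds[0]
        try:
            with mmap.mmap(received_fd, len(payload), access=mmap.ACCESS_READ) as view:
                return bytes(view[:])
        finally:
            os.close(received_fd)
```

The mapping works only because the descriptor arrived over the socket. Anything that later has to rebuild that mapping from a path alone, with no sender to ask, has nothing to open, which is the situation described above for the restart.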