From: Ladislav Subr (subr-blcr_at_sirrah.troja.mff.cuni.cz)
Date: Sun Mar 26 2006 - 07:35:55 PST
Hello,

I'm using BLCR to migrate processes among the nodes of a cluster of PCs. The system works quite well, but not flawlessly: I see different problems on the i386 and x86_64 clusters. In both cases the problems occur quite rarely, and I have never managed to trigger them on demand.

On the 32-bit cluster the problem is erroneous writing to open files. It definitely has something to do with caching. The typical chronology is as follows:

1) The process runs on node A, has several files open, and writes to them.
2) The process is checkpointed (with signal 9) and restarted on node B; it writes more data to the open files and everything is OK.
3) The process is checkpointed and restarted back on node A. The data files are still OK.
4) The process writes to the files on node A. It writes to the correct positions, but all data written on node B are lost, i.e. they are replaced with zeros.

As I wrote above, the problem occurs rather rarely, and analogous migrations work fine in most cases. Typically there are several steps (2) in between, i.e. the process migrates over more nodes before it gets back to A; in that case all data written on nodes B, C, etc. are lost. The program that fails has several files of a few kB open, plus one of a few MB. The latter has never been corrupted. Usually several of the smaller files are corrupted at the same time, but that is not a rule.

I have switched on blcr module tracing, but haven't seen anything suspicious related to these failures. I have also modified the logging a bit, so I can tell that the 'struct file' members are f_flags=0x8002 and f_mode=0x0f (I don't know whether that is important). I was willing to alter the code to open the files in write-only mode, but that does not seem to be possible (it is F77 code, not my own). All nodes see an identical NFS-mounted filesystem (a headnode plus several diskless nodes). Currently I'm running vanilla kernel 2.6.11.12 with BLCR 0.4.2 on SUSE 9.2.
My feeling (and it is nothing more than a feeling) is that the failures were more frequent on the previous distro, SUSE 9.0. Another feeling is that the frequency of the errors tends to increase, but that may be due to varying conditions on the cluster. After the upgrade it ran two weeks flawlessly before the problem occurred; after reloading the modules it took several days until the error occurred again...

Regarding the x86_64 cluster (five nodes of dual Opterons) -- I worked on it intensively quite a while ago, so I don't remember all the details now. It seems that processes that were at some point manipulated with the cr_checkpoint and cr_restart utilities somehow interfere with each other. One error is that at the moment some process is restarted on a node, another one (restarted there several minutes before) crashes, with a segfault message in the kernel log:

Mar 24 17:30:04 t4 kernel: yorprot_055c_di[18502]: segfault at fffffff400506030 rip 0000000000404260 rsp 00007ffffffff120 error 4

Sometimes cr_checkpoint produces a process image that cr_restart restores without complaint, but which immediately crashes with a segmentation fault (a log similar to the one above). I could probably find some checkpoints of that kind. Sometimes it also shoots down another process that has nothing in common with the failing one, except that it was also migrated some time before, i.e. it has /proc/checkpoint/ctrl open. Alternatively, a process crashes with a similar segfault at a moment when no migrations were being performed -- as if its RSP had been wrongly set up at the previous restart.

The 64-bit cluster runs SUSE 9.2 64-bit, vanilla kernel 2.6.11.9, and BLCR 0.4.2. If I remember well, I described this problem some time ago, before the release of BLCR 0.4.2. With this version the error is less frequent than with the previous betas, but it still sometimes occurs. On x86_64 I haven't observed the problems that occur on i386, and vice versa.
But again, that may be due to different conditions on the clusters, i.e. different codes that the users run there. I'm only fairly convinced that the x86_64 problem is specific to that architecture. I'm afraid my 'bug report' is a bit chaotic; sorry for that. Please let me know if you have any suggestions on what to try or log... Currently the i386 cluster is the more important one for me (I could eventually switch the 64-bit one to 32-bit mode, but not vice versa).

Best regards
Ladislav