From: Ladislav Subr (subr-blcr_at_sirrah.troja.mff.cuni.cz)
Date: Sun Mar 26 2006 - 07:35:55 PST
Hello,

I'm using BLCR to migrate processes among the nodes of a cluster of PCs. The system works quite well, but not flawlessly: I see different problems on the i386 and x86_64 clusters. In both cases the problems occur quite rarely, and I have never managed to trigger them on demand.

On the 32-bit cluster the problem is erroneous writing to open files. It definitely has something to do with caching. The typical chronology is as follows:

1) The process runs on node A, has several files open, and writes to them.
2) The process is checkpointed (with signal 9) and restarted on node B; it writes more data to the open files and everything is OK.
3) The process is checkpointed and restarted back on node A. The data files are still OK.
4) The process writes to the files on node A. It writes to the correct positions, but all data written on node B are lost, i.e. they are replaced with zeros.

As I wrote above, the problem occurs rather rarely, and analogous migrations work fine in most cases. Typically there are several steps (2) in between, i.e. the process migrates over more nodes before it gets back to A; in that case all data written on nodes B, C, etc. are lost. The program that fails has several files of a few kB open, plus one of a few MB. The latter has never been corrupted. Usually several of the smaller files are corrupted at the same time, but that is not a rule.

I have switched on blcr module tracing, but haven't seen anything suspicious related to these failures. I have also modified the logging a bit, so I can tell that the 'struct file' members are f_flags=0x8002 and f_mode=0x0f (I don't know whether that is important). I was willing to alter the code to open the files in write-only mode, but that does not seem to be possible (it is F77 code, not my own). All nodes see an identical NFS-mounted filesystem (a headnode plus several diskless nodes). Currently I'm running vanilla kernel 2.6.11.12 with BLCR 0.4.2 on SUSE 9.2.
My feeling (and it is nothing more than a feeling) is that the failures were more frequent on the previous distro, SUSE 9.0. Another feeling is that the frequency of the errors tends to increase, but that may be due to varying conditions on the cluster. After the upgrade it ran two weeks flawlessly before the problem occurred; after reloading the modules it took several days until the error occurred again...

Regarding the x86_64 cluster (five nodes of dual Opterons) -- I worked on it intensively quite a while ago, so I don't remember all the details now. It seems that processes that were at some point manipulated with the cr_checkpoint and cr_restart utilities somehow interfere with each other. One error is that at the moment some process is restarted on a node, another one (restarted there several minutes before) crashes, with a segfault message in the kernel log:

Mar 24 17:30:04 t4 kernel: yorprot_055c_di[18502]: segfault at fffffff400506030 rip 0000000000404260 rsp 00007ffffffff120 error 4

Sometimes cr_checkpoint produces a process image that cr_restart restores without complaint, but which immediately crashes with a segmentation fault (a log similar to the one above). I could probably find some checkpoints of that kind. Sometimes it also shoots down another process that has nothing in common with the failing one, except that it was also migrated some time before, i.e. it has /proc/checkpoint/ctrl open. Alternatively, a process crashes with a similar segfault at a moment when no migrations were being performed -- as if its RSP had been wrongly set up at the previous restart.

The 64-bit cluster runs SUSE 9.2 64-bit, vanilla kernel 2.6.11.9, and BLCR 0.4.2. If I remember well, I described this problem some time ago, before the release of BLCR 0.4.2. With this version the error is less frequent than with the previous betas, but it still sometimes occurs. On x86_64 I haven't observed the problems that occur on i386, and vice versa.
But again, that may be due to different conditions on the clusters, i.e. different codes that the users run there. I'm only fairly convinced that the x86_64 problem is specific to that architecture. I'm afraid my 'bug report' is a bit chaotic; sorry for that. Please let me know if you have any suggestions on what to try or log... Currently the i386 cluster is the more important one for me (I could eventually switch the 64-bit one to 32-bit mode, but not vice versa).

Best regards
Ladislav