Problems with file cache?

From: Ladislav Subr (subr-blcr_at_sirrah.troja.mff.cuni.cz)
Date: Sun Mar 26 2006 - 07:35:55 PST

    Hello,
    
    I'm using BLCR to migrate processes among the nodes of a PC cluster. The
    system works quite well, but not flawlessly. I see different problems
    on the i386 and x86_64 clusters. In both cases the problems occur quite rarely,
    and I have never managed to reproduce them on demand.
    
    On the 32-bit cluster the problem is erroneous writing to open files.
    It definitely has something to do with caching. The typical sequence of
    events is as follows:
    
    1) the process is running on node A, has several files open, and writes to them
    2) the process is checkpointed (with signal 9) and restarted on node B; it
    writes further data to the open files and everything is OK
    3) the process is checkpointed and restarted on node A. The data files are still OK
    4) the process writes to the files on node A -- it writes to the correct
    positions, but all data written on node B are lost, which means that they are
    replaced with zeros.
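
One way to catch the corruption described in step 4 soon after it happens is to scan the data files for long runs of zero bytes after each migration. The following is only a minimal sketch: the function name `find_zero_runs` and the 64-byte threshold are illustrative assumptions, and files that legitimately contain zero-filled regions would produce false positives.

```python
def find_zero_runs(path, min_len=64, chunk_size=1 << 16):
    """Return (offset, length) pairs for runs of NUL bytes of at least min_len.

    min_len and chunk_size are illustrative defaults, not tuned values.
    """
    runs = []
    run_start = None  # file offset where the current zero run began
    offset = 0        # file offset of the first byte of the current chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for i, byte in enumerate(chunk):
                if byte == 0:
                    if run_start is None:
                        run_start = offset + i
                elif run_start is not None:
                    length = offset + i - run_start
                    if length >= min_len:
                        runs.append((run_start, length))
                    run_start = None
            offset += len(chunk)
    # Close a run that extends to the end of the file.
    if run_start is not None and offset - run_start >= min_len:
        runs.append((run_start, offset - run_start))
    return runs
```

Running it on every open data file right after each restart, and comparing the result with the previous run, would at least narrow down on which node and at which step the zeros first appear.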
    
    As I wrote above, the problem occurs rather rarely, and analogous
    migrations work fine in most cases. Typically, step (2) is repeated several
    times in between, i.e. the process migrates over more nodes before it gets
    back to A. In that case, all data written on nodes B, C, etc. are lost. The
    program that fails has several files of a few kB each open, plus one of a
    few MB. The latter has never been corrupted. Usually several of the smaller
    files are corrupted at the same time, but not always. I have switched on BLCR
    module tracing, but haven't seen anything suspicious with regard to these
    failures. I have also modified the logging a bit, so I can tell that the
    'struct file' members are f_flags=0x8002 and f_mode=0x0f (I don't know
    whether this matters). I wanted to alter the code to open the files in
    write-only mode, but that does not seem to be possible (it is F77 code, not
    my own).
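
The logged values can be decoded into symbolic names. The sketch below assumes the constants from the i386 Linux 2.6 headers (O_* from the i386 fcntl.h, FMODE_* from include/linux/fs.h); the numeric values are written from memory and differ on other architectures, so they should be checked against the headers of the running kernel. The helper names are hypothetical.

```python
# Assumed i386 Linux 2.6 values; verify against the kernel headers in use.
O_ACCMODE = 0o3
ACCESS_MODES = {0o0: "O_RDONLY", 0o1: "O_WRONLY", 0o2: "O_RDWR"}
OPEN_FLAGS = {
    0o100: "O_CREAT",
    0o200: "O_EXCL",
    0o1000: "O_TRUNC",
    0o2000: "O_APPEND",
    0o4000: "O_NONBLOCK",
    0o10000: "O_SYNC",
    0o100000: "O_LARGEFILE",
}
# In 2.6.x FMODE_PWRITE is assumed to share its bit with FMODE_PREAD.
FMODE_BITS = {
    0x1: "FMODE_READ",
    0x2: "FMODE_WRITE",
    0x4: "FMODE_LSEEK",
    0x8: "FMODE_PREAD",
}


def decode_bits(value, table):
    """Name every known bit set in value; report unknown bits as one hex string."""
    names = [name for bit, name in sorted(table.items()) if value & bit]
    leftover = value & ~sum(table)  # keys are distinct bits, so sum == OR
    if leftover:
        names.append(hex(leftover))
    return names


def decode_f_flags(f_flags):
    access = ACCESS_MODES.get(f_flags & O_ACCMODE, "invalid access mode")
    return [access] + decode_bits(f_flags & ~O_ACCMODE, OPEN_FLAGS)


def decode_f_mode(f_mode):
    return decode_bits(f_mode, FMODE_BITS)


print(decode_f_flags(0x8002))
print(decode_f_mode(0x0F))
```

Under these assumptions, f_flags=0x8002 reads as O_RDWR plus O_LARGEFILE, and f_mode=0x0f as an ordinary read/write, seekable file, so neither value looks unusual by itself.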
    
    All nodes see an identical NFS-mounted filesystem (a head node plus several
    diskless nodes). Currently I'm running vanilla kernel 2.6.11.12 with BLCR
    0.4.2 on SUSE 9.2. My feeling (and nothing more than a feeling) is that the
    failures were more frequent on the previous distro, SUSE 9.0. Another
    feeling is that the errors are becoming more frequent over time. But
    that may be due to varying conditions on the cluster. After the upgrade, it
    ran flawlessly for two weeks before the problem occurred; after reloading
    the modules it took several days until the error occurred again...
    
    
    Regarding the x86_64 cluster (five nodes of dual Opterons) -- I worked on
    it intensively some time ago, so I don't remember all the
    details now. It seems that processes that have at some point been handled
    with the cr_checkpoint and cr_restart utilities somehow interfere with each
    other. One error that occurs is that at the moment when some process is
    restarted on a node, another one (restarted there several minutes earlier)
    crashes with a segfault message in the kernel log:
    
    Mar 24 17:30:04 t4 kernel: yorprot_055c_di[18502]: segfault at 
    fffffff400506030 rip 0000000000404260 rsp 00007ffffffff120 error 4
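
The fields of such a log line can be pulled apart mechanically. The sketch below assumes the message format shown above, treats all numbers (including the error code) as hexadecimal, and interprets the error code with the standard x86 page-fault error-code bits. The function names are hypothetical.

```python
import re

# Matches the x86_64 "segfault at" kernel message as it appears in the log above.
# \s+ also bridges the line break introduced by mail wrapping.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at\s+(?P<addr>[0-9a-f]+)"
    r"\s+rip\s+(?P<rip>[0-9a-f]+)\s+rsp\s+(?P<rsp>[0-9a-f]+)"
    r"\s+error\s+(?P<err>[0-9a-f]+)"
)


def decode_error_code(err):
    """Interpret the x86 page-fault error-code bits."""
    return {
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "reserved_bit": bool(err & 8),
        "instruction_fetch": bool(err & 16),
    }


def parse_segfault(line):
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
        "error": decode_error_code(err),
        # Upper canonical half of the 48-bit x86_64 address space.
        "addr_in_kernel_half": addr >= 0xFFFF800000000000,
        "addr_low32": addr & 0xFFFFFFFF,
    }


line = ("Mar 24 17:30:04 t4 kernel: yorprot_055c_di[18502]: segfault at \n"
        "fffffff400506030 rip 0000000000404260 rsp 00007ffffffff120 error 4")
print(parse_segfault(line))
```

For the line above this reports a user-mode read of a non-present page at an address in the kernel half of the address space. The low 32 bits of that address are 0x00506030; whether the upper half was corrupted during restart is only a guess that such decoding might help to test.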
    
    Sometimes cr_checkpoint produces a process image that cr_restart accepts
    without complaint, but the process immediately crashes with a segmentation
    fault (similar log to the one above). I could probably find some checkpoints
    of that kind. Sometimes it also takes down another process that has nothing
    in common with the failing one, except that it was also migrated some time
    before, i.e. it has /proc/checkpoint/ctrl open.
    
    Alternatively, the process crashes with a similar segfault at a moment when
    no migrations are being performed -- as if its RSP had been set up wrongly
    at the previous restart.
    
    The 64-bit cluster runs 64-bit SUSE 9.2, vanilla kernel 2.6.11.9 and BLCR
    0.4.2.
    
    If I remember correctly, I described this problem some time ago -- before
    the release of BLCR 0.4.2. With this version the error is less frequent than
    with the previous betas, but it still occurs sometimes.
    
    
    On x86_64 I haven't observed the problems that occur on i386, and vice versa.
    But again, that may be due to different conditions on the clusters, i.e.
    different codes that the users are running there. The only thing I am fairly
    convinced of is that the x86_64 problem is specific to that architecture.
    
    I'm afraid my 'bug report' is a bit chaotic, sorry for that. Please let
    me know if you have any suggestions for what to try or log... Currently
    the i386 cluster is more important to me (I can switch the 64-bit
    one to 32-bit mode if necessary, but not vice versa).
    
    Best regards
    
    	Ladislav
    
