Re: checkpointing across different cpu's

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Apr 29 2010 - 12:44:12 PDT

    I don't know an exact or simple answer to your question, but I can list 
    some things that I *know* would prevent migration between machines:
    + No migration between 32- and 64-bit CPUs even if the process was 32-bit.
    + No migration between kernels that are "too different", though I have 
    no good definition of what that means.
    + The restart will SEGV if prelinking places shared libs at different 
    addresses on the two machines (we have a FAQ entry for this).
    + There are 2 or more different FPU state save/restore instructions 
    available for the kernel to use, depending on the generation of CPU.  I 
    don't know for certain, but I strongly suspect that state saved in the 
    checkpoint by one such instruction would not restore with a different one.
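
    One rough way to test the FPU suspicion is to compare the CPU feature
    flags that each machine reports in /proc/cpuinfo. The sketch below is
    not part of BLCR. The function names are made up for illustration, and
    treating "fxsr" and "xsave" as the flags of interest is an assumption
    about which save/restore variants matter. A clean result does not prove
    that both kernels use the same instruction; it only rules out one
    obvious difference.

```python
# Flags assumed to be tied to FPU/SIMD state save and restore on x86 Linux.
FPU_STATE_FLAGS = ("fxsr", "xsave")


def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def fpu_state_mismatch(cpuinfo_a, cpuinfo_b):
    """List the FPU-state flags that are present on exactly one host."""
    a, b = cpu_flags(cpuinfo_a), cpu_flags(cpuinfo_b)
    return sorted(f for f in FPU_STATE_FLAGS if (f in a) != (f in b))
```

    To use it, save the contents of /proc/cpuinfo from the checkpoint host
    and from the restart host, then pass both strings to
    fpu_state_mismatch(). A non-empty list names the flags that differ.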
    The last two items are my best guess since you indicate you are using 
    the same kernel.
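
    For the prelinking case, one way to look for a mismatch is to run the
    same binary on both machines and compare where each shared library is
    mapped, using /proc/<pid>/maps. This is only a sketch: the function
    names are hypothetical, it treats any path containing ".so" as a shared
    library, and it assumes that address randomization is not what is
    moving the libraries.

```python
def library_bases(maps_text):
    """Map each shared-library path to its lowest mapped start address."""
    bases = {}
    for line in maps_text.splitlines():
        # Fields: address-range, perms, offset, dev, inode, pathname
        fields = line.split(None, 5)
        if len(fields) < 6 or ".so" not in fields[5]:
            continue  # anonymous mapping or not a shared library
        start = int(fields[0].split("-")[0], 16)
        path = fields[5].strip()
        bases[path] = min(start, bases.get(path, start))
    return bases


def moved_libraries(maps_a, maps_b):
    """Return {path: (base_a, base_b)} for libraries whose base differs."""
    a, b = library_bases(maps_a), library_bases(maps_b)
    return {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]}
```

    An empty result from moved_libraries() means every library common to
    both processes is mapped at the same base address on the two hosts.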
    Good luck and please let me know if you learn anything more.
    If we can collect more info, I will update the FAQ entry about migration.
    > Hi:
    >  I've checkpointed and restarted jobs across different CPUs before. For 
    > example, I've checkpointed on an AMD processor and restarted on a Xeon 
    > processor. It does not seem to work all the time, however. I just did a 
    > checkpoint on a Xeon, tried to restart on an AMD, and got a 
    > segmentation fault while trying to restart the application.
    > My question is: under what circumstances can I restart on a different 
    > x64 CPU?
    > How can I build my code so I won't have problems with this? (Or should 
    > it be working already?)
    > I am using BLCR 0.8.0 on 2.6.9-55.ELsmp kernels.
    > With Blessings
    > and Best regards,
    > Jerry
    > 2363
    > You shall do no unrighteousness in judgment; you shall not favor the 
    > poor, nor favor the mighty; but in righteousness you shall judge your 
    > neighbor.
    > (Torah portion, Kedoshim, Leviticus 19:15)
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
