Re: 32bit and 64bit platforms: suggestion

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 13:44:30 PDT

    Adolfo J. Banchio wrote:
    > We have here a cluster which has a mixture of ia32
    > and EM64T CPUs. And as we already know, it is
    > not possible to restart a job (even a 32-bit one)
    > that was started on a 64-bit node on a 32-bit one.
    >
    > However, even though we set a default architecture in
    > the queue system's default script to prevent jobs started
    > on one architecture from continuing on the other, one user
    > ended up restarting on the wrong architecture, producing
    > a KERNEL PANIC!!
    >
    > So my suggestion is, if possible, to prevent cr_restart
    > from proceeding if it detects that the checkpoint is from
    > a different architecture, and to deliver a corresponding
    > error message.
    >
    > We are using blcr-0.5.0_b5-1 on the 64-bit nodes and
    > blcr-0.5.0_b1-1 on the 32-bit ones, just for your information.
    >
    >
    > Best regards,
    >
    > adolfo
    >
    >
    > P.S.: Again, this is just a suggestion about a minor thing.
    >
    >
    >   
    
    IMHO a kernel panic caused by a non-root user is *not* a minor thing.
    We really could/should include an architecture identifier in the BLCR 
    file header.
    I've entered a bug report 
    (http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2020) for this issue 
    and hope to resolve it for the current 0.6.0 beta series.
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
