From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 13:44:30 PDT
Adolfo J. Banchio wrote:
> We have here a cluster which has a mixture ia32
> and EM64T cpu's. And as we already know it is
> not possible to restart a job (even a 32bit one)
> that started in an 64 bit node in a 32 bit one.
>
> However, even defining in the default script
> for the queue system a default architecture to
> prevent jobs started in one to continue in other,
> one user ended up restarting in the wrong architecture
> producing a KERNEL PANIC !!.
>
> So, my suggestion is, if possible, to prevent cr_restart
> to proceed if it realizes that the checkpoint is from
> different architecture and deliver a corresponding error
> message.
>
> We are using blcr-0.5.0_b5-1 on the 64bit nodes and
> blcr-0.5.0_b1-1 on the 32bit ones. Just for your information.
>
> Best regards,
>
> adolfo
>
> P.S.: again, this is just a suggestion, for a minor thing.
>

IMHO a kernel panic caused by a non-root user is *not* a minor thing.
We really could/should include an architecture identifier in the BLCR
file header. I've entered a bug report
(http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2020) for this issue
and hope to resolve it for the current 0.6.0 beta series.

-Paul

--
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
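
To illustrate the idea of an architecture identifier in a file header, here is a minimal sketch in Python. It is not BLCR code and does not reflect the real BLCR checkpoint format: the header layout, the magic value `CKPTDEMO`, and the function names `write_header` and `check_header` are all hypothetical, chosen only to show how a restart could be refused with a clear error before anything architecture-specific is touched.

```python
import struct

# Hypothetical layout, NOT the real BLCR checkpoint format:
# an 8-byte magic followed by a 16-byte, NUL-padded architecture name.
HEADER = struct.Struct("<8s16s")
MAGIC = b"CKPTDEMO"


def write_header(arch: str) -> bytes:
    """Pack a header recording the architecture that took the checkpoint."""
    return HEADER.pack(MAGIC, arch.encode("ascii"))


def check_header(data: bytes, host_arch: str) -> None:
    """Raise ValueError unless the header matches the restarting host."""
    if len(data) < HEADER.size:
        raise ValueError("file too short to contain a header")
    magic, raw_arch = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a checkpoint file (bad magic)")
    saved_arch = raw_arch.rstrip(b"\0").decode("ascii")
    if saved_arch != host_arch:
        raise ValueError(
            f"checkpoint taken on {saved_arch}, cannot restart on {host_arch}"
        )
```

A restart wrapper could call `check_header(data, platform.machine())` on the first bytes of the file and exit with the error message instead of proceeding. In a real fix the check would presumably have to live in the restart path itself, since a user-space wrapper can be bypassed.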