Checkpoint/Restart in an MPI Environment
Meeting Notes - LBNL / IU
March 13 - 15, 2002

Quick Note:
-----------

This is more of a brain dump on my part than anything else.  I've tried
to stay within the bounds of what was discussed, but am also trying to
get as much on paper (ok, in a text file) as possible.  If I wander too
far, feel free to slap me back into shape in your comments on all
this...

Overview:
---------

LBNL is interested in checkpoint/restart systems for Linux clusters.
They currently have a mostly working beta implementation of a
checkpoint system for a single job.  They plan to have a fully
functional first take at a checkpoint/restart (CR) system by the summer
of 2002.  Parallel applications, however, are going to be much more
difficult - hence the collaboration with us (Indiana University).

The ultra-high-level overview of this project is simple.  We get
notified that we need to be checkpointed.  We get LAM ready to be
checkpointed.  We tell someone we are ready to be checkpointed.  We get
checkpointed.  We might continue after the checkpoint.  At some random
point in time and space, we might get restarted.

The process of getting ready to be checkpointed consists of a couple of
things:

* Making sure there is no data "in flight" that will get lost if we die
  after the checkpoint
* Ensuring all needed state is somewhere we can get at it later (ie,
  not in some chunk of code that won't be checkpointed)
* Making sure everyone is in the same state when we tell the system
  "Go"

The process of restarting will require a couple of things from LAM:

* Any sockets we have are gone.  We have to re-initialize them
* We've probably moved, so we are going to need to ask the lamd who
  else is out there
* So, we are pretty much left with MPI_INIT all over again.  Great,
  isn't it?

LBNL's CR System:
-----------------

This is from memory, so it is hopefully correct.  Please fix any
mistakes.

* Overview: The CR system is intended to provide sysadmins with an easy
  way to checkpoint and restart processes at any time for any reason.
  The common usage cases are load balancing, queue re-ordering and
  prioritizing (for example, a queue that only runs at night), and
  system maintenance.  While reliability in terms of recovering from a
  crash is an interesting side effect, it is not the main goal of the
  project (because Linux machines, when configured to look like the
  T3E, will never crash).  It is assumed that when a process is
  restarted, it will be restarted either by the user or (the more
  probable case) by the system's batch scheduling system.  Therefore,
  the CR system can require some special steps to get the application
  up and running again after a restart.

* Details: The notification mechanism will work like a Unix signal (the
  first version might use Unix signals, but this will change in future
  versions).  The user will register a handler (function prototype
  available in a header file) that will be called whenever the process
  is to be checkpointed.  Upon leaving the function, the process will
  be checkpointed.  While in the handler, the function must conform to
  all the requirements placed on Unix signal handlers.  The current
  requirements documentation does not specify behavior in a parallel
  environment - only that the infrastructure and mechanisms will be the
  same as for the single-job case.  They will add code for us, but
  there are limits ;).
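  Just to make the handler mechanism a little more concrete, here's a
  rough sketch of what user code might look like.  The actual header
  wasn't on the table at the meeting, so cr_register_handler() (and
  everything else in here) is a placeholder I made up, not the real
  LBNL interface:

      #include <unistd.h>

      /* Placeholder prototype -- the real one comes from LBNL's header. */
      extern void cr_register_handler(void (*handler)(void));

      /* Called when the runtime system wants to checkpoint us.  It has
       * to obey the same restrictions as a Unix signal handler; when it
       * returns, the process image gets written out.                    */
      static void my_checkpoint_handler(void)
      {
          static const char msg[] = "checkpoint requested\n";
          (void) write(STDERR_FILENO, msg, sizeof(msg) - 1);  /* signal-safe */
      }

      int main(void)
      {
          cr_register_handler(my_checkpoint_handler);

          /* ... normal work; a checkpoint can happen at any time ... */
          return 0;
      }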
Parallel CR System:
-------------------

The question is: How do we take the LBNL CR system and extend it to
parallel jobs?  The discussion below is more "How do we run LAM/MPI
jobs under the CR system", and many LAM/MPI details will be discussed.
While the discussion at hand is mainly about implementing an MPI CR
system within LAM/MPI, we should keep a do-no-harm philosophy.

* Terminology: This is more to keep the LAM guys straight, as we tend
  to get lazy with some of our terminology from time to time :).
  "Runtime environment" refers to the environment under which LAM and
  the user application are running.  This would include the OS, a batch
  scheduler, the LBNL CR stuff, etc.  The LAM Runtime Environment (or
  LAM RTE) is the set of LAM daemons (lamds) that allow LAM/MPI
  applications to execute.

* Scope: Initially, the scope of the project will be to checkpoint and
  restart an MPI-1 application.  The long-term goal will be to
  checkpoint and restart an MPI-2 application.  We should maintain a
  "do no harm" approach, but there is no expectation that an MPI-2
  application will behave properly under the initial release of a CR
  system.  The comment was made (by me) that the MPI Forum didn't try
  to solve the world's problems in one standard, so why should we?  I
  think this is a fairly valid argument and keeps the scope to
  something reasonable.  That being said, we will probably get a lot of
  MPI-2 "for free" in getting MPI-1 to work.  The notable exception, of
  course, is dynamic process control.

* Checkpoint Interface: The first question becomes how the runtime
  system is going to notify a parallel job that it will be
  checkpointed.  There are two separate issues that should be examined
  when developing a plan: first, the "expected behavior" of an MPI
  application; second, the hell that is running on a signal stack.

  Expected behavior of an MPI application is a tricky thing.  Each
  process in an MPI job is going to have to dump a separate image.
  Yet, the processes are not completely separate and must coordinate
  with one another in order to checkpoint properly.  From a systems
  administration and user point of view, it seems like the "right
  thing" is to notify the job as a whole to checkpoint itself.  How
  exactly you signal a group of processes running on separate nodes is
  a bit of a problem.  The most elegant solution seems to be signaling
  MPIRUN that it should really think about checkpointing the
  application.

  Signaling MPIRUN, which then gets all the MPI processes to
  checkpoint, is also very attractive from a "signal stack hell" point
  of view.  One of the big requirements on the MPI application is going
  to be to get it to "shut up" - get all the communication finished so
  that we can checkpoint without losing in-flight data.  This would be
  nearly impossible in signal handler context, as we can't communicate
  over any of our sockets.  MPIRUN actually has very few communication
  requirements during execution (stdio forwarding occurs "around"
  mpirun, so we don't have to worry about any of that).  Using the
  critical section behavior that LBNL is looking to implement, we can
  probably even get the process to the point where we can talk to the
  lamd however we want - we should be able to guarantee that the socket
  is clear when we enter the handler.

  So, signaling MPIRUN and having it coordinate everything seems to
  give us the expected user behavior and is the easiest thing to
  implement in terms of avoiding "signal stack hell".  It seems that
  everyone was in agreement on this.  The exact details of how to make
  sure that the other MPI processes don't get signaled were not worked
  out, but it seems that this is quite possible and not something to be
  concerned about.  So I'm not going to be concerned with it.
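  As a strawman, the mpirun side of this might boil down to something
  like the sketch below.  Every function in it is a name I made up, and
  whether this much communication is actually legal from inside the
  handler is still an open question (see "Outstanding questions" at the
  end):

      /* Both helpers are placeholders invented for this sketch.           */
      extern void notify_procs_to_quiesce(void); /* via the lamds:
                                                    "drain your traffic"    */
      extern void wait_for_quiesce_acks(void);   /* block until every rank
                                                    reports it is quiet     */

      /* What mpirun's checkpoint handler might have to do.                 */
      static void mpirun_checkpoint_handler(void)
      {
          /* 1. Tell every process in the MPI job that a checkpoint is
           *    coming, so it can finish any in-flight communication.       */
          notify_procs_to_quiesce();

          /* 2. Wait until every rank says its channels are clear.          */
          wait_for_quiesce_acks();

          /* 3. Returning from the handler is the "go" to the CR system --
           *    everyone is now in a consistent, checkpointable state, and
           *    each process dumps its own image.                           */
      }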
* Restart Interface: There are two ways that an application will get
  restarted: by the user who started the process (by hand, shell
  script, etc.) or by the batch scheduler.  LBNL has no plans to
  restart jobs out of rc.local or anything funky like that.  This is
  actually a "good thing" for us in many ways.  First, we can get some
  information from the user.  Second, we don't have to worry about
  bootstrapping ourselves automagically.  A restart sequence might be
  something along the lines of:

      % lamboot
      % mpirestart
      % lamhalt

  Notice that the LAM RTE was explicitly brought up, and not from a
  checkpoint.  That means that mpirestart, working only from
  user-supplied data and the list of image files, is going to have to
  get the LAM daemons prepared for the applications.  I think that this
  model makes life a lot easier for everyone.  First, by its very
  nature, we are going to be able to migrate - we just need to get the
  list of where to start from lamboot.  If we use the (already
  existing) scheduling system in MPIRUN, we should be able to schedule
  the restarted processes within the new LAM RTE.  Things will get a
  little bit tricky in a heterogeneous environment, so we might also
  need to have application schemas to match images to nodes.  Not a
  huge deal, and the LAM guys can probably get good code reuse out of
  what we already have.  mpirestart is, of course, just a working name
  - we can name this whatever we want.  Ideally, we could make mpirun
  smart enough to figure it all out, but I'm not sure how possible that
  is.  Something to think about.

* Implementation ideas for mpirestart: Let's assume that the user has
  fired up a LAM RTE and is calling mpirestart.  The user supplies a
  list of images.  We fire the images up.  The first thing they do is
  run a function that is basically the moral equivalent of MPI_INIT.
  Let's call it mpi_reinit for now.  This function needs to make sure
  that when the application leaves the call, it is in the same state as
  an application leaving MPI_INIT: it can talk to the other processes
  in the MPI application, all the data structures are properly
  initialized, etc.  For LAM, this is also where we could call kinit()
  to get ourselves registered with the lamd.  MPIRESTART is still
  around, so it can be used in the same way MPIRUN is used at initial
  job startup - it can be the rendezvous point for all the processes as
  they come to life.  We probably need to add some checks to make sure
  that the number of images supplied is equal to the number of
  processes that were in existence when we shut down.  This shouldn't
  be all that painful, but is something to think about.  We can
  squirrel some data away in the call stack of the MPI application
  before we checkpoint, so this shouldn't be that difficult.  Besides,
  we can always look at the size of the gps array.
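  To make the mpi_reinit idea a bit more concrete, here's a rough
  sketch of the steps it might walk through.  mpi_reinit itself is just
  a working name, and all the helpers below are placeholders invented
  for this sketch - the real thing would reuse existing LAM code
  (kinit(), the gps array, etc.):

      /* Sketch only.  Everything here is a placeholder for existing or
       * yet-to-be-written LAM code.                                      */
      extern void lam_register_with_lamd(void);    /* would wrap kinit()  */
      extern void rebuild_process_table(void);     /* ask lamd who's where */
      extern void rebuild_socket_connections(void);
      extern int  rendezvous_with_mpirestart(void); /* returns #procs seen */
      extern int  saved_nprocs(void);              /* eg, size of the gps
                                                      array we squirreled
                                                      away at checkpoint   */

      int mpi_reinit(void)
      {
          /* 1. Our old lamd registration died with the old process;
           *    redo it, just as MPI_INIT would.                          */
          lam_register_with_lamd();

          /* 2. We may have migrated, so ask the (new) lamd where everyone
           *    is and re-open the sockets the checkpoint couldn't save.  */
          rebuild_process_table();
          rebuild_socket_connections();

          /* 3. Use mpirestart as the rendezvous point, the same way
           *    mpirun is used at initial startup, and sanity-check that
           *    the number of restarted images matches the old job size.  */
          if (rendezvous_with_mpirestart() != saved_nprocs())
              return -1;   /* wrong number of images supplied */

          /* On return, the application should look exactly like one that
           * has just left MPI_INIT.                                      */
          return 0;
      }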
* Details on restarting an application: So, this is one of those points
  that I don't really know much about.  Right now, I'm going on the
  assumption that "magic happens" to get an image transmogrified into a
  running process.  For now, let's just assume that there is some
  function "cr_fire_this_thing_up(image_name)" that will do all the
  required voodoo.  There was some talk by the LBL guys of hacking the
  Linux kernel such that the image looked like a startable process (ie,
  you could just do ./image_name and the right thing would happen).
  This would be very good for the LAM team, as we could just use the
  existing job startup code and exec() the image as if it were a normal
  binary.  The lamd wouldn't even have to know whether it was firing up
  a regular binary or a checkpoint image.

Outstanding questions:
----------------------

* Can we get enough communication in MPIRUN in the signal handler
  context, or are we completely hosed?
  - What can we run in a signal handler context?
  - If we can't, what is our next option?
* What is the interface for restarting an application?