Checkpointing

Checkpointing means capturing the state of your computation in the minimal amount of data needed to restart from that ``snapshot'' without wasting the work done to reach that stage. Since most large jobs are iterative, this is usually a matter of saving the essential information needed to start another iteration. If your computation is not iterative, it may be less obvious what to checkpoint and when. Even with an iterative calculation, you must know the size of your data set and the approximate CPU time per iteration to checkpoint wisely. Since the amount of data saved in a checkpoint will often be large, checkpoint sparingly and use the fastest I/O mechanism available (writing large binary files).
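As an illustration of saving only what the next iteration needs, here is a minimal Python sketch. The file layout (a small header followed by raw doubles), the function names and the toy `step` update are all assumptions made for this example; a real code would save whatever arrays and counters its own iteration requires.

```python
import os
import struct
from array import array

# Header: iteration number and value count, as two 64-bit integers.
HEADER = struct.Struct("<qq")

def save_checkpoint(path, iteration, values):
    """Write the iteration counter and the data as one raw binary file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(HEADER.pack(iteration, len(values)))
        array("d", values).tofile(f)  # payload in native byte order
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # rename only after the data is fully written

def load_checkpoint(path):
    """Read back the iteration counter and the data."""
    with open(path, "rb") as f:
        iteration, n = HEADER.unpack(f.read(HEADER.size))
        values = array("d")
        values.fromfile(f, n)
    return iteration, values

def step(values):
    """Hypothetical iteration: move each value halfway towards 1.0."""
    return array("d", (0.5 * (v + 1.0) for v in values))

def run(path, n_iterations, checkpoint_every):
    """Resume from the checkpoint if one exists, else start from scratch."""
    if os.path.exists(path):
        iteration, values = load_checkpoint(path)
    else:
        iteration, values = 0, array("d", [0.0] * 4)
    while iteration < n_iterations:
        values = step(values)
        iteration += 1
        if iteration % checkpoint_every == 0:
            save_checkpoint(path, iteration, values)
    return iteration, values
```

Calling `run("ckpt.bin", 6, 3)` and later `run("ckpt.bin", 10, 5)` continues from iteration 6 rather than repeating the first six iterations. Only the iteration counter and the data vector are stored, not the whole program state.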

When to checkpoint is up to you. Some users use timers to check when their job's time is nearly up, others catch signals sent by the queuing system, and others checkpoint at regular intervals (say every 30 to 60 minutes of CPU time) in case the machine crashes during their run. Cautious users keep two checkpoint files and alternate dumps between them, so that one intact file survives if the queuing system kills the job (or the machine crashes) during checkpointing.
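The strategies above can be combined in one small helper. The following Python sketch is only one possible arrangement: the class name `Checkpointer`, the `.a`/`.b` file suffixes and the use of `pickle` are assumptions for illustration, and which signal your queuing system sends before killing a job (SIGTERM is assumed here) depends on the site.

```python
import os
import pickle
import signal
import time

class Checkpointer:
    """Alternate dumps between two files; trigger on a timer or a signal."""

    def __init__(self, basename, interval_seconds, clock=time.monotonic):
        self.paths = (basename + ".a", basename + ".b")
        self.interval = interval_seconds
        self.clock = clock
        self.last = clock()     # time of the last dump (or of start-up)
        self.count = 0          # serial number of the next dump
        self.requested = False  # set when a signal has arrived

    def on_signal(self, signum, frame):
        # Only set a flag; the dump itself happens at the next safe point.
        self.requested = True

    def install(self, signum=signal.SIGTERM):
        # Assumption: the queuing system warns the job with this signal.
        signal.signal(signum, self.on_signal)

    def due(self):
        """True if a signal arrived or the interval has elapsed."""
        return self.requested or self.clock() - self.last >= self.interval

    def dump(self, state):
        path = self.paths[self.count % 2]  # overwrite the older file
        with open(path, "wb") as f:
            pickle.dump({"serial": self.count, "state": state}, f)
            f.flush()
            os.fsync(f.fileno())
        self.count += 1
        self.last = self.clock()
        self.requested = False

    def load_latest(self):
        """Return the state from the newest readable file, or None."""
        best = None
        for path in self.paths:
            try:
                with open(path, "rb") as f:
                    record = pickle.load(f)
            except (OSError, EOFError, pickle.UnpicklingError):
                continue  # missing or damaged file: try the other one
            if best is None or record["serial"] > best["serial"]:
                best = record
        if best is None:
            return None
        self.count = best["serial"] + 1  # next dump replaces the older file
        return best["state"]
```

In the main loop, a job would call `due()` once per iteration and `dump(state)` when it returns true. If the job dies while one file is being written, `load_latest()` falls back to the other file, which still holds the previous complete dump.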

You should be aware of how much time checkpointing takes - there is little point in a 10-hour job if 6 hours are spent in checkpoint I/O!
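A rough estimate of this cost can be made before the job is submitted. The sketch below assumes that each checkpoint takes roughly the data size divided by the sustained write bandwidth; the function name and the example figures are hypothetical, and the bandwidth of your own file system should be measured.

```python
def checkpoint_overhead(data_bytes, bandwidth_bytes_per_s, interval_s):
    """Estimate the fraction of wall-clock time spent writing checkpoints.

    interval_s is the compute time between two consecutive checkpoints.
    """
    write_s = data_bytes / bandwidth_bytes_per_s
    return write_s / (interval_s + write_s)

# Hypothetical figures: 20 GB of state, 200 MB/s sustained writes,
# one checkpoint after every 30 minutes of computation.
fraction = checkpoint_overhead(20e9, 200e6, 30 * 60)
print(f"{fraction:.1%} of the run is spent in checkpoint I/O")
```

With these figures each dump takes about 100 seconds, so roughly 5% of the run goes to checkpoint I/O. Halving the interval roughly doubles that share.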

Although some operating systems and queuing systems offer automatic checkpointing and even automatic restarting, this is often very inefficient, since a complete copy of the user's memory is written to disk. Generally only a fraction of this data is necessary to restart the job. It is also preferable for the user to control the frequency of checkpointing.