Checkpoint and Restarting
Checkpoint Frequency
Checkpointing in Folding@Clusters has two distinct parts:
- mdrun generates checkpoints at the interval given by the nstxout parameter in grompp.mdp.
- The nanny checks the size of the checkpoint file. If the file has grown, it is transferred to the mother. This check happens about every two seconds (as of 1 June 2005).
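The nanny's size check can be illustrated with a minimal Python sketch. This is not the nanny's actual code: the names `checkpoint_grew`, `poll_once`, and the `transfer` callback are hypothetical, and treating a missing file as size 0 is an assumption.

```python
import os

# "About every two seconds" per the description above; the exact value is an assumption.
POLL_INTERVAL_S = 2.0


def checkpoint_grew(path, last_size):
    """Return (grew, current_size) for the checkpoint file at `path`.

    A missing or unreadable file is treated as size 0 (assumption),
    so it never counts as growth.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        size = 0
    return size > last_size, size


def poll_once(path, last_size, transfer):
    """Run one size check and call `transfer(path)` if the file grew.

    `transfer` is a hypothetical stand-in for sending the file to the mother.
    Returns the size to remember for the next check.
    """
    grew, size = checkpoint_grew(path, last_size)
    if grew:
        transfer(path)
    return size
```

A polling loop would call `poll_once` repeatedly, sleeping `POLL_INTERVAL_S` seconds between calls and carrying the returned size into the next call.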
Signals accepted by mdrun and their relevance to the checkpoint/restart process
The mdrun process accepts SIGTERM and SIGUSR1. Either signal can be received by an mdrun process of any rank. The effects of the signals are as follows:
- SIGTERM
- Sets nsteps to the current step plus one.
- SIGUSR1
- Sets nsteps to the next multiple of nstxout past the current step.
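The arithmetic behind the two signal effects can be sketched as follows. The function names are hypothetical and this is not mdrun's code. It also assumes that "past the current step" means strictly greater, so a step that is already a multiple of nstxout advances to the following multiple.

```python
def nsteps_after_sigterm(current_step):
    """SIGTERM: nsteps becomes the current step plus one."""
    return current_step + 1


def nsteps_after_sigusr1(current_step, nstxout):
    """SIGUSR1: nsteps becomes the next multiple of nstxout past the current step.

    Assumption: "past" is strict, so an exact multiple moves to the next one.
    """
    if nstxout <= 0:
        raise ValueError("nstxout must be positive")
    return (current_step // nstxout + 1) * nstxout


print(nsteps_after_sigterm(1234))       # 1235
print(nsteps_after_sigusr1(1234, 500))  # 1500
```

In both cases the run ends at the new nsteps value instead of continuing to its originally configured length.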
Neither of these signals provides a useful checkpointing mechanism. With some modification (removing the code that modifies nsteps), the SIGUSR1 mechanism could be useful. Note that we are already using SIGINT in nanny/child communication and SIGINT is our only free signal due to COSM.