Checkpoint and Restarting
Revision as of 15:13, 1 June 2005
Checkpoint Frequency
Checkpointing in Folding@Clusters has two distinct parts:
- mdrun writes checkpoints at the interval given by the nstxout parameter in grompp.mdp.
- The nanny checks the size of the checkpoint file. If the file has become larger, it is transferred to the mother. This check happens about every two seconds (as of 1 June 2005).
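The nanny's half of this scheme is essentially a polling loop. The sketch below assumes the nanny only compares the current size of the checkpoint file with the last size it saw, and hands the file off whenever it has grown. `watch_checkpoint` and the `transfer` callback are hypothetical names used for illustration: `transfer` stands in for however the nanny actually sends the file to the mother, and nothing here is taken from the Folding@Clusters code.

```python
import os
import time
from typing import Callable, Optional


def watch_checkpoint(path: str,
                     transfer: Callable[[str], None],
                     interval: float = 2.0,
                     max_polls: Optional[int] = None) -> int:
    """Poll `path` every `interval` seconds; call `transfer(path)` when it grows.

    Hypothetical sketch of the nanny's size check. `max_polls` bounds the loop
    (None means poll forever). Returns the number of transfers performed.
    """
    last_size = 0
    transfers = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            size = 0  # mdrun has not written a checkpoint yet
        if size > last_size:
            # The file has become larger, so ship the new checkpoint.
            transfer(path)
            transfers += 1
            last_size = size
        polls += 1
        time.sleep(interval)
    return transfers
```

Because only growth triggers a transfer, a checkpoint that is rewritten to the same or a smaller size would not be sent; whether that matters depends on how mdrun writes the file, which the description above does not specify.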
Signals to mdrun and their relevance to the checkpoint/restart process
The mdrun process accepts SIGTERM and SIGUSR1. These signals can be received by an mdrun process of any rank. Their effects are as follows:
- SIGTERM
- Sets nsteps to the current step plus one.
- SIGUSR1