Difference between revisions of "Checkpoint and Restarting"

From Earlham CS Department
 

Revision as of 15:13, 1 June 2005

Checkpoint Frequency

Checkpointing in Folding@Clusters has two distinct parts:

  1. mdrun generates checkpoints at the interval given by the nstxout parameter in grompp.mdp.
  2. The nanny checks the size of the checkpoint file. If the file has become larger, it is transferred to the mother. This check happens about every two seconds (as of 1 June 2005).
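
The size check in step 2 can be sketched roughly as follows. This is only an illustration of the idea, not the nanny's actual code: the names `checkpoint_grew`, `watch_checkpoint`, and the `transfer` callback are hypothetical, and the handling of a missing file is an assumption.

```python
import os
import time


def checkpoint_grew(path, last_size):
    """Return (grew, size): whether the file at path is larger than last_size."""
    try:
        size = os.path.getsize(path)
    except OSError:
        # Assumption: a missing or unreadable file means nothing to transfer yet.
        return False, last_size
    return size > last_size, size


def watch_checkpoint(path, transfer, interval=2.0, max_polls=None):
    """Poll the checkpoint file and call transfer(path) whenever it has grown.

    interval defaults to two seconds, matching the check frequency above.
    max_polls is a hypothetical knob so the loop can be bounded in tests.
    """
    last_size = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        grew, last_size = checkpoint_grew(path, last_size)
        if grew:
            transfer(path)  # hypothetical hook: send the file to the mother
        polls += 1
        time.sleep(interval)
```

Comparing sizes is cheap enough to repeat every couple of seconds, but it only detects growth; a checkpoint rewritten at the same size would go unnoticed by a check like this one.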

Signals to mdrun and their relevance to the checkpoint/restart process

The mdrun process accepts SIGTERM and SIGUSR1. Either signal can be received by an mdrun process of any rank. The effects of the signals are as follows:

  • SIGTERM
    • Sets nsteps to the current step plus one.
  • SIGUSR1
    • Sets nsteps to the next multiple of nstxout past the current step.
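
The arithmetic behind the two signals can be illustrated with a small sketch. The function names are hypothetical and this is not mdrun's code; it also assumes that "past the current step" means strictly greater, so a step that is already a multiple of nstxout advances to the following multiple.

```python
def nsteps_after_sigterm(current_step):
    """SIGTERM: stop after one more step."""
    return current_step + 1


def nsteps_after_sigusr1(current_step, nstxout):
    """SIGUSR1: stop at the next multiple of nstxout past current_step.

    Assumption: "past" is strict, so an exact multiple moves to the next one.
    """
    if nstxout <= 0:
        raise ValueError("nstxout must be positive")
    return (current_step // nstxout + 1) * nstxout


# Example with nstxout = 500 and the run currently at step 1234.
print(nsteps_after_sigterm(1234))       # 1235
print(nsteps_after_sigusr1(1234, 500))  # 1500
```

Under this reading, SIGUSR1 ends the run on a step where a checkpoint is written, while SIGTERM ends it almost immediately, possibly between checkpoints.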