Checkpoint and Restarting

Checkpoint Frequency

Checkpointing in Folding@Clusters has two distinct parts:

mdrun generates checkpoints at the interval given by the nstxout parameter in grompp.mdp.
The nanny checks the size of the checkpoint file. If the file has become larger, it is transferred to the mother. This check happens about every two seconds (as of 1 June 2005).
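
The size check described above can be sketched as two small shell functions. This is only an illustration, not the actual nanny code: the function names `file_size` and `checkpoint_grew`, the file name `traj.trr`, and the idea of keeping the last seen size in a shell variable are all assumptions, and the transfer to the mother is left as a placeholder.

```shell
# Hypothetical sketch of the nanny's size check; not the actual nanny code.

# Print the current size of a file in bytes (0 if it does not exist yet).
# $1 = file to inspect
file_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Succeed (exit status 0) if the file is now larger than at the previous check.
# $1 = checkpoint file, $2 = size in bytes seen at the previous check
checkpoint_grew() {
    [ "$(file_size "$1")" -gt "$2" ]
}

# Possible polling loop (commented out so the sketch has no side effects):
#   last=0
#   while sleep 2; do
#       if checkpoint_grew traj.trr "$last"; then
#           last=$(file_size traj.trr)
#           # transfer traj.trr to the mother here
#       fi
#   done
```

Comparing against the last transferred size, rather than a timestamp, mirrors the "file has become larger" rule stated above.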

Signals accepted by `mdrun` and their relevance to the checkpoint/restart process

The mdrun process accepts SIGTERM and SIGUSR1. These signals can be received by an mdrun process of any rank. The effects of the signals are as follows:

SIGTERM
- Sets nsteps to the current step plus one.
SIGUSR1
- Sets nsteps to the next multiple of nstxout past the current step.
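
The arithmetic behind the two signal effects can be illustrated with a short shell sketch. The function names are hypothetical, and the sketch assumes that "past the current step" is strict, so a step that already sits on a multiple of nstxout moves to the following multiple. mdrun does this internally; nothing here calls mdrun.

```shell
# Illustrative arithmetic only; these helpers are not part of GROMACS.

# nsteps after SIGTERM: the current step plus one.
# $1 = current step
nsteps_after_sigterm() {
    echo $(( $1 + 1 ))
}

# nsteps after SIGUSR1: the next multiple of nstxout past the current step.
# Assumption: "past" is strict, so a step on a multiple moves to the next one.
# $1 = current step, $2 = nstxout
nsteps_after_sigusr1() {
    echo $(( ($1 / $2 + 1) * $2 ))
}
```

For example, with the current step at 1234 and nstxout set to 500, `nsteps_after_sigterm 1234` prints 1235 and `nsteps_after_sigusr1 1234 500` prints 1500.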

Neither of these signals provides a useful checkpointing mechanism. With some modification (removing the code that modifies nsteps), the SIGUSR1 mechanism could be useful. Note that we already use SIGINT for nanny/child communication, and SIGINT is our only free signal due to COSM.

Using GROMACS tools

Notes:

Using only tpr files as input fails when using grompp to change the number of processes the tpr file is prepared for.
grompp requires a .top file to generate tpr files.
Using both trr and edr files when restarting with grompp results in a more accurate restart (see the grompp section in the GROMACS 3.2 online documentation).

Alternative method:

Tools used:

trjconv
grompp

Files initially required:

original mdout.mdp
original .tpr
result or checkpoint file (.trr)
topology file (.top)

Files created:

new .gro
new .tpr

Process:

Create a .gro file that incorporates the result and the topology files:

     trjconv -s original.tpr -f checkpoint.trr -o new.gro

Use the .gro file as input to grompp to create a new .tpr file configured for the new number of nodes:

     grompp -f mdout.mdp -c new.gro -p topol.top -np 4 -o new.tpr
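
The two steps above can be wrapped in a small helper that prints the commands for a given node count without running them. This is a sketch under assumptions: the function name `restart_commands` and its argument order are hypothetical, and the intermediate and output names `new.gro` and `new.tpr` are taken from the commands above.

```shell
# Hypothetical helper: print the two restart commands without executing them.
# $1 = original .tpr, $2 = checkpoint .trr, $3 = .mdp file,
# $4 = .top file,     $5 = new number of nodes
restart_commands() {
    # Step 1: build a .gro file from the original run input and the checkpoint.
    echo "trjconv -s $1 -f $2 -o new.gro"
    # Step 2: prepare a new .tpr for the requested number of nodes.
    echo "grompp -f $3 -c new.gro -p $4 -np $5 -o new.tpr"
}
```

Calling `restart_commands original.tpr checkpoint.trr mdout.mdp topol.top 4` prints the two commands shown in the process. Printing them first makes it possible to review the commands before running them by hand.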

Notes:

A simulation can only be run for a set amount of time without modification. Even after restarting, the simulation cannot continue past the simulation time specified when grompp was initially executed. To extend the simulation, use tpbconv with the -until or -extend option, each of which takes a number of picoseconds as its argument:

     tpbconv -s topol.tpr -f traj.trr -e ener.edr -o new.tpr -extend 10
