Getting started on clusters
This document presumes zero prior knowledge of cluster computing. If instead you're an intermediate user (e.g. you have an account and have run a few jobs before but need a reminder) the table of contents is your friend.
This document gives you all the information you need to choose a system, log in to a cluster/phat node, write a script, submit it via sbatch to the scheduler, and find the output. As such, these notes cover hardware and software. (If you're a sysadmin, you may be interested in this page instead.)
- Get a cluster account. You can email admin at cs dot earlham dot edu or a current CS faculty member to get started. Your user account will grant access to all the servers below, and you will have a home directory at
~username that you can access when you connect to any of them.
- Note: if you have a CS account, you will use the same username and password for your cluster account.
- Connect through a terminal via ssh to
email@example.com. If you intend to work with these machines a lot, you should also configure your ssh keys.
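As a rough sketch of the key setup mentioned above: the function below prints a Host stanza you could append to ~/.ssh/config. The alias hopper, HOST, and USER are placeholders and the function name is made up for illustration; none of them are values confirmed by this page.

```shell
# Sketch: print a Host stanza for ~/.ssh/config.
# All three arguments are placeholders you must supply yourself.
make_ssh_stanza() {
    alias_name="$1"   # short name you want to type, e.g. hopper
    host_name="$2"    # full host name of the login machine
    user_name="$3"    # your cluster username
    printf 'Host %s\n    HostName %s\n    User %s\n    IdentityFile ~/.ssh/id_ed25519\n' \
        "$alias_name" "$host_name" "$user_name"
}

# One-time setup, run on your own machine (not on the cluster):
#   ssh-keygen -t ed25519          # creates ~/.ssh/id_ed25519 and id_ed25519.pub
#   ssh-copy-id USER@HOST          # installs the public key on the cluster
#   make_ssh_stanza hopper HOST USER >> ~/.ssh/config
```

After that, ssh hopper should connect without asking for your password, provided the key was copied successfully.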
Cluster systems to choose from
The cluster dot earlham dot edu domain consists of clusters (collections of physical servers linked through a switch to perform high-performance computing tasks with distributed memory) and jumbo servers (née "phat nodes"; each is a single physical server with a high ratio of disk+RAM to CPU, good for jobs demanding shared memory).
Our current machines are:
- hamilton: newest cluster; 5 compute nodes, 256GB RAM per node; features the most CPU cores per node and the highest clock speed.
- whedon: 7 compute nodes; 256GB of RAM per node.
- layout: cluster; 4 compute nodes, pre-whedon; features NVIDIA GPGPUs and multiple CUDA options.
- lovelace: newest jumbo server.
- pollock: jumbo server, older than lovelace but well-tested and featuring the most available disk space.
To get to, e.g., whedon from hopper, run ssh followed by the machine's name (e.g. ssh whedon).
If you're still not sure, click here for more detailed notes.
Cluster software bundle
The cluster dot earlham dot edu servers all run a supported CentOS version.
All these servers (unless otherwise noted) also feature the following software:
- Slurm (scheduler): submit a job with
sbatch jobname.sbatch, delete it with
scancel jobID. Running a job has its own doc section below.
- Environment modules: run
module avail to see available software modules and
module load modulename to load one; you may load modules in bash scripts and sbatch jobs as well.
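A minimal sketch of loading a module from inside a script, assuming the module command is set up by your login shell. The function name load_module and the module name python3 are placeholders; use a name that module avail actually prints.

```shell
# Sketch only: "python3" below is a placeholder module name.
load_module() {
    # `module` is usually a shell function defined at login, so check for it first.
    if command -v module >/dev/null 2>&1; then
        module load "$1"
    else
        echo "module command not found; are you on a cluster machine?" >&2
        return 1
    fi
}

# Typical use inside a bash script or sbatch job:
#   load_module python3 && python3 --version
```

Checking for the command first lets the same script fail with a readable message on a machine without environment modules instead of dying on an unknown command.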
The default shell on all these servers is bash.
The default Python version on all these servers is Python 2.x, but all have at least one Python 3 module with a collection of widely-used scientific computing libraries.
Slurm is our batch scheduler.
You can check that it's working by running:
srun -l hostname
You can submit a job in a script with sbatch, for example sbatch jobname.sbatch.
Here's an example of a batch file. Note that the time parameter may be too short for "real" runs:
    #!/bin/sh
    # time limit in minutes
    #SBATCH --time=20
    #SBATCH --job-name hello-world
    #SBATCH --nodes=1
    # ask for one core
    #SBATCH -c 1
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --mail-user=firstname.lastname@example.org
    echo "queue/partition is $SLURM_JOB_PARTITION"
    echo "running on $SLURM_JOB_NODELIST"
    echo "work directory is $SLURM_SUBMIT_DIR"
    srun -l /bin/hostname
    # Replace this sleep command with your command line.
    srun sleep 10
    srun -l /bin/pwd
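Once a batch file like this exists, submitting and tracking it might look like the sketch below. The function name submit_job and the file name hello-world.sbatch are hypothetical; sbatch --parsable, squeue -j, and scancel are standard Slurm commands.

```shell
# Sketch: submit a batch file and print only the numeric job ID.
submit_job() {
    # --parsable makes sbatch print "jobid" (or "jobid;cluster") instead of a sentence.
    out=$(sbatch --parsable "$1") || return 1
    # Strip the optional ";cluster" suffix.
    printf '%s\n' "${out%%;*}"
}

# Example session (hello-world.sbatch is a hypothetical file name):
#   jobid=$(submit_job hello-world.sbatch)
#   squeue -j "$jobid"     # is it pending or running?
#   scancel "$jobid"       # cancel it if something is wrong
```

Keeping the job ID in a variable saves you from copying it out of the sbatch message by hand.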
Interactive and command line interfaces also exist. After you submit a job, Slurm captures anything the programs write to stdout and stderr and writes it to a file called slurm-nnn.out (where nnn is the job number) in the directory where you ran sbatch. Use more to view it when you are looking for error messages, output file locations, etc.
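A small sketch for finding that file and scanning it for errors. It assumes the batch file does not override the default output name; the function names are made up for illustration.

```shell
# Sketch: locate a job's default output file and scan it.
job_output_file() {
    # $1 = job ID, $2 = directory where sbatch was run (defaults to the current one)
    printf '%s/slurm-%s.out\n' "${2:-.}" "$1"
}

scan_job_output() {
    file=$(job_output_file "$1" "$2")
    if [ ! -f "$file" ]; then
        echo "no output yet: $file" >&2
        return 1
    fi
    # Print lines mentioning errors, with line numbers; grep exits non-zero if none match.
    grep -n -i 'error' "$file"
}

# Example (4242 is a made-up job ID):
#   more "$(job_output_file 4242)"
#   scan_job_output 4242
```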
If you are used to using
qpeek, you can instead just run
tail -f jobXYZ.out or
tail -f jobXYZ.err.
There's some more CPU management information here.
Conversion from Torque to Slurm
To submit a job to Slurm, you'll need to write a shell script wrapper and submit it through sbatch on your system of choice. For people familiar with PBS, the pattern is very similar. For example (change the specific options):
||run/submit a batch job|
||show jobs currently in the queue|
||cancel a job|
||show nodes in the cluster|
||resurrect nodes that are offline|
||the queue/partition you are in|
||there's no equivalent of the nodes file but there is an environment variable that stores that information|
||working directory from which the command was run|
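For reference, the commonly used Torque and Slurm counterparts for the rows above can be sketched as a lookup function. These are the standard names from each tool, not necessarily the exact entries of the original table, and the function name torque_to_slurm is hypothetical.

```shell
# Sketch: print the usual Slurm counterpart of a Torque/PBS command or variable.
torque_to_slurm() {
    case "$1" in
        qsub)          echo "sbatch" ;;                # run/submit a batch job
        qstat)         echo "squeue" ;;                # show jobs currently in the queue
        qdel)          echo "scancel" ;;               # cancel a job
        pbsnodes)      echo "sinfo" ;;                 # show nodes in the cluster
        PBS_QUEUE)     echo "SLURM_JOB_PARTITION" ;;   # the queue/partition you are in
        PBS_NODEFILE)  echo "SLURM_JOB_NODELIST" ;;    # node list (a variable, not a file)
        PBS_O_WORKDIR) echo "SLURM_SUBMIT_DIR" ;;      # directory the job was submitted from
        *)             echo "unknown: $1" >&2; return 1 ;;
    esac
}

# Bringing an offline node back is an admin task in Slurm, typically:
#   scontrol update NodeName=NODE State=RESUME    # NODE is a placeholder
```

For example, torque_to_slurm qdel prints scancel.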
    #!/usr/bin/bash
    #SBATCH --job-name hello-world
    #SBATCH --nodes=5
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --mail-user=email@example.com
    echo "queue is $SLURM_JOB_PARTITION"
    echo "running on $SLURM_JOB_NODELIST"
    echo "work directory is $SLURM_SUBMIT_DIR"
    srun -l echo "hello world!"
If your motivation is to run ODM inside a Docker container, then this pattern should work on pollock and lovelace. Note that the log file is not created (an interaction between nohup and Slurm), but you can use/save the slurm-###.out file, which has the same information in it.
Normally you preface each command in a Slurm file with srun (slurm-run); with Docker/ODM this appears to make things go pear-shaped.
    #!/bin/sh
    #SBATCH --job-name stod-slurm-test-2D-lowest
    #SBATCH --nodes=1
    # ask for four cores
    #SBATCH -c 4
    #SBATCH --mail-type=BEGIN,END,FAIL
    #SBATCH --mail-user=firstname.lastname@example.org
    echo "queue/partition is $SLURM_JOB_PARTITION"
    echo "running on $SLURM_JOB_NODELIST"
    echo "work directory is $SLURM_SUBMIT_DIR"
    # remove 2D-lowest-f1 and tmp before starting
    sudo rm -rf 2D-lowest-f1
    sudo rm -rf tmp
    # start the ODM script in the background (no srun), then wait for it to finish
    nohup ~/gitlab-current/images/drone_image_tools/assemble-odm.sh -r lowest -i images -d 2 -e email@example.com &
    wait
    exit 0
Before Slurm we used Torque and its associated software, including qsub. This is now deprecated and should not be used on the Earlham CS cluster systems.