Getting started on clusters
This document presumes zero prior knowledge of cluster computing. If instead you're an intermediate user (e.g. you have an account and have run a few jobs before but need a reminder), the table of contents is your friend.
This document gives you all the information you need to choose a system, log in to a cluster/phat node, write a script, submit it via sbatch to the scheduler, and find the output. As such, these notes cover hardware and software. (If you're a sysadmin, you may be interested in this page instead.)
Please review the information here and consider which machine fits your purpose best.
Prerequisites
- Get a cluster account. You can email admin@cs.earlham.edu or a current CS faculty member to get started. Your user account will grant access to all the servers below, and you will have a home directory at `~username` that you can access when you connect to any of them.
  - Note: if you have a CS account, you will use the same username and password for your cluster account.
- Connect through a terminal via ssh to `username@hopper.cluster.earlham.edu`. If you intend to work with these machines a lot, you should also configure your ssh keys.
- Once you are connected to Hopper, you can connect to any of the machines below. For example, to connect to Hamilton:

  ```
  ssh username@hopper.cluster.earlham.edu
  ssh username@hamilton.cluster.earlham.edu
  ```

- Now you are ready to start using Slurm on that cluster. Keep reading for more information on how to get started with Slurm.
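If you connect this way often, you can let ssh do the two-hop connection for you with a `ProxyJump` entry in `~/.ssh/config`. This is a sketch: the `hopper` and `hamilton` aliases are example names, `username` is a placeholder, and `ProxyJump` requires OpenSSH 7.3 or newer.

```
Host hopper
    HostName hopper.cluster.earlham.edu
    User username

Host hamilton
    HostName hamilton.cluster.earlham.edu
    User username
    ProxyJump hopper
```

With this in place, `ssh hamilton` connects through Hopper in a single command, and your ssh keys are used for both hops.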
Using Slurm
Slurm is our batch scheduler. It lets us run scripts in the background so that you don't have to stay logged in and monitor them manually. It also allows us to automatically distribute users/jobs across a set of machines so that we don't all get crowded into one place, helping everyone get the resources that they need.
There are two main ways to use Slurm:

- You can submit a job in a script, to run automatically in the background:

  ```
  $ sbatch my_good_script.sbatch
  ```

- You can start an interactive job through Slurm and maintain manual control:

  ```
  $ srun -n 1 --pty bash -i
  ```
Common Directives
Include these at the top of your sbatch file to configure the scheduler to your needs.
- `#SBATCH --job-name=myjob`: names your job something specific (shows up in `squeue`).
- `#SBATCH --output=file.out`: sets a file to store output from your script.
- `#SBATCH --error=file.err`: sets a file to store errors from your script.
- `#SBATCH --time=HH:MM:SS`: sets a maximum run time for your job.
- `#SBATCH --nodes=1`: sets the number of nodes to use for the job.
- `#SBATCH --cpus-per-task=10`: sets the number of CPUs to use for the job.
- `#SBATCH --mem=10G`: sets the amount of memory to use for the job.
- `#SBATCH --mail-user=myemail@earlham.edu`: sets the email address to send notifications to when the job status changes.
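Put together, a job header using these directives might look like the following sketch. The values and the email address are placeholders to adjust for your job, and the `echo` stands in for your real workload.

```shell
#!/bin/sh
#SBATCH --job-name=myjob
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=10G
#SBATCH --mail-user=myemail@earlham.edu

# #SBATCH lines are comments to the shell, so this file also runs as a plain script.
msg="replace this line with your real workload"
echo "$msg"
```

Because the directives are shell comments, you can test the body of a script locally before handing it to `sbatch`.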
Example sbatch Files
Example One
This job will do the following:
- Print "hello world!" to the log file.
- end.
A few things to notice:
- This is about as simple a job as possible.
- The job is only using one node, and one CPU on that node. Plenty of room for other users to run their jobs in parallel.
```sh
#!/bin/sh
#SBATCH --job-name hello-world
#SBATCH --nodes=1
#SBATCH -c 1

echo "hello world!"
```
Example Two
This job will do the following:
- Echo some information about the current node (printing it to the output file).
- Print the current machine hostname (to the output file, again).
- sleep for 10 seconds (wait).
- Print the current working directory (to the output file, still).
- end.
A few things to notice:
- The job is only using one node, and one CPU on that node. Plenty of room for other users to run their jobs in parallel.
- The job will notify the user `excellent_email_user@earlham.edu` via email when it starts, ends, or fails.
- The job has a max runtime of 20 minutes (`--time=20`; Slurm reads a bare number as minutes), so if something goes wrong and it doesn't end sooner, the scheduler will stop it.
```sh
#!/bin/sh
#SBATCH --time=20                   # max runtime: a bare number means minutes
#SBATCH --job-name hello-world-two
#SBATCH --nodes=1
#SBATCH -c 1                        # ask for one core
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=excellent_email_user@earlham.edu

echo "queue/partition is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"
/bin/hostname
sleep 10 # Replace this sleep command with your command line.
/bin/pwd
```
Example Three
This job will do the following:
- Load the `python/3.12` module environment on the machine it's running on.
- Run the `main.py` script with the `time` command (prints information on runtime after the command finishes).
- end.
A few things to notice:
- The job is only using one node, and 30 CPUs on that node.
- The job will notify the user `excellent_email_user@earlham.edu` via email when it starts, ends, or fails.
- The job has a max runtime of 72 hours, so if something goes wrong and it doesn't end sooner, the scheduler will stop it.
```bash
#!/bin/bash
#SBATCH --job-name=MNISTDCGAN             # Job name
#SBATCH --output=output_%j.txt            # Standard output file with job id
#SBATCH --error=error_%j.txt              # Standard error file with job id
#SBATCH --time=72:00:00                   # Maximum run time
#SBATCH --nodes=1                         # Use one node
#SBATCH --cpus-per-task=30                # Request 30 CPU cores (or more, as available)
#SBATCH --mem=16G                         # Total memory
#SBATCH --mail-type=BEGIN,END,FAIL        # Email notifications
#SBATCH --mail-user=username@earlham.edu  # Your email address

module load python/3.12  # replace with any python version you want to use
time python main.py      # replace with your python file name
```
After submitting a job, Slurm captures anything written to stdout and stderr by your programs and, when the job completes, puts it in a file called `slurm-nnn.out` (where nnn is the job number) in the directory where you ran `sbatch`. Use `more` to view it when you are looking for error messages, output file locations, etc. If you are used to using `qpeek`, you can instead just run `tail -f jobXYZ.out` or `tail -f jobXYZ.err`.
There's some more CPU management information here.
Useful Slurm Commands
Slurm has some other useful commands that you can use to interact with or view jobs that are running.
| Slurm command | Description |
|---|---|
| `sbatch` | run/submit a batch job (`.sbatch` file, see above) |
| `squeue` | show jobs currently in the queue |
| `scancel` | cancel a job by its ID |
| `scontrol show nodes` | show nodes in the cluster |
| Slurm environment variable | Description |
|---|---|
| `$SLURM_JOB_PARTITION` | the queue/partition you are in |
| `$SLURM_JOB_NODELIST` | the nodes assigned to the job (there's no equivalent of Torque's nodes file, but this variable stores that information) |
| `$SLURM_SUBMIT_DIR` | working directory from which the job was submitted |
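A quick way to see these variables in action is a snippet like the following sketch. The fallbacks are an addition so that it also runs outside a Slurm job, where the variables are unset.

```shell
# Inside a job these are set by Slurm; outside one, fall back to local values.
partition="${SLURM_JOB_PARTITION:-none (not in a Slurm job)}"
nodelist="${SLURM_JOB_NODELIST:-$(hostname)}"
workdir="${SLURM_SUBMIT_DIR:-$PWD}"

echo "partition: $partition"
echo "nodes:     $nodelist"
echo "workdir:   $workdir"
```

Run it under `sbatch` and the three lines in the output file will show where the scheduler actually placed your job.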
Cluster systems to choose from
The cluster.earlham.edu domain consists of clusters (a collection of physical servers linked through a switch to perform high-performance computing tasks with distributed memory) and jumbo servers (née "phat nodes"; a single physical server with a high ratio of disk and RAM to CPU, good for jobs demanding shared memory).
Hamilton
hamilton.cluster.earlham.edu
(Built in 2022) Designed for high CPU/RAM jobs. Used often for photogrammetry work via WebODM and CPU-based AI/ML training.
Services running on this machine
Nodes and Hardware
| Machine Name | Type | CPU | RAM | GPU |
|---|---|---|---|---|
| h0 | Head Node | AMD EPYC 24-core | 128GB | None |
| h1 | Compute Node | AMD EPYC 24-core | 256GB | None |
| h2 | Compute Node | AMD EPYC 24-core | 256GB | None |
| h3 | Compute Node | AMD EPYC 24-core | 256GB | None |
| h4 | Compute Node | AMD EPYC 24-core | 256GB | None |
| h5 | Compute Node | AMD EPYC 24-core | 256GB | None |
Faraday
faraday.cluster.earlham.edu
(Built in 2022) Designed for GPU jobs. Used for research, courses, and projects requiring GPU access, computational biophysics simulations, and notebook services.
Services running on this machine
Nodes and Hardware
| Machine Name | Type | CPU | RAM | GPU |
|---|---|---|---|---|
| f0 | Head Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
| f1 | Compute Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
| f2 | Compute Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
| f3 | Compute Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
| f4 | Compute Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
| f5 | Compute Node | AMD EPYC 16-core | 192GB | Nvidia RTX A5000 24GB |
Whedon
whedon.cluster.earlham.edu
(Built in 2015) Designed for high CPU/RAM jobs that require long-term processing. Also used for computational chemistry simulations with WebMO.
Services running on this machine
Nodes and Hardware
| Machine Name | Type | CPU | RAM | GPU |
|---|---|---|---|---|
| w0 | Head Node | 2x Intel Xeon 8-core | 256GB | None |
| w1 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w2 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w3 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w4 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w5 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w6 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
| w7 | Compute Node | 2x Intel Xeon 8-core | 256GB | None |
Lovelace
lovelace.cluster.earlham.edu
Designed for particularly high RAM jobs. Used often for large scale biology projects, such as alignment and sequence analysis. This machine also features a particularly large amount of disk space for such projects.
Services running on this machine
Nodes and Hardware
| Machine Name | Type | CPU | RAM | GPU |
|---|---|---|---|---|
| lovelace | Phat Node | 2x Intel Xeon 16-core | 1000GB | None |
Cluster software bundle
The cluster.earlham.edu servers all run a supported Debian GNU/Linux 12.
All these servers (unless otherwise noted) also feature the following software:
- Slurm (scheduler): submit a job with `sbatch jobname.sbatch`, delete it with `scancel jobID`. Running a job has its own doc section above.
- Environment modules: run `module avail` to see available software modules and `module load modulename` to load one; you may load modules in bash scripts and sbatch jobs as well.
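For instance, a job script can guard its `module load` so the same file also runs on a machine without environment modules. This is a sketch: `python/3.12` is only an example module name (check `module avail` for what is actually installed), and the fallback to the system `python3` is an assumption for portability.

```shell
#!/bin/bash
#SBATCH --job-name=module-demo
#SBATCH --nodes=1
#SBATCH -c 1

# Load a Python module if the module command is available; otherwise fall back
# to the system python3 so the script still runs without environment modules.
if command -v module >/dev/null 2>&1; then
    module load python/3.12
fi
py_version="$(python3 --version 2>&1)"
echo "using $py_version"
```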
The default shell on all these servers is bash.
The default Python version on all these servers is Python 2.x, but all have at least one Python 3 module with a collection of widely-used scientific computing libraries.
About qsub
Before Slurm we used Torque and its associated software, including qsub. This is now deprecated and should not be used on the Earlham CS cluster systems.
Tested and working 2022