Getting started on clusters

From Earlham CS Department
Latest revision as of 19:52, 29 February 2024

This document presumes zero prior knowledge of cluster computing. If instead you're an intermediate user (e.g. you have an account and have run a few jobs before but need a reminder) the table of contents is your friend.


This document gives you all the information you need to choose a system, log in to a cluster/phat node, write a script, submit it via sbatch to the scheduler, and find the output. As such, these notes cover hardware and software. (If you're a sysadmin, you may be interested in this page instead.)

Prerequisites

  1. Get a cluster account. You can email admin at cs dot earlham dot edu or a current CS faculty member to get started. Your user account will grant access to all the servers below, and you will have a home directory at ~username that you can access when you connect to any of them.
    1. Note: if you have a CS account, you will use the same username and password for your cluster account.
  2. Connect through a terminal via ssh to username@hopper.cluster.earlham.edu. If you intend to work with these machines a lot, you should also configure your ssh keys.
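If you'll be connecting often, an ssh host alias saves typing. Here's a minimal client-side sketch; the alias name is arbitrary, and "username" is a placeholder for your real account:

```shell
# One-time setup on your own machine: add a host alias so plain
# "ssh hopper" expands to the full user@host. "username" is a placeholder.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host hopper
    HostName hopper.cluster.earlham.edu
    User username
EOF
```

After this, ssh hopper is equivalent to ssh username@hopper.cluster.earlham.edu, and it combines nicely with the ssh keys setup above so you can skip password prompts entirely.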

Cluster systems to choose from

The cluster dot earlham dot edu domain consists of clusters (a collection of physical servers linked through a switch to perform high-performance computing tasks with distributed memory) and jumbo servers (née "phat nodes": a single physical server with a high ratio of disk and RAM to CPU, good for jobs demanding shared memory).

Our current machines are:

  • hamilton: 5 compute nodes, 256GB RAM per node; features most CPU cores per node and highest clock speed.
  • whedon: 7 compute nodes; 256GB of RAM per node.
  • layout: cluster; 4 compute nodes, older than whedon; features NVIDIA GPGPUs and multiple CUDA options.
  • lovelace: jumbo server.
  • pollock: jumbo server, older than lovelace but well-tested and featuring the most available disk space.

To get to, e.g., whedon, from hopper, run ssh whedon.

If you're still not sure, click here for more detailed notes.

Cluster software bundle

The cluster dot earlham dot edu servers all run a supported CentOS version.

All these servers (unless otherwise noted) also feature the following software:

  • Slurm (scheduler): submit a job with sbatch jobname.sbatch, delete it with scancel jobID. Running a job has its own doc section below.
  • Environment modules: run module avail to see available software modules and module load modulename to load one; you may load modules in bash scripts and sbatch jobs as well.
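A typical modules session looks like the following sketch. The module name python3 is illustrative; these commands only work on the cluster machines, so run module avail there to see the real names:

```shell
module avail            # list all installed software modules
module load python3     # hypothetical name: load a Python 3 build
module list             # confirm what's currently loaded
module unload python3   # unload it again when you're done
```

The same module load lines work near the top of an sbatch script, before the commands that need the software.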

The default shell on all these servers is bash.

The default Python version on all these servers is Python 2.x, but all have at least one Python 3 module with a collection of widely-used scientific computing libraries.

Using Slurm

Slurm is our batch scheduler.

You can check that it's working by running: $ srun -l hostname

You can submit a job in a script with the following: $ sbatch my_good_script.sbatch

Here's an example of a batch file; note that the time parameter may be too short for "real" runs:

#!/bin/sh
#SBATCH --time=20                # wall-clock limit, in minutes
#SBATCH --job-name hello-world
#SBATCH --nodes=1
#SBATCH -c 1                     # ask for one core
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=excellent_email_user@earlham.edu

echo "queue/partition is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

/bin/hostname
sleep 10                         # Replace this sleep command with your command line.
/bin/pwd

After you submit a job, Slurm captures anything the programs write to stdout and stderr; when the job completes, it puts that output in a file called slurm-nnn.out (where nnn is the job number) in the directory where you ran sbatch. Use more to view it when you are looking for error messages, output file locations, etc. If you are used to qpeek, you can instead just run tail -f on that output file while the job runs.
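Since the capture file is named from the job id that sbatch prints ("Submitted batch job NNN"), you can build the filename in a script. A small sketch, using a hard-coded example id:

```shell
# Suppose sbatch printed "Submitted batch job 4242". The last field is
# the job id, and the capture file is slurm-<jobid>.out.
submit_msg="Submitted batch job 4242"
jobid=${submit_msg##* }          # strip everything up to the last space
outfile="slurm-${jobid}.out"
echo "$outfile"                  # → slurm-4242.out
```

In practice you can skip the parsing by capturing the id directly, e.g. jobid=$(sbatch --parsable my_good_script.sbatch), then tail -f slurm-$jobid.out.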

An interactive command line interface is also supported for development work:

srun -n 1 --pty bash -i

This will allocate one node to you and start a session on that node. It is also possible to allocate multiple nodes, CUDA nodes, etc. Be sure to exit when you are done to return to your login shell.
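A few variations on that interactive pattern, as sketches; these only run on the cluster systems, and sensible sizes and limits are site-dependent:

```shell
srun -n 1 --pty bash -i             # one task, interactive shell
srun -n 1 -c 4 --pty bash -i        # same, but ask for four cores
srun -n 1 --time=60 --pty bash -i   # cap the session at 60 minutes
exit                                # leave the node, back to your login shell
```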

There's some more CPU management information at https://slurm.schedmd.com/cpu_management.html.

Conversion from Torque to Slurm

To submit a job to Slurm, you'll need to write a shell script wrapper and submit it through sbatch on your system of choice. For people familiar with PBS/Torque the pattern is very similar. For example (change the specific options to suit your job):


Commands

  Torque        Slurm                                         Description
  qsub          sbatch                                        run/submit a batch job
  qstat         squeue                                        show jobs currently in the queue
  qdel          scancel                                       cancel a job
  pbsnodes -a   scontrol show nodes                           show nodes in the cluster
  qstat ...     scontrol update NodeName=w[0-6] State=RESUME  resurrect nodes that are offline

Environment Variables

  Torque              Slurm                  Description
  $PBS_QUEUE          $SLURM_JOB_PARTITION   the queue/partition you are in
  cat $PBS_NODEFILE   $SLURM_JOB_NODELIST    no direct equivalent of the nodes file, but this variable stores the same information
  $PBS_O_WORKDIR      $SLURM_SUBMIT_DIR      working directory from which the job was submitted
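For anyone converting an old qsub wrapper by hand, here is a line-by-line sketch mapping a typical Torque header to its Slurm equivalent. The paths and address are placeholders, and the nodes/ppn mapping is approximate: ppn can correspond to tasks per node or cores per task depending on whether your program uses processes or threads.

```shell
#PBS -N out                    ->   #SBATCH --job-name=out
#PBS -q batch                  ->   #SBATCH --partition=batch
#PBS -l nodes=1:ppn=1          ->   #SBATCH --nodes=1
                                    #SBATCH -c 1
#PBS -m abe                    ->   #SBATCH --mail-type=BEGIN,END,FAIL
#PBS -M username@example.com   ->   #SBATCH --mail-user=username@example.com
#PBS -o /path/to/my/stdout     ->   #SBATCH --output=/path/to/my/stdout
#PBS -e /path/to/my/stderr     ->   #SBATCH --error=/path/to/my/stderr
```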

Example script:

#!/usr/bin/bash
#SBATCH --job-name hello-world
#SBATCH --nodes=5
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=excellent_email_user@earlham.edu

echo "queue is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

srun -l echo "hello world!"

If your motivation is to run ODM inside a Docker container, then this pattern should work on pollock and lovelace. Note that ODM's own log file is not created (an interaction between nohup and Slurm), but you can use/save the slurm-###.out file, which contains the same information.

Normally you preface each command in a Slurm batch file with srun (slurm-run); with Docker/ODM this appears to make things go pear-shaped, so the example below omits it.

#!/bin/sh
#SBATCH --job-name stod-slurm-test-2D-lowest
#SBATCH --nodes=1
#SBATCH -c 4                     # ask for four cores
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=charliep@cs.earlham.edu

echo "queue/partition is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

sudo rm -rf 2D-lowest-f1
sudo rm -rf tmp

nohup ~/gitlab-current/images/drone_image_tools/assemble-odm.sh -r lowest -i images -d 2 -e charliep@cs.earlham.edu &
wait

exit 0

About qsub

Before Slurm we used Torque and its associated software, including qsub. This is now deprecated and should not be used on the Earlham CS cluster systems.

Tested and working 2022