Getting started on clusters

This document presumes zero prior knowledge of cluster computing. If instead you're an intermediate user (e.g., you have an account and have run a few jobs before but need a reminder), the table of contents is your friend.

This document gives you all the information you need to choose a system, log in to a cluster or jumbo server, write a script, submit it via qsub to the scheduler, and find the output. As such, these notes cover hardware and software. (If you're a sysadmin, you may be interested in the Sysadmin:Services:ClusterOverview page instead.)

Prerequisites

  1. Get a cluster account. You can email admin at cs dot earlham dot edu or a current CS faculty member to get started. Your user account will grant access to all the servers below, and you will have a home directory at ~username that you can access when you connect to any of them.
  2. Connect through a terminal via ssh to username@hopper.cluster.earlham.edu. If you intend to work with these machines a lot, you should also configure your ssh keys, as in the example below.
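
For example, a first connection plus optional key setup might look like the following sketch, where username stands in for your own account name:

ssh username@hopper.cluster.earlham.edu

# optional: generate a local key pair, then copy the public key to hopper
# so that future logins don't prompt for a password
ssh-keygen -t ed25519
ssh-copy-id username@hopper.cluster.earlham.edu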

Cluster systems to choose from

The cluster dot earlham dot edu domain consists of clusters (collections of physical servers linked through a switch to perform high-performance computing tasks with distributed memory) and jumbo servers (formerly "phat nodes"; single physical servers with a high ratio of disk and RAM to CPU, good for jobs demanding shared memory).

Our current machines are:

  • whedon: newest cluster; 8 compute nodes
  • layout: cluster; 4 compute nodes, older than whedon, features NVIDIA GPGPUs and multiple CUDA options
  • lovelace: newest jumbo server
  • pollock: jumbo server, older than lovelace but well-tested and featuring the most available disk space

To get to, e.g., whedon, from hopper, run ssh whedon.

If you're still not sure, see Choosing a computing resource for more detailed notes.

Cluster software bundle

The cluster dot earlham dot edu servers all run a supported CentOS version.
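
If you want to check which release a given server is running, the standard CentOS release file has the answer:

cat /etc/centos-release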

All these servers (unless otherwise noted) also feature the following software:

  • Torque (scheduler): submit a job with qsub jobname.qsub, delete it with qdel jobID. Running a job has its own doc section below.
  • Environment modules: run module avail to see available software modules and module load modulename to load one; you may load modules in bash scripts and qsub jobs as well (see the example below).
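
A typical module session looks something like this (modulename is a placeholder, as above):

module avail              # list the modules you can load
module load modulename    # add one to your environment
module list               # show what's currently loaded
module unload modulename  # remove it again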

The default shell on all these servers is bash.

The default Python version on all these servers is Python 2.x, but all have at least one Python 3 module with a collection of widely-used scientific computing libraries.
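
The exact Python 3 module name varies by server, so check module avail first; a hypothetical session might look like:

module avail python    # see which Python modules this server offers
module load python3    # hypothetical name; substitute one from the avail list
python3 --version      # confirm you now have a Python 3 interpreter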

Using qsub

Torque is our scheduler. The command to submit something to be run on the scheduler is qsub. Knowing how to use qsub is the key to using our high-performance computing systems effectively.

Submit job using a .qsub file (recommended)

To submit a job to PBS, you'll need to write a shell script wrapper around it and submit it through qsub on your system of choice. For example (change the specific options):

#!/bin/bash
#PBS -N out
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -m abe
#PBS -M username@example.com
#PBS -o /path/to/my/stdout
#PBS -e /path/to/my/stderr

# your commands go below the directives, for example:
echo "running on `hostname`"

In this example:

  • -N gives the job name
  • -q tells Torque which queue to submit the job to
  • -l tells Torque how many nodes and how many processors-per-node to request
    • on a jumbo server (phat node), nodes=1; on a cluster, usually nodes=N where N>1
  • -m tells Torque when to send mail: a (job aborted), b (job begins), e (job ends); abe enables all three
  • -M tells Torque which email address to notify
  • -o tells Torque where to copy STDOUT to
  • -e tells Torque where to copy STDERR to

It's sometimes useful to include the following snippet as well, so that your output files remind you of some other system details in the days or weeks after your job terminates:

echo "running on `cat $PBS_NODEFILE`"
echo "hostname is `hostname`"
echo "on launch cwd is `pwd`"
echo "PBS_O_WORKDIR is $PBS_O_WORKDIR"

cd $PBS_O_WORKDIR

Using a qsub file is the recommended approach because it makes it easy to gather your files and share them with others if you run into an issue.
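
Submission itself is one command. A sketch of a session, where myjob.qsub and the job ID are placeholders:

qsub myjob.qsub       # prints a job ID, e.g. 12345.hopper
qstat -u username     # watch your job while it queues and runs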

Submit job via command line interface

You can also pass some qsub options (here, the resource request) on the command line rather than in the script. Here's an example script:

#!/bin/sh
#PBS -N primes_block
#PBS -o /cluster/home/skylar/athena/src/primes4/out.txt
#PBS -e /cluster/home/skylar/athena/src/primes4/err.txt
#PBS -q batch
#PBS -m abe

echo "Hello world!" 
exit $?    # exit with the status of the last command

  • To submit that job, you might run something like this on b0:
qsub -l nodes=8,cput=0:30:0 ./hello.sh

This will request 8 nodes for a total of 30 minutes CPU time (granted, this is an absurdly long time to request for "hello world").
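
The -l resource list accepts other limits as well; these illustrative lines (not tuned for any particular server) show the nodes:ppn and walltime forms:

qsub -l nodes=2:ppn=4 ./hello.sh              # 2 nodes, 4 processors per node
qsub -l nodes=1,walltime=1:00:00 ./hello.sh   # 1 node, 1 hour of wall-clock time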

Related commands

Including qsub for search convenience, here are some common Torque commands you'll probably use (in descending order of probable frequency), with sample invocations after the list:

  • qsub: Submit a job to a queue.
  • qstat: Show a list of running, queued, or recently ended jobs.
  • qdel: Delete a job you've submitted.
  • qhold: Hold a job from execution.
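
Sample invocations, where the job ID 12345 is a placeholder:

qstat                 # every job the scheduler knows about
qstat -u username     # just your jobs
qdel 12345            # delete job 12345
qhold 12345           # hold job 12345 (qrls 12345 releases it)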

Each command has a man page.