Getting started on clusters

From Earlham CS Department
Latest revision as of 13:57, 18 November 2022

This document presumes zero prior knowledge of cluster computing. If instead you're an intermediate user (e.g. you have an account and have run a few jobs before but need a reminder) the table of contents is your friend.


This document gives you all the information you need to choose a system, log in to a cluster/phat node, write a script, submit it via sbatch to the scheduler, and find the output. As such, these notes cover hardware and software. (If you're a sysadmin, you may be interested in this page instead.)

Prerequisites

  1. Get a cluster account. You can email admin at cs dot earlham dot edu or a current CS faculty member to get started. Your user account will grant access to all the servers below, and you will have a home directory at ~username that you can access when you connect to any of them.
    1. Note: if you have a CS account, you will use the same username and password for your cluster account.
  2. Connect through a terminal via ssh to username@hopper.cluster.earlham.edu. If you intend to work with these machines a lot, you should also configure your ssh keys.
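If you plan to work with these machines a lot, key-based login saves retyping your password. A minimal sketch, assuming OpenSSH on your local machine (the ed25519 key type and ssh-copy-id are assumptions, though any modern OpenSSH ships both):

```shell
# On your LOCAL machine: generate a key pair if ~/.ssh has none yet.
ssh-keygen -t ed25519

# Copy the public key to hopper; you'll enter your password one last time.
ssh-copy-id username@hopper.cluster.earlham.edu

# Subsequent logins use the key instead of the password.
ssh username@hopper.cluster.earlham.edu
```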

Cluster systems to choose from

The cluster.earlham.edu domain consists of clusters (collections of physical servers linked through a switch to perform high-performance computing tasks with distributed memory) and jumbo servers (née "phat nodes": single physical servers with a high ratio of disk+RAM to CPU, good for jobs demanding shared memory).

Our current machines are:

  • hamilton: 5 compute nodes, 256GB RAM per node; features the most CPU cores per node and the highest clock speed.
  • whedon: 7 compute nodes; 256GB of RAM per node.
  • layout: cluster; 4 compute nodes; older than whedon; features NVIDIA GPGPUs and multiple CUDA options.
  • lovelace: jumbo server.
  • pollock: jumbo server, older than lovelace but well-tested and featuring the most available disk space.

To get to, e.g., whedon, from hopper, run ssh whedon.

If you're still not sure, click here for more detailed notes.

Cluster software bundle

The cluster.earlham.edu servers all run a supported CentOS version.

All these servers (unless otherwise noted) also feature the following software:

  • Slurm (scheduler): submit a job with sbatch jobname.sbatch, delete it with scancel jobID. Running a job has its own doc section below.
  • Environment modules: run module avail to see available software modules and module load modulename to load one; you may load modules in bash scripts and sbatch jobs as well.
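The two bullets above combine naturally in a batch script: load the module you need, then run your program with srun. A minimal sketch; the module name python3 is a placeholder, so run module avail first to find the real names on your chosen system:

```shell
#!/bin/sh
#SBATCH --job-name module-demo
#SBATCH --nodes=1

# Make the module's software visible to this job, then run it.
module load python3       # placeholder name; check `module avail`
srun python3 --version
```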

The default shell on all these servers is bash.

The default Python version on all these servers is Python 2.x, but all have at least one Python 3 module with a collection of widely-used scientific computing libraries.

Using Slurm

Slurm is our batch scheduler.

You can check that it's working by running: srun -l hostname

You can submit a job in a script with the following: sbatch my_good_script.sbatch

Here's an example of a batch file; note that the time parameter may be too short for "real" runs (a bare number like --time=20 means 20 minutes):

#!/bin/sh
#SBATCH --time=20
#SBATCH --job-name hello-world
#SBATCH --nodes=1 
#SBATCH -c 1 # ask for one core
#SBATCH --mail-type=BEGIN,END,FAIL 
#SBATCH --mail-user=excellent_email_user@earlham.edu

echo "queue/partition is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

srun -l /bin/hostname
srun sleep 10           # Replace this sleep command with your command line. 
srun -l /bin/pwd

Interactive and command-line interfaces also exist. After you submit a job, Slurm captures anything the programs write to stdout and stderr; when the job completes, it puts that output in a file called slurm-nnn.out (where nnn is the job number) in the directory where you ran sbatch. Use more to view it when you are looking for error messages, output file locations, etc.

If you are used to using qpeek, you can instead just run tail -f on the job's output file (by default slurm-XYZ.out, where XYZ is the job ID).
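Putting the pieces together, a typical submit-and-watch loop looks like this. A sketch only: 1234 stands in for whatever job ID sbatch prints when you submit.

```shell
# Submit; sbatch replies with e.g. "Submitted batch job 1234".
sbatch my_good_script.sbatch

# Show only your own jobs in the queue.
squeue -u $USER

# Follow the job's captured stdout/stderr while it runs.
tail -f slurm-1234.out
```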

There's some more CPU management information here.

Conversion from Torque to Slurm

To submit a job to Slurm, you'll need to write a shell script wrapper and submit it through sbatch on your system of choice. If you're familiar with PBS/Torque, the pattern is very similar. For example (change the specific options):


Commands

  Torque              Slurm                                          Description
  qsub                sbatch                                         run/submit a batch job
  qstat               squeue                                         show jobs currently in the queue
  qdel                scancel                                        cancel a job
  pbsnodes -a         scontrol show nodes                            show nodes in the cluster
  qstat ...           scontrol update NodeName=w[0-6] State=RESUME   resurrect nodes that are offline

Environment Variables

  Torque              Slurm                  Description
  $PBS_QUEUE          $SLURM_JOB_PARTITION   the queue/partition you are in
  cat $PBS_NODEFILE   $SLURM_JOB_NODELIST    there's no equivalent of the nodes file, but this environment variable stores the same information
  $PBS_O_WORKDIR      $SLURM_SUBMIT_DIR      working directory from which the command was run

Example scripts

#!/usr/bin/bash

#SBATCH --job-name hello-world
#SBATCH --nodes=5
#SBATCH --mail-type=BEGIN,END,FAIL 
#SBATCH --mail-user=excellent_email_user@earlham.edu

echo "queue is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

srun -l echo "hello world!"

If your goal is to run ODM inside a Docker container, then this pattern should work on pollock and lovelace. Note that the log file is not created (an interaction between nohup and Slurm), but you can use/save the slurm-###.out file, which has the same information in it.

Normally you preface each command in a Slurm batch file with srun ("slurm run"); with Docker/ODM this appears to make things go pear-shaped.

#!/bin/sh
#SBATCH --job-name stod-slurm-test-2D-lowest
#SBATCH --nodes=1 
#SBATCH -c 4 # ask for four cores
#SBATCH --mail-type=BEGIN,END,FAIL 
#SBATCH --mail-user=charliep@cs.earlham.edu

echo "queue/partition is $SLURM_JOB_PARTITION"
echo "running on $SLURM_JOB_NODELIST"
echo "work directory is $SLURM_SUBMIT_DIR"

sudo rm -rf 2D-lowest-f1
sudo rm -rf tmp

nohup ~/gitlab-current/images/drone_image_tools/assemble-odm.sh -r lowest -i images -d 2 -e charliep@cs.earlham.edu &
wait

exit 0

About qsub

Before Slurm we used Torque and its associated software, including qsub. This is now deprecated and should not be used on the Earlham CS cluster systems.

Tested and working as of 2022.