
Guide to the SLURM Queuing System
on Reef

1. Introduction

On large-scale computers, many users must share available resources. Because of this, you cannot just log on to one of these systems, upload your programs, and start running them. Essentially, your programs (called batch jobs) have to "get in line" and wait their turn. And, there is more than one of these lines (called queues) from which to choose. Some queues have a higher priority than others (like the express checkout at the grocery store). The queues available to you are determined by the projects that you are involved with.

The jobs in the queues are managed and controlled by a batch queuing system, without which, users could overload systems, resulting in tremendous performance degradation. The queuing system will run your job as soon as it can while still honoring the following:

  • Meeting your resource requests
  • Not overloading systems
  • Running higher priority jobs first
  • Maximizing overall throughput

We use the SLURM queuing system. The SLURM module should be loaded automatically for you at login, giving you access to the SLURM commands.

2. Anatomy of a Batch Script

A batch script is simply a small text file that can be created with a text editor such as vi or notepad. You may create your own from scratch, or start with one of the sample batch scripts available in $SAMPLES_HOME. Although the specifics of a batch script differ slightly from system to system, a basic set of components is always required, and a few others are always good ideas. The basic components of a batch script must appear in the following order:

  • Specify Your Shell
  • Required SLURM Directives
  • The Execution Block

Note: Not all applications on Linux systems can read DOS-formatted text files. SLURM does not handle ^M characters well, nor do some compilers. To avoid complications, please remember to convert all DOS-formatted ASCII text files with the dos2unix utility before use on any HPC system. Users are also cautioned against relying on ASCII transfer mode to strip these characters, as some file transfer tools do not perform this function.
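Where dos2unix is not available, the same cleanup can be approximated with standard tools. The sketch below is an illustration only: has_cr and strip_cr are hypothetical helper names, not part of SLURM or of the Reef environment.

```shell
#!/bin/bash
# Hypothetical helpers (not part of SLURM): detect and strip DOS carriage returns.

# Succeeds (exit status 0) if the file contains at least one ^M (carriage return).
has_cr() {
    local cr
    cr="$(printf '\r')"
    grep -q "$cr" "$1"
}

# Remove every carriage return from the file in place.
# The cleaned text is written back with cat so the file keeps its permissions.
strip_cr() {
    local file="$1" tmp rc
    tmp="$(mktemp)" || return 1
    tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

A typical use would be "has_cr run.SLURM && strip_cr run.SLURM" before submitting the script.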


2.1. Specify Your Shell

First of all, remember that your batch script is a script. It's a good idea to specify which shell your script is written in. Unless you specify otherwise, SLURM will use your default login shell to run your script. To tell SLURM which shell to use, start your script with a line similar to the following, where shell is either bash, sh, ksh, csh, or tcsh:

#!/bin/shell

2.2. Required SLURM Directives

The next block of your script will tell SLURM about the resources that your job needs by including SLURM directives. These directives are actually a special form of comment, beginning with "#SBATCH". As you might suspect, the # character tells the shell to ignore the line, but SLURM reads these directives and uses them to set various values. IMPORTANT!! All SLURM directives MUST come before the first line of executable code in your script, otherwise they will be ignored.
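Because misplaced directives are ignored silently, it can help to check a script before submitting it. The following is a minimal sketch of such a check, assuming directives start at the beginning of a line with "#SBATCH"; check_sbatch_order is a hypothetical helper, not a SLURM command.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): report #SBATCH lines that appear
# after the first executable line of a batch script, where they would be ignored.
# Exit status is 0 when all directives are correctly placed, 1 otherwise.
check_sbatch_order() {
    awk '
        BEGIN { seen_code = 0; bad = 0 }
        /^[ \t]*$/ { next }                # skip blank lines
        /^#SBATCH/ {
            if (seen_code) {               # directive found after executable code
                printf "line %d ignored: %s\n", NR, $0
                bad = 1
            }
            next
        }
        /^[ \t]*#/ { next }                # shebang and ordinary comments
        { seen_code = 1 }                  # anything else counts as executable code
        END { exit bad }
    ' "$1"
}
```

For example, "check_sbatch_order run.SLURM" prints nothing for a well-formed script and lists each misplaced directive otherwise.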

Every script must include directives for the following:

  • The number of cores per node
  • The number of nodes and processes per node you are requesting
  • How nodes should be allocated
  • The maximum amount of time your job should run
  • Which queue you want your job to run in
  • Your Project ID

SLURM also provides additional optional directives. These are discussed in Optional SLURM Directives, below.

Number of Nodes and Processes Per Node

Before SLURM can schedule your job, it needs to know how many nodes you want. Before your job can be run, it will also need to know how many processes you want to run on each of those nodes. In general, you would specify one process per core, but you might want more or fewer processes depending on the programming model you are using. See Example Scripts (below) for alternate use cases.

#SBATCH --ntasks=76 # Number of MPI tasks (i.e. processes)
#SBATCH --nodes=2 # Max number of nodes to be allocated
#SBATCH --ntasks-per-node=38 # Max number of tasks on each node
#SBATCH --ntasks-per-socket=19 # Max number of tasks on each socket
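The values in these directives are related by simple arithmetic: the total task count is the number of nodes times the tasks per node. The sketch below works through that calculation, assuming 38 cores per node as in the example above; tasks_per_node is a hypothetical helper that also covers the case of several threads per task.

```shell
#!/bin/bash
# Hypothetical helper: number of tasks per node when each task runs
# a given number of threads. Arguments: cores per node, threads per task.
tasks_per_node() {
    echo $(( $1 / $2 ))
}

cores_per_node=38     # assumed from the --ntasks-per-node=38 example
nodes=2

# Pure MPI: one task per core.
per_node=$(tasks_per_node "$cores_per_node" 1)
total=$(( nodes * per_node ))

# Print the directives that correspond to these values.
echo "#SBATCH --nodes=${nodes}"
echo "#SBATCH --ntasks-per-node=${per_node}"
echo "#SBATCH --ntasks=${total}"
```

With two threads per task, "tasks_per_node 38 2" yields 19 tasks per node, which is how a hybrid MPI/OpenMP job would request fewer processes than cores.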

How Nodes Should Be Allocated

Some default behaviors in SLURM have the potential to seriously impair the ability of your scripts to run in certain situations, and could impose restrictions on submitted jobs that might cause them to wait much longer in the queue than necessary. To prevent these situations from occurring, the following SLURM directive is required in all batch scripts on Reef:

#SBATCH --distribution=cyclic:cyclic
# Distribute tasks cyclically first among nodes
# and then among sockets within a node

For an explanation of what this directive means, see the “sbatch” man page.

2.2.1. How Many Nodes and Cores to Run

Before SLURM can schedule your job, it needs to know how many nodes you want. Before your job can be run, it will also need to know how many processes you want to run on each of those nodes. In general, you would specify one process per core, but you might want more or fewer processes depending on the programming model you are using. See Example Scripts (below) for alternate use cases.

Example: Requesting all 38 cores of a single node.

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=38

2.2.2. Exclusive Node Access

Because only one job per node can be scheduled on Reef, the following SLURM directive is required in all batch scripts:

#SBATCH --exclusive

For an explanation of what this directive means, see the sbatch man page.

2.2.3. How Long to Run

Next, SLURM needs to know how long your job will run. For this, you will have to make an estimate. There are three things to keep in mind.

  • Your estimate is a limit. If your job hasn't completed within your estimate, it will be terminated.
  • Your estimate will affect how long your job waits in the queue. In general, shorter jobs will run before longer jobs.
  • Each queue has a maximum time limit. You cannot request more time than the queue allows.
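
To specify how long your job will run, include the following directive, where NN:NN:NN is the limit in hours:minutes:seconds:

#SBATCH --time=NN:NN:NN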
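If you estimate run time in minutes, the value for the time directive can be derived mechanically. A minimal sketch, assuming the hours:minutes:seconds form; to_walltime is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: convert a run-time estimate in whole minutes
# into the HH:MM:SS form used by the --time directive.
to_walltime() {
    local minutes="$1"
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

# Example: a 90-minute estimate.
echo "#SBATCH --time=$(to_walltime 90)"
```

Since the estimate is also a hard limit, it is usually worth padding it slightly before converting.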

2.2.4. Which Partition to Run In

Now, SLURM needs to know which queue (called a partition in SLURM) you want your job to run in. Your options here are determined by the current cluster topology and by project usage of cluster resources. Currently, Reef is divided into a non-GPU partition and one partition for each of the two available GPU types. Other queues may be created, with access restricted to projects that have been granted special privileges due to urgency or importance; they are not discussed here.

To see the list of queues available on the system, use the sinfo command. To specify the queue you want your job to run in, include the following directive:

#SBATCH --partition=standard # Run job in the CPU only partition

2.2.5. Your Project ID

SLURM now needs to know which project ID to charge for your job. You can use the show_usage command to find the projects that are available to you and their associated project IDs. In the show_usage output, project IDs appear in the column labeled "Subproject." Note: Users with access to multiple projects should remember that the project they specify may limit their choice of queues.

To specify the Project ID for your job, include the following directive:

#SBATCH --account=Project_ID

3. Submitting Your Job

Once your batch script is complete, you will need to submit it to SLURM for execution using the sbatch command. For example, if you have saved your script into a text file named run.SLURM, you would type:

sbatch run.SLURM

Occasionally you may want to supply one or more directives directly on the command line. Directives supplied in this way override the same directives if they are already included in your script. The syntax is the same as within a script, except that "#SBATCH" is not used. The options shown here are also accepted by "srun", which launches a command directly instead of submitting a script. For example:

srun --account=PROJECT_ID --partition=PARTITION --nodes=N --ntasks-per-node=N --time=NN:NN:NN mpirun -n N mpi_array
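When a job is submitted, sbatch prints a line of the form "Submitted batch job JOBID". Scripts that submit jobs often need that ID afterwards. The sketch below extracts it from the message; parse_job_id is a hypothetical helper, and the sbatch call is left as a comment so the sketch runs on any machine.

```shell
#!/bin/bash
# Hypothetical helper: read sbatch output on stdin and print the job ID
# from the "Submitted batch job <ID>" line.
parse_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Typical use on a system where SLURM is installed:
#   job_id=$(sbatch run.SLURM | parse_job_id)
#   echo "Submitted job ${job_id}"
```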

4. Simple Batch Script Example

The batch script below contains all of the required directives and common script components discussed above.

#!/bin/bash
# Required
#SBATCH --nodes=N
#SBATCH --ntasks-per-node=N
#SBATCH --distribution=cyclic:cyclic
#SBATCH --time=NN:NN:NN
#SBATCH --partition=PARTITION
#SBATCH --account="PROJECT_ID"
#SBATCH --exclusive
# Optional
#SBATCH --job-name=STRING
#SBATCH --output=STRING_%j.out
#SBATCH --mail-user="EMAIL_ADDRESS"
#SBATCH --mail-type=BEGIN,END,FAIL

# Load desired modules
. /cm/local/apps/environment-modules/4.2.1/init/bash
module purge
module load gcc/8.2.0 slurm/18.08.9 openmpi/4.0.3-aspen

# Set Optional Environment Variables
export MPI_DISPLAY_SETTINGS=1
export MPI_DSM_VERBOSE=1
export MPI_LAUNCH_TIMEOUT=300
export MV2_SUPPRESS_CUDA_USAGE_WARNING=1

# Executable
mpirun -n N ./MPI_BINARY

5. Job Management Commands

The table below contains commands for managing your jobs in SLURM.

Job Management Commands
Command                  Description
sbatch                   Submit a batch script.
srun                     Run a command or parallel job directly from the command line.
squeue                   Check the status of jobs.
sinfo                    Display the status of all SLURM partitions (queues) and their nodes.
scancel                  Delete a job.
scontrol hold            Place a job on hold.
scontrol release         Release a job from hold.
sacct                    Display job accounting data, including data from completed jobs.
scontrol show nodes      Display host status of all SLURM batch nodes.
scontrol show job        Display attributes of and resources allocated to a job.

6. Optional SLURM Directives

In addition to the required directives mentioned above, SLURM has many other directives, but most users will only use a few of them. Some of the more useful optional directives are listed below.

6.1. Job Identification Directives

Job identification directives allow you to identify characteristics of your jobs. These directives are voluntary, but strongly encouraged. The following table contains some useful job identification directives.

Job Identification Directives
Directive     Options   Description
--job-name    STRING    Name your job.

6.1.1. Job Name

The "--job-name" directive allows you to designate a name for your job. In addition to being easier to remember than a numeric job ID, the SLURM environment variable $SLURM_JOB_NAME inherits this value and can be used instead of the job ID to create job-specific output directories. To use this directive, add a line in the following form to your batch script:

#SBATCH --job-name=STRING

Or add the option to your srun command:

srun --job-name=STRING ...
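A job-specific output directory can be built from the job name and job ID inside the batch script. The following is a minimal sketch, assuming the standard SLURM variables SLURM_JOB_NAME and SLURM_JOB_ID; make_job_dir and the fallback values are hypothetical and only there so the sketch also runs outside a job.

```shell
#!/bin/bash
# Hypothetical helper: create a directory named <job name>_<job id> under
# the given base directory (default: current directory) and print its path.
make_job_dir() {
    local base="${1:-$PWD}"
    local name="${SLURM_JOB_NAME:-interactive}"   # fallback when not in a job
    local id="${SLURM_JOB_ID:-$$}"                # fallback: current shell PID
    local dir="${base}/${name}_${id}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Typical use inside a batch script:
#   outdir=$(make_job_dir "$HOME/results")
#   cd "$outdir"
```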

6.2. Job Environment Directives

Job environment directives allow you to control the environment in which your script will operate.

6.2.1. Interactive Batch Shell

The "--pty" option of srun allows you to request an interactive batch shell. Within that shell, you can perform normal Unix commands, including launching parallel jobs. To use it, append "--pty /bin/bash -i" to the end of your srun request. You may also add the "--x11" option to allow X-Forwarding for X-Windows-based graphical interfaces on the compute node, such as the TotalView debugger, provided X11 forwarding is enabled in the SLURM installation. For example:

srun --account=Project_ID --partition=PARTITION --nodes=N --ntasks-per-node=N --time=NN:NN:NN --pty /bin/bash -i

6.3. Reporting Directives

Reporting directives allow you to control what happens to standard output and standard error messages generated by your script. They also allow you to specify e-mail options to be executed at the beginning and end of your job.

6.3.1. Redirecting Stdout and Stderr

By default, messages written to stdout and stderr are both captured in a single file named slurm-JOBID.out, where JOBID is the ID of the job. If you specify the "--output" directive, they are written to the file named by STRING instead. Use the "--error" directive to send stderr to a separate file.

For example, to use a custom name that includes the job ID (the "%j" pattern):

#SBATCH --output=mpi_array_%j.out
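To preview the file name a given pattern will produce, the "%j" substitution can be imitated in the shell. This sketch only illustrates the naming rule and is not how SLURM performs the substitution; expand_output_name is a hypothetical helper and handles only "%j".

```shell
#!/bin/bash
# Hypothetical helper: replace every %j in an output-file pattern with a
# job ID. Arguments: pattern, job ID.
expand_output_name() {
    local pattern="$1" job_id="$2"
    printf '%s\n' "$pattern" | sed "s/%j/${job_id}/g"
}

# Example: the file that job 1234 would write with the directive above.
expand_output_name 'mpi_array_%j.out' 1234
```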

6.3.2. Setting up E-mail Alerts

Many users want to be notified when their jobs begin, end, or fail.

For Example:

#SBATCH --mail-user="joe.user.ctr@hpcmp.hpc.mil"
#SBATCH --mail-type=BEGIN,END,FAIL

6.4. Job Dependency Directives

Job dependency directives allow you to specify dependencies that your job may have on other jobs. This allows users to control the order jobs run in. These directives will generally take the following form:

#SBATCH --dependency=dependency_expression

where dependency_expression is a comma-delimited list of one or more dependencies, and each dependency is of the form:

type:jobids

where type is one of the directives listed below, and jobids is a colon-delimited list of one or more job IDs that your job is dependent upon.

Job Dependency Directives
Directive     Description
after         Execute this job after listed jobs have begun.
afterok       Execute this job after listed jobs have terminated without error.
afternotok    Execute this job after listed jobs have terminated with an error.
afterany      Execute this job after listed jobs have terminated for any reason.

For example, run a job after completion (success or failure) of job ID 1234:

#SBATCH --dependency=afterany:1234

Or, run a job after successful completion of job ID 1234:

#SBATCH --dependency=afterok:1234

For more information about job dependencies, see the sbatch man page.
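When a job depends on several earlier jobs, the type:jobids form can be assembled in the shell. The sketch below assumes the sbatch "--dependency" option; build_dependency is a hypothetical helper, and the sbatch call is shown only as a comment.

```shell
#!/bin/bash
# Hypothetical helper: join a dependency type and one or more job IDs into
# the colon-delimited form type:jobid[:jobid...].
build_dependency() {
    local dep_type="$1"
    shift
    local expr="$dep_type"
    local id
    for id in "$@"; do
        expr="${expr}:${id}"
    done
    printf '%s\n' "$expr"
}

# Typical use on a system where SLURM is installed:
#   sbatch --dependency="$(build_dependency afterok 1234 1235)" next_step.SLURM
```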

7. Environment Variables

7.1. SLURM Environment Variables

While there are many SLURM environment variables, you only need to know a few important ones to get started using SLURM. The table below lists the most important SLURM environment variables and how you might generally use them.

Frequently Used SLURM Environment Variables
SLURM Variable          Description
$SLURM_JOB_ID           Job identifier assigned to the job by the batch system.
$SLURM_SUBMIT_DIR       The absolute path of the directory from which sbatch was invoked.
$SLURM_JOB_NAME         The job name supplied by the user.

The following additional SLURM variables may be useful to some users.

Other SLURM Environment Variables
SLURM Variable          Description
$SLURM_ARRAY_TASK_ID    Index number of the task in a job array.
$SLURM_JOB_NODELIST     List of nodes assigned to the job.
$SLURM_SUBMIT_HOST      Host name on which the sbatch command was executed.
$SLURM_JOB_PARTITION    The name of the partition (queue) in which the job is running.
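A common use of these variables is to record, at the top of the job output, which job produced it. The following is a minimal sketch, assuming the standard SLURM variable names SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_JOB_PARTITION, and SLURM_SUBMIT_DIR; job_summary and its fallback values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print a one-line summary of the current job.
# Each variable falls back to a placeholder when the script runs outside SLURM.
job_summary() {
    printf 'job=%s name=%s partition=%s dir=%s\n' \
        "${SLURM_JOB_ID:-none}" \
        "${SLURM_JOB_NAME:-none}" \
        "${SLURM_JOB_PARTITION:-none}" \
        "${SLURM_SUBMIT_DIR:-$PWD}"
}

# Typical use near the top of a batch script:
#   job_summary
#   cd "${SLURM_SUBMIT_DIR:-$PWD}"
```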

7.2. Other Important Environment Variables

In addition to the SLURM environment variables, the table below lists a few other variables which are not specifically associated with SLURM. These variables are not generally required, but may be important depending on your job.

Other Important Environment Variables
Variable               Description
$OMP_NUM_THREADS       The number of OpenMP threads per process.
$MPI_DSM_DISTRIBUTE    Ensures that memory is assigned closest to the physical core where each MPI process is running.
$MPI_GROUP_MAX         Maximum number of groups within a communicator.
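For hybrid MPI/OpenMP jobs, OMP_NUM_THREADS is often tied to the cores requested per task. The sketch below assumes the job was submitted with the "--cpus-per-task" option, in which case SLURM sets SLURM_CPUS_PER_TASK; set_omp_threads is a hypothetical helper, and the default of one thread is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: set OMP_NUM_THREADS from the cores requested per task,
# falling back to a single thread when SLURM_CPUS_PER_TASK is not set.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Using ${OMP_NUM_THREADS} OpenMP thread(s) per process"
```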