Job Scheduler

BlueCrystal Phase 4 uses Slurm as the job scheduler. The job scheduler’s purpose is to find available resources suitable for running your code and then to execute it.

Introduction to Slurm

Slurm is a job management application which allows you to take full advantage of BlueCrystal 4. You interact with it by writing a small shell script with some special commands which tell it how to run your code. You will pass this shell script to a command called sbatch which will ask Slurm to schedule your job. The reference documentation for sbatch can be found on the Slurm website and can be a useful source of information.

Terminology

For various historical reasons, Slurm uses some technical terms in a way which may not match what you’ve used in the past. For clarity, the terms as understood by the Slurm documentation are as follows:

Node
A physical computer sitting in a rack in our computer room. There are around 500 of these in BC4. Each of them has some CPUs, some RAM and some have GPUs inside.
Task
Each task in Slurm refers to a process running on a node. You might run 28 tasks on a node which communicate with each other via MPI or you might run 1 task which itself spins up 28 threads.
CPU
Most of the time, Slurm refers to a CPU core as a CPU. For example, the compute nodes on BC4 have two physical CPU chips (some would describe this as having two sockets), with 14 processing cores inside each; Slurm would say that this node has ‘28 CPUs’.
Partition
A set of nodes with associated restrictions on use. This is called a queue in some other systems. For example we have a test partition which only allows small, short jobs and we have a cpu partition which allows large, long jobs. Run sinfo -s to see a list of all the partitions you have access to.
Step
A job step is a part of a job. Each job will run its steps in order.

Looking at BC4

Before we start writing our job let’s take a quick look at how we can get some information about BC4. The first command to try is sinfo. If you run it you will see that it prints a lot of information so a more useful view is running sinfo -s which just prints a summary:

[root@bc4mgmt2 pillar]# sinfo -s
PARTITION      AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
veryshort         up    6:00:00     430/9/15/454  compute[076-525],highmem[10-13]
test              up    1:00:00     430/9/15/454  compute[076-525],highmem[10-13]
cpu_test          up    1:00:00          0/2/0/2  compute[074-075]
cpu*              up 14-00:00:0     426/9/15/450  compute[076-525]
hmem              up 14-00:00:0          6/2/0/8  highmem[10-17]
gpu               up 7-00:00:00        21/8/1/30  gpu[01-30]
gpu_veryshort     up    1:00:00          0/2/0/2  gpu[31-32]
serial            up 3-00:00:00          4/2/0/6  compute[068-073]
dcv               up 14-00:00:0          1/0/0/1  bc4vis1

There’s still a lot of information here but let’s go through it.

  • The first column lists the partitions. The set you see may differ from what is shown here, depending on which partitions you have access to. The asterisk next to cpu tells us that this is the default partition that will be used if you don’t specify one.
  • The availability column shows whether that partition is currently available for use. Hopefully they will all always be available unless there is some scheduled downtime.
  • The time limit shows the maximum time that a job running in that partition is allowed to run for. It will usually be in days-hours:minutes:seconds format.
  • The nodes column tells you how many nodes are in the various states. Allocated (A) are nodes that are currently running jobs, idle (I) are currently not running anything but are available to do so, other (O) are likely in some sort of maintenance mode and total (T) is the total number of nodes in that partition.
  • Finally the node list tells you the names of all the nodes which are part of that partition.

The cpu, hmem, and gpu partitions are differentiated by the type of nodes comprising them. The veryshort and test partitions contain a mix of resources but are more restricted in their available run-time. The gpu_veryshort partition has very restricted run-time and is for testing jobs requiring GPUs. The serial partition is intended for serial jobs and allows a greater number to run simultaneously with a restricted run-time. The dcv partition is only usable by the visualisation system.

Our first job script

To start with we will create a simple Slurm script and introduce the things it can do. There are two main parts to a Slurm script: the sbatch configuration flags and the steps you want to run.

sbatch configuration flags

The configuration flags are defined in your submission script by putting a line beginning with #SBATCH followed by an sbatch flag as defined in the documentation. For example, you name your job in Slurm by using the --job-name flag so the config line would be #SBATCH --job-name=myjob.

There are some flags which you want to always set and there are some which you can use if you wish. The things you want to make sure you’re setting are:

  • the name of the job (--job-name),
  • the number of nodes the job requires (--nodes),
  • the number of tasks to run on each node (--ntasks-per-node),
  • the number of CPUs to devote to each task (--cpus-per-task),
  • the time the job will run for (--time),
  • and the amount of memory the job will require per node (--mem) or per CPU.

Slurm will use these flags to decide when and where your job will run. The values to use for --nodes, --ntasks-per-node and --cpus-per-task will be determined by the type of task you are running and what your software supports (e.g. is it multi-threaded? does it support MPI?). If you exceed the limits you set for --time and --mem then your job will be killed, so make sure they are set large enough. Bear in mind, though, that on average, the smaller they are, the sooner your job is likely to start running.

There are lots of useful flags you can set which are all detailed in the documentation. Here are some selected flags which may be relevant to you, with a short example after the list:

--dependency
For example, set this job to only start after another job has finished. This can be useful if you have a complex workflow in which certain parts depend on others.
--workdir
Set the working directory of your script. By default this will be the directory in which you run sbatch.
--output and --error
Set the output of the standard output and error streams of your job.
--exclusive
Make sure that this job has sole use of any nodes it runs on.
--gres
This lets you specify ‘generic resources’ and is necessary for using GPUs.
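
For example, a job script might combine several of these optional flags; in this sketch the job ID 12345 and the file names are just placeholders:

#SBATCH --output=myjob_%j.out       # %j is replaced with the job ID
#SBATCH --error=myjob_%j.err
#SBATCH --dependency=afterok:12345  # only start once job 12345 has finished successfully
#SBATCH --exclusive                 # do not share nodes with any other jobs
#SBATCH --gres=gpu:1                # request one GPU (only meaningful on the gpu partitions)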

The beginning of our test job will then start to look like:

#!/bin/bash

#SBATCH --job-name=test_job
#SBATCH --partition=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --mem=100M

Taking these one at a time:

  • We start by setting the job name to something meaningful to us. It’s useful to put the name of the software you’re running in here as well as some summary of the configuration you’re running.
  • We select the test partition since we’re just running a small test job.
  • Next we tell Slurm about the shape of our job. In this case we are just going to run two processes (tasks) on each of two nodes, and each task is only going to require a single CPU. We are therefore going to be running a total of 4 tasks (and requiring 4 CPUs).
  • It’s only a very small short job so we choose a run time of 10 seconds and a memory usage of 100M.

It might seem like there are quite a lot of sbatch configuration lines but if you get in the habit of always defining these options, you’ll always know exactly what you’re asking for rather than depending on the defaults.

Job steps

The sbatch flags you set will allow Slurm to set aside some nodes for your job to run on when they become available. This list of nodes, together with how many cores you may use on each of them, is referred to by Slurm as your job’s allocation. Your job will run within the allocation and Slurm will try to help make sure that you do not go outside it.

Continuing our example from before:

#!/bin/bash

#SBATCH --job-name=test_job
#SBATCH --partition=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --mem=100M

echo 'My first job'
srun hostname

The commands that make up the second part of the Slurm submission script will be run, like a bash script, on a single node of your allocation. This machine (sometimes called the ‘batch host’) is the one that’s in charge of starting the processes and collating the output of your commands. This means that the echo command will only be run once, on the batch host.

For the purpose of this test, we will not be running a real piece of analysis software but will simply run the command hostname. This command prints the name of the computer it’s run on so running it on the BC4 login node will print something like bc4login2.bc4.acrc.priv. Since we want to run our hostname command 4 times (two nodes × two tasks) we must label it as a ‘job step’ by prefixing the line with srun. srun will run its commands across the allocation, once for each task.

Submitting a job

If we save that file as test_slurm.sh then we can submit it to Slurm using the sbatch command:

$ sbatch test_slurm.sh
Submitted batch job 433000

The line it prints tells you the ID number of the job. Make a note of this as this is how you can refer to the job.
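
If you are submitting jobs from another script, the --parsable flag makes sbatch print just the job ID so that you can capture it in a variable; a minimal sketch:

$ JOBID=$(sbatch --parsable test_slurm.sh)
$ echo $JOBID
433000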

For a small job like this you can expect it to start running very quickly, perhaps even immediately. For the larger jobs that you will normally run on BC4, it can take some time before it starts running. You can check the status of your waiting jobs by using squeue --state=PENDING and you can kill any running or pending job using scancel <jobid> (where <jobid> should be replaced with the ID of the job you want to cancel).
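
For example (the job ID, user name and pending reason shown here are made up):

$ squeue --state=PENDING
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 433001      test test_job    ab123 PD       0:00      2 (Priority)
$ scancel 433001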

To keep track of how our job is progressing we can use the sacct command:

$ sacct -j 433000
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
433000         test_job       test       root          4  COMPLETED      0:0
433000.batch      batch                  root          2  COMPLETED      0:0
433000.0       hostname                  root          4  COMPLETED      0:0

The three lines in the table refer to different steps of the job:

  • The first (the one that just starts with 433000) is the summary for the overall job.
  • The line below that (433000.batch) is telling you about the status of the batch host and is not usually very important (unless it fails).
  • The third line (433000.0) is telling you about the job step. Steps will be numbered from 0, one for each srun command in your submission script. You will see that the job name for the step is simply the command you ran. Looking across the table you see that AllocCPUS is 4, which matches what we wanted.

We also see that all the steps are marked as ‘completed’ so let’s have a look at the output. By default Slurm will put the output in a file called slurm-<jobid>.out (where <jobid> is replaced by the job ID) so we’ll find it in slurm-433000.out:

$ cat slurm-433000.out
My first job
compute277.bc4.acrc.priv
compute277.bc4.acrc.priv
compute278.bc4.acrc.priv
compute278.bc4.acrc.priv

We can see here that the echo command has just been run once (on the batch host) but hostname has been run four times. Looking at the nodes it ran on, we see that it ran twice on compute277 and twice on compute278 which matches our request to run on two nodes with two tasks per node.

This has hopefully given you an understanding of the basics of writing a Slurm submission script, running it and getting the output. Let’s move on now to some more specific examples of usage.

Running in parallel

There are lots of different ways to run your code in parallel ranging from the simple to the advanced.

Array jobs

We will start with the simplest parallelism method available which is array jobs. This isn’t really a type of parallelism but does have some of the concepts in common.

The problem that array jobs are solving is quite a common one in scientific analysis: you have a single script which takes some arguments, for example an atomic simulation which you can run with different starting temperatures. Using an array job you can set the simulation to run a large number of times with different parameters using only one Slurm script and command:

#!/bin/bash

#SBATCH --job-name=test_job
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --mem=100M
#SBATCH --array=100-149

srun my-analysis-program --temperature=$SLURM_ARRAY_TASK_ID

We’ve passed the flag --array=100-149 which tells Slurm how many times this array job should run. The script will be run 50 times, each time with SLURM_ARRAY_TASK_ID set to a value from 100 to 149.

The output of each subjob will be put into a file called slurm-<jobid>_<subjobid>.out.
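
If your parameters are not simple integers, a common pattern is to use the task ID to pick a line out of a parameter file. This sketch assumes a hypothetical file params.txt with one temperature value per line:

#!/bin/bash

#SBATCH --job-name=array_params
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --mem=100M
#SBATCH --array=1-50

# Read line number $SLURM_ARRAY_TASK_ID from params.txt
PARAM=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
srun my-analysis-program --temperature=${PARAM}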

OpenMP jobs (and other multi-threaded)

OpenMP is the most common method used for multi-threading programs. There are various others (Intel TBB, Boost etc.) but they all have similar behaviours as far as Slurm is concerned. The main message is that you need to tell your program how many threads to run. Many programs will by default grab as many threads as possible, but if you are sharing the node with someone else then you will interfere with each other.

To set the number of threads in an OpenMP program, you must set the OMP_NUM_THREADS environment variable. Check the documentation for your software to find out how to make it work in your case. For this example we will assume you are running an OpenMP program.

Our example program here (hello-omp) will simply print to the standard output how many threads it has available to run.

#!/bin/bash

#SBATCH --job-name=omp-test
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28
#SBATCH --time=0:0:10
#SBATCH --mem=100M

# Load modules required for runtime e.g.
module load languages/intel/2017.01

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
srun ./hello-omp

The output file for this job will look like:

Num threads: 28

MPI jobs

MPI is a technology which allows separate processes to communicate with each other. This can work within one node or between nodes.

For our example we’ll put together a simple MPI application which says hello from each process. We want to use the entirety of two nodes, so each node should run 28 tasks.

#!/bin/bash

#SBATCH --job-name=mpi-test
#SBATCH --partition=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --mem-per-cpu=100M

# Load modules required for runtime e.g.
module load languages/intel/2017.01

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./hello-mpi

Note that we also changed the --mem flag to --mem-per-cpu since for MPI jobs, that is often the known parameter. The other important thing we have to do is set the I_MPI_PMI_LIBRARY environment variable, which points Intel MPI at Slurm’s process management interface (PMI) library so that srun can launch the MPI processes correctly.

Running this job will give an output of something like (truncated):

Hello world from processor compute447.bc4.acrc.priv, rank 2 out of 56 processors
Hello world from processor compute451.bc4.acrc.priv, rank 31 out of 56 processors
Hello world from processor compute447.bc4.acrc.priv, rank 10 out of 56 processors
Hello world from processor compute447.bc4.acrc.priv, rank 11 out of 56 processors
Hello world from processor compute447.bc4.acrc.priv, rank 27 out of 56 processors
Hello world from processor compute451.bc4.acrc.priv, rank 49 out of 56 processors
...

This is a line from each process telling us its individual rank, as well as the fact that there are 56 (28×2) processes in total.

For more details about running MPI with Slurm, see the MPI guide on the Slurm website.

Hybrid OpenMP+MPI jobs

A hybrid job uses MPI to communicate between a small number of tasks while each task runs several OpenMP threads. A minimal sketch, combining the two previous examples (hello-hybrid here stands for a hypothetical program built with both MPI and OpenMP), might look like:
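
#!/bin/bash

#SBATCH --job-name=hybrid-test
#SBATCH --partition=test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=14
#SBATCH --time=0:0:10
#SBATCH --mem=100M

# Load modules required for runtime e.g.
module load languages/intel/2017.01

# Each MPI task gets 14 cores for its OpenMP threads
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
srun ./hello-hybrid

This requests the whole of two nodes: two MPI tasks per node, each with 14 CPUs available for its OpenMP threads.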

GPU jobs

There is a separate “partition” (i.e. queue) for the nodes that contain GPU cards. To run a job using both Nvidia Pascal P100 cards on one node use a submission script similar to this:

#!/bin/bash -login
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --partition gpu
#SBATCH --job-name=gpujob

module load CUDA

cd $SLURM_SUBMIT_DIR
./gpujob
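
As with the other examples, it is a good habit to set the remaining flags explicitly. A sketch requesting a single GPU (the time and memory values here are only placeholders; adjust them to suit your job) would be:

#!/bin/bash -login
#SBATCH --job-name=gpujob
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00
#SBATCH --mem=1G

module load CUDA

cd $SLURM_SUBMIT_DIR
./gpujob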

Slurm for Speakers of Other (Scheduler) Languages

The “Rosetta Stone of Workload Managers” is a useful resource for translating commands and scripts to Slurm if you already know one of PBS/Torque, LSF, SGE or LoadLeveler.

Here is another centre’s direct PBS to Slurm translation.

Here is an example of converting a BC3 job (on the left) to a BC4 one (on the right):

#!/bin/bash                                     #!/bin/bash

#PBS -N Ri                                      #SBATCH --job-name=Ri
#PBS -o OUT                                     #SBATCH -o OUT
#PBS -q himem                                   #SBATCH -p hmem
## For BC3:   (5 days=5*24)                     ## For BC4:
#PBS -l nodes=1:ppn=16,walltime=120:00:00       #SBATCH --nodes=1 --ntasks-per-node=28 --time=5-00:00:00

#! Mail to user if job aborts                   #! Mail to user if job aborts
#PBS -m a                                       #SBATCH --mail-type=FAIL

#! application name                             #! application name
application="./ri"                              application="./ri"

#! Run options for the application              #! Run options for the application
options=""                                      options=""

#####################################           #####################################
### You should not have to change ###           ### You should not have to change ###
###   anything below this line    ###           ###   anything below this line    ###
#####################################           #####################################

#! change the working directory                 #! change the working directory
# (default is home directory)                   # (default is home directory)

cd $PBS_O_WORKDIR                               cd $SLURM_SUBMIT_DIR

echo Running on host `hostname`                 echo Running on host `hostname`
echo Time is `date`                             echo Time is `date`
echo Directory is `pwd`                         echo Directory is `pwd`
echo PBS job ID is $PBS_JOBID                   echo Slurm job ID is $SLURM_JOBID
echo This job runs on the following machines:   echo This job runs on the following machines:
echo `cat $PBS_NODEFILE | uniq`                 echo $SLURM_JOB_NODELIST

#! Run the threaded exe                         #! Run the threaded exe
$application $options                           $application $options