HPC4 - USE OF SLURM JOB SCHEDULING SYSTEM

The Simple Linux Utility for Resource Management (SLURM) is the resource management and job scheduling system of the cluster. All jobs on the cluster must be executed through SLURM: to run a job or application, you must submit a job script to SLURM.

A SLURM script covers three essential aspects:

  • Prescribing the resource requirements for the job: the script defines the resources needed to run the job, such as CPU cores, memory, GPUs, time limits, and any other relevant constraints.
  • Setting the environment: the script configures environment variables, module dependencies, paths, and other settings the job needs in order to run.
  • Specifying the work to be carried out: the script lists the shell commands to execute, i.e. the actual computational work that produces the intended results.
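
As a rough sketch, a job script combining these three aspects might look like the following; the partition, account, and module names are placeholders and must be replaced with real values:

#!/bin/bash
# 1. Resource requirements
#SBATCH --job-name=demo
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00
#SBATCH --partition=<partition_to_use>
#SBATCH --account=<projectgroupname>

# 2. Set up the environment
module load <required_module>

# 3. The work to be carried out
./my_program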

 

Common operations with SLURM
Purpose  Command 
To check what queues (partitions) are available:  sinfo 
To submit a job:  sbatch <your_job_script> 
To view the queue status:  squeue 
To view the queue status of your job:  squeue -u $USER 
To cancel a running or pending job:  scancel <your_slurm_jobid> 
To view detailed information of your job:  scontrol show job <your_slurm_jobid> 
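
For example, a typical job life cycle with these commands might look like the following (the script name and job ID are illustrative):

$ sinfo                      # check available partitions
$ sbatch my_job.sh           # submit the job script
Submitted batch job 522
$ squeue -u $USER            # check the status of your jobs
$ scontrol show job 522      # view detailed information of the job
$ scancel 522                # cancel the job if necessary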

 

Common SLURM Job Directives
Purpose  Options  Examples
Job name defined by user  --job-name  --job-name=myjob
Partition to which the job is allocated (intel/amd/gpu-a30/gpu-l20)  --partition  --partition=intel
Account to be charged for resources used  --account  --account=mygroup
Max execution time (walltime)  --time=D-HH:MM:SS  --time=1-01:10:30
Nodes required  --nodes  --nodes=2
Number of tasks (MPI workers) per node  --ntasks-per-node  --ntasks-per-node=4
Number of CPUs (OMP threads) per task  --cpus-per-task  --cpus-per-task=64
Number of CPUs (OMP threads) per GPU  --cpus-per-gpu  --cpus-per-gpu=16
GPUs per node  --gpus-per-node  --gpus-per-node=4
GPUs per task  --gpus-per-task  --gpus-per-task=1
Quality of Service  --qos  --qos=debug

*Note: When selecting a GPU partition, you must also use an option such as --gpus-per-node to request at least one GPU.
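
For example, following the note above, a job on a GPU partition should request at least one GPU; a minimal pair of directives might look like this (the partition name and GPU count are illustrative):

#SBATCH --partition=gpu-a30
#SBATCH --gpus-per-node=1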

 

You may check additional options in the SLURM documentation (for example, "man sbatch").

Use case example GPU (simple Python job)

The demo Python script is shown below for convenience; save it as matrix_inverse.py in your working directory:

import numpy as np

# Invert random matrices of increasing size
for N in (3, 10, 100):
    print(f"N={N}")
    X = np.random.randn(N, N)
    print("X =\n", X)
    print("Inverse(X) =\n", np.linalg.inv(X))

 

Then prepare a job script for submission, for example:

#!/bin/bash
#SBATCH --job-name=matinv              # create a short name for your job
#SBATCH --nodes=1                      # node count
#SBATCH --ntasks-per-node=1            # number of tasks per node (adjust when using MPI)
#SBATCH --cpus-per-task=4              # cpu-cores per task (>1 if multi-threaded tasks, adjust when using OMP)
#SBATCH --gpus-per-node=1              # Number of GPUs for the task
#SBATCH --time=01:20:00                # total run time limit (HH:MM:SS)
#SBATCH --partition=gpu-a30            # The partition (queue) where you submit
#SBATCH --account=<projectgroupname>   # Specify the project group account


### Your commands for the tasks
nvidia-smi
python matrix_inverse.py
###############################

 

  • The first line of a Slurm script specifies the Unix shell to be used.
  • This is followed by a series of #SBATCH directives which set the resource requirements and other parameters of the job.
  • --nodes sets the number of nodes for the job.
  • --ntasks-per-node sets the number of tasks per node (usually MPI ranks).
  • --cpus-per-task sets the number of CPU cores (usually OMP threads) per task. Some libraries, for example Python's NumPy, are affected by this option; see the sketch after this list. For GPU jobs, you may use --cpus-per-gpu instead.
  • --gpus-per-node sets the number of GPUs per node. You can also specify the GPU type with an option such as --gpus-per-node=a30:4; however, all GPUs on a single node are currently of the same type. Sometimes --gpus-per-task is useful for allocating GPUs to tasks. Using --gres is possible but not recommended.
  • --partition selects the partition (queue) to submit to.
  • Any necessary changes to the environment, such as loading modules, are made after the #SBATCH directives.
  • Lastly, the work to be done, which here is the execution of a Python script, is specified in the final line.
  • Run "sbatch <filename>" to submit the job.
  • "squeue" lets you check job status.
  • "scancel <jobid>" cancels a job.
  • "sinfo" shows node availability.

Use case example CPU (simple Python job)

A similar job script for a CPU partition:

#!/bin/bash
#SBATCH --job-name=matinv            # create a short name for your job
#SBATCH --nodes=1                    # node count
#SBATCH --ntasks-per-node=1          # number of tasks per node (adjust when using MPI)
#SBATCH --cpus-per-task=128          # cpu-cores per task (>1 if multi-threaded tasks, adjust when using OMP)
#SBATCH --time=01:20:00              # total run time limit (HH:MM:SS)
#SBATCH --partition=intel            # The partition (queue: intel/amd/gpu-a30/gpu-l20) where you submit
#SBATCH --account=<pi_account_name>  # Specify the project group account


### Your commands for the tasks
python matrix_inverse.py
###############################

Result of squeue:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               522     intel   matinv kcalexla  R       5:17      1 cpu01

 

  • Use "cat" to print output from slurm-jobid.out
  • You may download the following tools for Job script generation generator(https://github.com/BYUHPC/BYUJobScriptGenerator.git)
N=3
X =
 [[-0.74054344 -0.33695325 -1.80687036]
 [-0.23310079  0.41634362  2.12752795]
 [-1.43863402 -0.96117331  0.38851044]]
Inverse(X) =
 [[-1.04068167 -0.88078301 -0.01669555]
 [ 1.40075039  1.3615892  -0.94165995]
 [-0.3881393   0.10707253  0.1824476 ]]

Additional Slurm script examples

Example 1: create a SLURM script that runs 2 applications in parallel (each application uses 1 GPU device and 16 CPU cores).

#!/bin/bash

# NOTE: Lines starting with "#SBATCH" are valid SLURM options.
#       Lines starting with "#" or "##SBATCH" are comments.
#       Uncomment a "##SBATCH" line (i.e. remove one "#" so that it
#       starts with "#SBATCH") to turn the comment into a SLURM option.

#SBATCH --job-name=slurm_job                # Slurm job name
#SBATCH --time=12:00:00                     # Set the maximum runtime
#SBATCH --partition=<partition_to_use>      # Choose partition
#SBATCH --account=<account_name>            # Specify project account

# Resource allocation 
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=2 
#SBATCH --cpus-per-gpu=16
#SBATCH --gpus-per-node=2

# Uncomment to enable email notifications
# Remember to update your email address
##SBATCH --mail-user=user_name@ust.hk 
##SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE # Feel free to remove any


# Setup runtime environment if necessary
# For example,
module load openmpi

# Go to the job submission directory and run your application
cd $HOME/apps/myapp

# Execute applications in parallel
srun --ntasks=1 --gpus-per-task=1 --cpus-per-gpu=16 myapp1 &    # Run "myapp1" with 1 task, 1 GPU device and 16 CPU cores
srun --ntasks=1 --gpus-per-task=1 --cpus-per-gpu=16 myapp2 &    # Run "myapp2" with the same resources in parallel

wait                                                            # Wait for both applications to finish

 

Example 2: create a SLURM script for a GPU application.

#!/bin/bash 
# NOTE: Lines starting with "#SBATCH" are valid SLURM options.
#       Lines starting with "#" or "##SBATCH" are comments.
#       Uncomment a "##SBATCH" line (i.e. remove one "#" so that it
#       starts with "#SBATCH") to turn the comment into a SLURM option.

#SBATCH --job-name=slurm_job                 # Slurm job name 
#SBATCH --time=12:00:00                      # Set the maximum runtime 
#SBATCH --partition=<partition_to_use>       # Choose partition
#SBATCH --account=<account_name>             # Specify project account

# Resource allocation
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --gpus-per-node=4

# Uncomment to enable email notifications
# Remember to update your email address 
##SBATCH --mail-user=user_name@ust.hk 
##SBATCH --mail-type=BEGIN,END,FAIL,REQUEUE # Feel free to remove any 

# Setup runtime environment if necessary
# For example,
module load openmpi

# Go to the job submission directory and run your application 

cd $HOME/apps/slurm
./your_gpu_application

 

Interactive job

The basic procedure is:

  1. Log in to an HPC machine
  2. Request compute resources using
    • srun (runs in the current terminal once resources are allocated), or
    • salloc (allocates resources; you then ssh into the allocated node manually, see the sketch after this list).
  3. For example:
    $ srun --partition=gpu-a30 --gpus-per-node=4 --account=<projectgroupname> --pty bash
    
    All #SBATCH options are valid srun/salloc options. "--pty bash" means obtain an interactive shell. You may also execute a command directly:
    $ srun --partition=gpu-a30 --gpus-per-node=4 --account=<projectgroupname> ./myapp
  4. Start your program
    [user@gpu01 ~]$ python
    Python 3.12.4 (main, Jun 24 2024, 22:04:18) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy
    >>> 1+1
    2
    >>>
    
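A sketch of the salloc workflow mentioned in step 2 (the node name and job ID are illustrative):

$ salloc --partition=gpu-a30 --gpus-per-node=1 --account=<projectgroupname>
salloc: Granted job allocation 560
$ squeue -u $USER        # find the node allocated to the job, e.g. gpu01
$ ssh gpu01              # log in to the allocated node
[user@gpu01 ~]$ ./myapp
[user@gpu01 ~]$ exit
$ exit                   # release the allocation
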
Check GPU usage for the job

Use this command:

srun --jobid=<jobid> -w <nodelist> --overlap --pty bash -i

Replace <jobid> and <nodelist> with those of the job you want to check; squeue shows both, for example:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               548       gpu     bash kcalexla  R       5:13      1 gpu01

Inside the overlap shell, run nvidia-smi to check the GPU utilization and memory usage of the job, for example:

Tue Jun 25 09:57:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L20                     On  | 00000000:17:00.0 Off |                    0 |
| N/A   28C    P8              23W / 350W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  0      0    0     814970     C                                                6400MiB|
+---------------------------------------------------------------------------------------+

 

 
