Slurm Partitions, QoS and Resource Limits

Partitions

HPC4 provides different partitions (queues) for different groups of compute nodes, each with its own limits on job size and wall time. There are also per-user, per-partition limits on the number of jobs that can be queued and running. Partitions and limits are subject to change based on demand and workload distribution.

Partition    Max resources    Max nodes    Max duration    Max running    Max running +
             per User         per Job      (wall time)     jobs           submitted jobs
intel        256 cores        2 nodes      120 hours       10             15
amd          1536 cores       6 nodes      120 hours       20             30
gpu-a30      16 GPUs (A30)    4 nodes      72 hours        4              8
gpu-l20      8 GPUs (L20)     2 nodes      72 hours        2              4
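
For illustration, a minimal batch script requesting resources within the intel partition limits above might look like the sketch below. The job name, task count, and program name are hypothetical placeholders rather than site-mandated values.

    #!/bin/bash
    #SBATCH --job-name=my_job           # illustrative job name
    #SBATCH --partition=intel           # partition from the table above
    #SBATCH --nodes=2                   # at most 2 nodes per job on intel
    #SBATCH --ntasks-per-node=32        # illustrative; total cores must stay within the 256-core per-user limit
    #SBATCH --time=120:00:00            # wall time, up to 120 hours on intel

    srun ./my_program                   # replace with your actual application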


Quality of Service (QoS)

Resource limits for jobs are controlled by the Quality of Service (QoS) mechanism in Slurm. A set of default QoS has been configured, dictating the resources and partitions that a job is entitled to request. In general, users do not need to deal with QoS unless they have been approved to use a non-default QoS, in which case the QoS parameter must be supplied in the job request. One example is the debug QoS for interactive debugging jobs, which is available in all partitions.

QoS      Supported Partitions                Max duration    Max resources      Max running    Max running +
                                                             per Job            jobs           submitted jobs
debug    intel, amd, gpu-a30, gpu-l20        4 hours         2 nodes, 2 GPUs    1              1
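
As a sketch, an interactive debugging session under the debug QoS could be requested with srun as shown below; the chosen partition, resource amounts, and the generic resource name "gpu" are assumptions, and requests must stay within the debug limits in the table above.

    # Interactive shell on the intel partition under the debug QoS
    # (debug limits: up to 2 nodes, 2 GPUs, 4 hours, 1 running job)
    srun --partition=intel --qos=debug --nodes=1 --ntasks=4 --time=04:00:00 --pty bash

    # Interactive GPU session on gpu-a30 (the "gpu" GRES name is an assumption
    # about the site configuration)
    srun --partition=gpu-a30 --qos=debug --gres=gpu:1 --time=02:00:00 --pty bash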