Slurm Partitions, QoS and Resource Limits

Partitions

HPC4 provides different partitions (queues) for different groups of compute nodes, each with its own limits on job size and wall time. There are also per-user, per-partition limits on the number of jobs that can be queued and running. Partitions and limits are subject to change based on demand and workload distribution.

Partition    Max resources    Max nodes    Max duration    Max running    Max running +
             per User         per Job      (wall time)     jobs           submitted jobs
intel        256 cores        2 nodes      120 hours       10             15
amd          1536 cores       6 nodes      120 hours       20             30
gpu-a30      16 GPUs (A30)    4 nodes      72 hours        4              8
gpu-l20      8 GPUs (L20)     2 nodes      72 hours        2              4
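
For illustration, a minimal batch script requesting resources within the intel partition limits above might look like the sketch below. The job name, task count, and program name are hypothetical placeholders rather than site-mandated values.

    #!/bin/bash
    #SBATCH --job-name=my_job           # illustrative job name
    #SBATCH --partition=intel           # partition from the table above
    #SBATCH --nodes=2                   # at most 2 nodes per job on intel
    #SBATCH --ntasks-per-node=32        # illustrative; total cores must stay within the 256-core per-user limit
    #SBATCH --time=120:00:00            # wall time, up to 120 hours on intel

    srun ./my_program                   # replace with your actual application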


Quality of Service (QoS)

Resource limits for jobs are controlled by the Quality of Service (QoS) mechanism in Slurm. A set of default QoS has been configured, dictating the resources and partitions that a job is entitled to request. In general, users do not need to deal with QoS unless they have been approved to use a non-default QoS, in which case the QoS parameter must be supplied in the job request. One example is the debug QoS for interactive debugging jobs, which is available in all partitions.

QoS      Supported Partitions                Max duration    Max resources      Max running    Max running +
                                                             per Job            jobs           submitted jobs
debug    intel, amd, gpu-a30, gpu-l20        4 hours         2 nodes, 2 GPUs    1              1
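
As a sketch, an interactive debugging session under the debug QoS could be requested with srun as shown below; the chosen partition, resource amounts, and the generic resource name "gpu" are assumptions, and requests must stay within the debug limits in the table above.

    # Interactive shell on the intel partition under the debug QoS
    # (debug limits: up to 2 nodes, 2 GPUs, 4 hours, 1 running job)
    srun --partition=intel --qos=debug --nodes=1 --ntasks=4 --time=04:00:00 --pty bash

    # Interactive GPU session on gpu-a30 (the "gpu" GRES name is an assumption
    # about the site configuration)
    srun --partition=gpu-a30 --qos=debug --gres=gpu:1 --time=02:00:00 --pty bash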