Partitions
HPC4 provides several partitions (queues) that target different compute nodes and allow different job sizes and wall times. Limits also apply to the number of running and queued jobs per user and per queue. Queues and limits are subject to change based on demand and workload distribution. An example batch script follows the table below.
| Partition | Max resources per user | Max nodes per job | Max duration (wall time) | Max running jobs | Max running + submitted jobs |
|---|---|---|---|---|---|
| intel | 256 cores | 2 nodes | 120 hours | 10 | 15 |
| amd | 1024 cores | 4 nodes | 120 hours | 20 | 30 |
| gpu-a30 | 16 GPUs (A30) | 4 nodes | 72 hours | 4 | 8 |
| gpu-l20 | 8 GPUs (L20) | 2 nodes | 72 hours | 2 | 4 |
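A minimal batch-script sketch that targets one of the partitions above is shown below. The partition names and limits come from the table; the job name, task counts, and program to run are placeholders you should replace with your own values, and the cores-per-node figure is an assumption to be checked against the actual node specifications.

```bash
#!/bin/bash
#SBATCH --job-name=my_job          # hypothetical job name
#SBATCH --partition=amd            # partition from the table above
#SBATCH --nodes=2                  # within the 4-node-per-job limit for amd
#SBATCH --ntasks-per-node=64       # assumed core count per node; verify against node specs
#SBATCH --time=24:00:00            # within the 120-hour wall-time limit

# Hypothetical application launch; replace with your own program
srun ./my_program
```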
Quality of Service (QoS)
Resource limits for jobs are controlled by the Quality of Service (QoS) in Slurm. A set of default QoS has been configured, dictating the limits on the resources and partitions that a job is entitled to request. In general, users do not need to deal with this unless they have been approved to use another QoS for a specific scenario; in that case, the QoS parameter must be supplied in the job request. One example is the interactive job for debugging purposes, which is available in all partitions (see the example after the table below).
| QoS | Supported partitions | Max duration | Max resources per job | Max running jobs | Max running + submitted jobs |
|---|---|---|---|---|---|
| debug | intel, amd, gpu-a30, gpu-l20 | 4 hours | 2 nodes, 2 GPUs | 1 | 1 |
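A sketch of requesting an interactive debugging session under the debug QoS is shown below. The partition and QoS names come from the tables above; the requested node, GPU count, and wall time are example values chosen to stay within the debug limits, and the exact GRES syntax for GPUs may differ on this cluster.

```bash
# Request an interactive shell on one gpu-a30 node under the debug QoS
# (1 node, 1 GPU, 1 hour; within the 2-node / 2-GPU / 4-hour debug limits)
srun --partition=gpu-a30 --qos=debug --nodes=1 --gres=gpu:1 \
     --time=01:00:00 --pty bash -i
```

The same `--qos=debug` flag can be added to a batch script (`#SBATCH --qos=debug`) when a short non-interactive test run is sufficient.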