Slurm Partition and Resource Quota

A partition is a work queue with its own set of rules and policies and its own set of compute nodes on which jobs run. The available partitions are normal, large, and cpu. Run sinfo to list the partitions available on the cluster.
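
As a minimal sketch of inspecting partitions with sinfo, the commands below use the standard --summarize and --partition options. The wrapper function list_partitions is a hypothetical helper added for this example so that the snippet fails gracefully where the Slurm client tools are not installed. The output depends on the current cluster state and is not reproduced here.

```shell
#!/bin/bash
# Hypothetical helper: list partitions only if the Slurm client tools
# are on PATH (for example, on a login node).
list_partitions() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: run this on a login node" >&2
        return 1
    fi
    # One summary line per partition: availability, time limit, node counts.
    sinfo --summarize
    # Node states for a single partition, here the normal partition.
    sinfo --partition=normal
}

list_partitions || true
```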

Resource Request Policy

  • Computational resources on the HKUST SuperPOD are requested in units of H800 (80GB) GPUs. In Slurm, each GPU is associated with the following default CPU cores and system memory:
    • 14 CPU cores (28 threads)
    • 224GB system memory
  • In general, we recommend specifying only the --gpus parameter (number of GPUs) and the --nodes parameter (number of nodes) in a job request, and letting Slurm allocate cores and memory across the nodes for optimal resource utilization.
  • The normal partition supports mainstream GPU jobs that request from a single H800 GPU up to a maximum of 16 GPUs.
  • The large partition supports large multi-node jobs. Requests must be in multiples of 8 H800 GPUs, i.e. full nodes. The minimum request is 2 nodes (16 GPUs) and the maximum is 12 nodes (96 GPUs).
  • The number of nodes assigned to the large and normal partitions may vary with workload conditions.
  • Jobs that need a very large number of nodes, i.e. more than 12 nodes, can only be arranged by reservation.
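
As an illustration of the recommendation above, a minimal batch script sketch for the normal partition might look like the following. The job name, the wall time, the helper implied_resources, and the echo payload are placeholders invented for this example, not site requirements. Only the --partition, --nodes, and --gpus values reflect the policy described above, and the 14 cores and 224GB per GPU are the defaults listed above, used here only to print what a request implies.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=h800-demo
# The normal partition accepts 1 to 16 GPUs per the policy above.
#SBATCH --partition=normal
# Number of nodes and number of H800 GPUs; cores and memory are left to Slurm.
#SBATCH --nodes=1
#SBATCH --gpus=2
# Example wall time, well below the 3-day limit.
#SBATCH --time=02:00:00

# Defaults associated with each GPU (from the policy above).
CORES_PER_GPU=14
MEM_GB_PER_GPU=224

# Print the CPU cores and memory implied by a GPU count.
implied_resources() {
    local gpus=$1
    echo "$((gpus * CORES_PER_GPU)) cores, $((gpus * MEM_GB_PER_GPU)) GB memory"
}

# SLURM_JOB_ID is set by Slurm inside a job; fall back when run by hand.
echo "Job ${SLURM_JOB_ID:-not-in-slurm} on $(uname -n)"
echo "2 GPUs imply: $(implied_resources 2)"

# Replace the echo lines with your workload, e.g. an srun command.
```

The script would be submitted with sbatch. When run directly with bash, the #SBATCH lines are ordinary comments, so the script only prints the implied resources.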

 

Partition Table

| Slurm Partition | large | normal | cpu |
|---|---|---|---|
| No. of nodes | 35 DGX nodes | 20 DGX nodes | 2 Intel nodes |
| Purpose | For large scale GPU computation with multi-nodes | For mainstream GPU computation | Data pre-processing for GPU computation |
| Max Wall Time | 3 days | 3 days | 12 hours |
| Min resource requested per job | 16 GPUs (or equivalent to 2 nodes) | 1 GPU | 1 CPU core |
| Max resource requested per account | 96 GPUs (or equivalent to 12 nodes) | 16 GPUs | 8 CPU cores (per job) |
| Concurrent running jobs quota per user | 4 | 8 | 28 |
| Queuing and running jobs limit per user | 5 | 10 | 28 |
| Chargeable | Yes | Yes | No |
| Interactive job | Allow one session with maximum 8 hours wall time | Allow one session with maximum 8 hours wall time | Not allowed |
| Remarks | GPU resources must be requested in multiple of 8 (full node) | GPU resources can be requested in any quantity not more than max | No access to the /scratch directory for the time being |
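
The GPU limits for the two GPU partitions can be checked before submitting a job. The function below is a minimal sketch, not a site-provided tool: check_gpu_request is a hypothetical name, it covers only the normal and large partitions, and it assumes the GPU count is passed as a whole number. The limits it encodes (1 to 16 GPUs for normal; multiples of 8 from 16 to 96 GPUs for large) are taken from the policy and table above.

```shell
#!/bin/bash
# Hypothetical helper: check a GPU request against the per-partition limits.
# Prints "ok" or a reason, and returns non-zero when the request is invalid.
check_gpu_request() {
    local partition=$1 gpus=$2
    case "$partition" in
        normal)
            if [ "$gpus" -lt 1 ] || [ "$gpus" -gt 16 ]; then
                echo "normal: request 1 to 16 GPUs"; return 1
            fi ;;
        large)
            if [ $((gpus % 8)) -ne 0 ]; then
                echo "large: GPUs must be a multiple of 8 (full nodes)"; return 1
            fi
            if [ "$gpus" -lt 16 ] || [ "$gpus" -gt 96 ]; then
                echo "large: request 16 to 96 GPUs (2 to 12 nodes)"; return 1
            fi ;;
        *)
            echo "partition not covered by this sketch: $partition"; return 1 ;;
    esac
    echo "ok"
}

check_gpu_request normal 4            # within 1 to 16 GPUs
check_gpu_request large 24            # 3 full nodes
check_gpu_request large 20 || true    # rejected: not a multiple of 8
```

Requests above 12 nodes are rejected by this check because, per the policy above, they are arranged by reservation only.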