Slurm Partition and Resource Quota

The system supports two types of user accounts: project-based accounts and individual student accounts with an approved UROP.

  • Project-based accounts
    • Allowed to access more computational resources, with the allocation granted during project approval
    • Computational resources are shared among all group members of the project
    • Usage accounting for computational resources is implemented; details will be announced later
    • Provide shared storage space for the group
       
  • Individual student accounts
    • Computational resources are allocated to each student individually
    • No usage accounting
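
Which account type you hold determines the partitions and quotas available to you. Assuming the standard Slurm accounting commands are enabled on this cluster (an assumption, not a guarantee), you can check your own account association and the partitions visible to you with:

    # List the Slurm account(s) your user is associated with
    sacctmgr show associations user=$USER format=Account,User,Partition,QOS

    # Summarise the partitions you can see and their time limits
    sinfo -s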

 

Resource Request

  • Resource requests are counted in GPU Resource Units (GRU). Each GRU is associated with a different maximum number of CPU cores and amount of system memory, depending on the Slurm partition (a sample batch script is sketched after this list).
     
  • For the project & large-project partitions, 1 GRU corresponds to
    • One H800 GPU with 80GB GPU memory
    • 14 CPU cores with 28 threads
    • 224GB system memory

 

  • For the student partition, each H800 GPU is partitioned into GPU instances of different sizes using NVIDIA MIG technology, with 1 GRU corresponding to a 3g.40gb, 4g.40gb or 7g.80gb MIG device
    • For 3g.40gb, 1 GRU is
      • 3/7 of one H800 GPU's computational power, with 40GB GPU memory
    • For 4g.40gb, 1 GRU is
      • 4/7 of one H800 GPU's computational power, with 40GB GPU memory
    • For 7g.80gb, 1 GRU is
      • equivalent to a whole H800 GPU in both computational power and GPU memory
    • 8 CPU cores with 16 threads
    • 160GB system memory
       
  • For the debug partition, 1 GRU corresponds to
    • One H800 GPU with 80GB GPU memory
    • 14 CPU cores with 28 threads
    • 224GB system memory
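
As an illustration only, a batch script requesting 1 GRU on the project partition might look like the sketch below. The partition name and the GRES name gpu are assumptions based on common Slurm setups, and train.py is a placeholder; the CPU, memory and wall-time figures come from the per-GRU limits above.

    #!/bin/bash
    #SBATCH --partition=project    # project / large-project partitions use full H800 GPUs
    #SBATCH --gres=gpu:1           # 1 GRU = one H800 GPU with 80GB GPU memory
    #SBATCH --cpus-per-task=14     # 14 CPU cores (28 hardware threads) per GRU
    #SBATCH --mem=224G             # 224GB system memory per GRU
    #SBATCH --time=3-00:00:00      # within the 3-day max wall time of the project partition

    srun python train.py           # placeholder for the actual workload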

 

Partition Table

  • project & large-project partitions
    • No. of DGX nodes: 52
    • Who can access: project-based users only
    • Purpose: computation
    • Max wall time: 3 days
    • Max resources requested: varies with the project; the default is 8 GRU
    • Concurrent running jobs quota per user: 8
    • Queuing and running jobs limit per user: 10
    • Usage accounting: yes
    • Job preemption: in the large-project partition, jobs from approved projects can preempt other jobs; a job is allowed to run for at least 2 hours before it can be preempted
    • Remarks: resource quotas are per-project unless specified otherwise

  • student partition
    • No. of DGX nodes: 2, with GPUs MIG-partitioned
    • Who can access: non-project-based student users only
    • Purpose: computation
    • Max wall time: 1 day
    • Max resources requested: 1 GRU
    • Concurrent running jobs quota per user: 1
    • Queuing and running jobs limit per user: 2
    • Usage accounting: no
    • Job preemption: no
    • Remarks: resource quotas are per-user instead of per-project

  • debug partition
    • No. of DGX nodes: 1
    • Who can access: all users
    • Purpose: compiling, building containers, interactive debugging, code profiling
    • Max wall time: 2 hours
    • Max resources requested: 1 GRU
    • Concurrent running jobs quota per user: 1
    • Queuing and running jobs limit per user: 1
    • Usage accounting: no
    • Job preemption: no
    • Remarks: resource quotas are per-user instead of per-project

  • cpu partition
    • No. of nodes: 2 CPU nodes
    • Who can access: project-based users only
    • Purpose: data pre-processing for GPU computation
    • Max wall time: 12 hours
    • Max resources requested: 8 CPU cores per job
    • Concurrent running jobs quota per user: 28
    • Queuing and running jobs limit per user: 28
    • Usage accounting: no
    • Job preemption: no
    • Remarks: resource quotas are per-project unless specified otherwise; no access to the /scratch directory
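
For reference, hedged one-line submissions for the other partitions are sketched below. The MIG GRES string (gpu:3g.40gb) and the script names job.sh and preprocess.sh are assumptions; the actual GRES names exposed for the MIG devices depend on the site configuration.

    # student partition: one MIG-backed GRU (8 cores, 160GB), within the 1-day limit
    sbatch --partition=student --gres=gpu:3g.40gb:1 --cpus-per-task=8 --mem=160G --time=1-00:00:00 job.sh

    # debug partition: short interactive session for compiling or debugging (2-hour limit)
    srun --partition=debug --gres=gpu:1 --cpus-per-task=14 --mem=224G --time=02:00:00 --pty bash

    # cpu partition: CPU-only data pre-processing, up to 8 cores per job (12-hour limit)
    sbatch --partition=cpu --cpus-per-task=8 --time=12:00:00 preprocess.sh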