Cluster Resource Limits

Accounting

Cluster accounts are organized around the principal investigator (PI) or group leader of a research team. Each member of the team has an individual user account under the PI's group, which is used to access the cluster and run jobs on the partitions (queues) through SLURM. With this accounting scheme, the system can impose resource limits (usage quotas) on different partitions for different groups of users.

Resource limits

Each compute node provides processors, memory, swap and local disk as resources. Our cluster resource allocation is based on CPU cores only; in particular, a core cannot run more than one job in a partition at a time. If a job needs a number of nodes to itself, the exclusive option can be specified in the SLURM script (see the sketch below). The resource limits on partitions are imposed on the PI group as a whole, which means that individual users in the same group share the quota.
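A minimal sketch of such a submission script is shown below; the job name, wall time and executable are placeholders to be adapted to your own workload.

    #!/bin/bash
    #SBATCH --job-name=exclusive-demo   # placeholder job name
    #SBATCH --partition=x-gpu           # partition from the table below
    #SBATCH --nodes=2                   # request two whole nodes
    #SBATCH --exclusive                 # do not share the allocated nodes with other jobs
    #SBATCH --time=24:00:00             # placeholder wall time, within the partition limit

    srun ./my_program                   # placeholder executable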

Partitions

Partition | No. of Nodes | CPU | Memory | Coprocessor
x-gpu | 30 | 2 x Intel Xeon Silver 4210 (2 x 10-core) | 256 GB | RTX-2080Ti
x-gpu | 5 | 2 x Intel Xeon Silver 4210 (2 x 10-core) | 256 GB | RTX6000
x-gpu | 3 | 2 x Intel Xeon Silver 4210 (2 x 10-core) | 512 GB | RTX6000
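The current state of these nodes can be checked from the login node with the standard SLURM query commands, for example:

    sinfo -p x-gpu                  # list nodes and their states in the x-gpu partition
    scontrol show partition x-gpu   # show the partition configuration and limits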

For the quota terminology, please refer here.

Job Scheduling

Currently, SLURM jobs are scheduled with priority based on the contribution of HPC hardware, i.e. the X-GPU cluster PI/Co-PI groups have higher priority to use the hardware they contributed, and idle hardware can be used by others.

PI/Co-PI partition scheduling limits (for the PI groups of the CRF project)

Partition | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
x-gpu | 20 | 4 | 20 | 7 days
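As an illustration of working within these group limits, a PI-group job could request a wall time below the 7-day MaxWallTime and a single GPU; the account name and workload below are placeholders.

    #!/bin/bash
    #SBATCH --partition=x-gpu        # PI/Co-PI partition
    #SBATCH --account=pi_group       # placeholder: your PI group account
    #SBATCH --gres=gpu:1             # request one GPU on the node
    #SBATCH --time=6-00:00:00        # 6 days, within the 7-day MaxWallTime

    srun ./train.sh                  # placeholder workload

The group's running and pending jobs, which count towards GrpJobs and GrpSubmitJobs, can be listed with squeue -A pi_group (again using the placeholder account name).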

Community partition scheduling limits (for the use of other groups' contributed hardware)

To maximize the usage of computational resources, ITSC has configured a community cluster strategy such that idle resources on the X-GPU cluster can be used by all researchers. The community cluster can be accessed via the partition listed below. Jobs submitted to this partition are scheduled ONLY when there are idle resources, and the maximum wall time is 48 hours. The usage quota is summarized as follows.

Please note that when using the community partition, a running job may be requeued when the system has no idle nodes and another PI group needs the cluster to run their jobs. There is a GraceTime before the job is requeued. You are advised to checkpoint your job regularly so that it can be restarted from the checkpoint (see the sketch after the table below).

Partition | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime | GraceTime before requeue
x-gpu-share | 60 | 12 | 60 | 48 hours | 1 hour
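A minimal sketch of a requeue-tolerant submission for the community partition is shown below. The checkpoint file name and the solver's restart flag are placeholders, and whether a preempted job is requeued automatically depends on the site's preemption settings; adapt the checkpoint logic to your own application.

    #!/bin/bash
    #SBATCH --partition=x-gpu-share   # community partition
    #SBATCH --time=48:00:00           # 48-hour maximum wall time
    #SBATCH --requeue                 # allow SLURM to requeue the job if it is preempted
    #SBATCH --open-mode=append        # keep output from earlier runs when the job is requeued

    # Placeholder workload: resume from the latest checkpoint if one exists,
    # so that a requeued run continues instead of starting over.
    if [ -f checkpoint.dat ]; then
        srun ./solver --restart checkpoint.dat
    else
        srun ./solver
    fi

If the application supports signal-triggered checkpoints, SLURM's --signal option can additionally be used to request a warning signal before the grace period expires.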