Cluster Resource Limits

Accounting

The cluster account is organized under the principal investigator (PI) or group leader of a research team. Each member of the team has an individual user account under the PI's group to access the cluster and run jobs on the partitions (queues) with SLURM. With this accounting scheme, the system can impose resource limits (usage quotas) on different partitions for different groups of users.

Resource limits

A compute node has processors, memory, swap, and local disk as resources. Our cluster resource allocation is based on CPU cores only; in particular, no core can run more than one job in a partition at a time. If a job needs one or more nodes exclusively, the user can specify the exclusive option in the SLURM script, as sketched below. Resource limits on partitions are imposed on the PI group as a whole, which means that individual users in the same group share the quota limit.
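
As a minimal sketch (the partition name, wall time, and program shown here are placeholders for your own values), a SLURM batch script that requests two whole nodes exclusively could look like this:

    #!/bin/bash
    #SBATCH --job-name=exclusive-demo   # job name shown in squeue
    #SBATCH --partition=cpu             # placeholder: use your group's partition
    #SBATCH --nodes=2                   # request two whole nodes
    #SBATCH --exclusive                 # do not share these nodes with other jobs
    #SBATCH --time=04:00:00             # requested wall time
    #SBATCH --output=%x-%j.out          # write output to <jobname>-<jobid>.out

    # Launch one task per allocated node
    srun --ntasks-per-node=1 hostname

Submit the script with sbatch; the --exclusive flag tells SLURM not to place any other jobs on the allocated nodes.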

Partitions

Ownership of the HPC3 compute nodes is distributed among individual PIs. The contributed partitions and their configurations are summarized as follows.

Partition   No. of Nodes   CPU                                      Memory   Coprocessor
cpu         128            2 x Intel Xeon Gold 6230 (2 x 20-core)   192 GB   -
himem       5              2 x Intel Xeon Gold 6230 (2 x 20-core)   1.5 TB   -
gpu         14             2 x Intel Xeon Gold 6244 (2 x 8-core)    384 GB   8 x Nvidia GeForce RTX-2080Ti
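
As a quick check (the exact output depends on the site's SLURM configuration), the partitions and the resources of their nodes can be listed after login with sinfo, for example:

    # partition name, number of nodes, CPUs per node, memory per node (MB)
    sinfo -o "%P %D %c %m"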

For the quota terminology, please refer here.

Job Scheduling

Currently, SLURM jobs are scheduled with priority based on contribution of HPC hardware: HPC3 cluster hardware contributors have higher priority to use their contributed hardware, and idle hardware can be used by others.

Contributor's partition scheduling limit (for PI groups with contributed hardware)

For PIs or departments that have contributed hardware to the HPC3 cluster, each PI decides the scheduling limits of their respective PI group. Details can be reviewed after logging in to the system.
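
As a hedged example (field names such as GrpNodes vary between SLURM versions, so adjust the format list as needed), the limits associated with your own account can be inspected with sacctmgr after login:

    # Show the association limits for your user account
    sacctmgr show association where user=$USER format=Account,User,Partition,GrpJobs,GrpSubmit,MaxWall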

Community partition scheduling limit (for the use of other groups' contributed hardware)

In order to maximize the usage of computational resources, ITSC has configured a community cluster strategy so that idle resources on the HPC3 cluster can be used by all researchers. The community cluster is accessed via the share partitions summarized in the table below. Jobs submitted to these partitions are scheduled ONLY when there are idle resources, and the maximum wall time is 24 hours. Usage of the community cluster is open to all researchers, and the usage quota is summarized as follows.

Please note that when using non-contributed hardware, there is a chance that a running job will be requeued when the system has no idle nodes and an HPC hardware contributor needs to use the cluster for their own jobs. There is a GraceTime before the job is requeued. You are advised to implement regular checkpointing in your job so that it can be restarted from the last checkpoint; an example script is sketched after the quota table below.

Partition     GrpJobs (Max)   GrpNodes (Max)   GrpSubmitJobs (Max)   MaxWallTime   GraceTime before requeue
cpu-share     6               6                6                     24 hours      1 hour
himem-share   2               2                2                     24 hours      1 hour
gpu-share     2               2                2                     24 hours      1 hour
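
As an illustrative sketch only (my_app, my_checkpoint_save, and checkpoint.dat are hypothetical placeholders; substitute your own application's checkpoint and restart mechanism), a job script for the community cpu-share partition that tolerates being requeued could look like this:

    #!/bin/bash
    #SBATCH --job-name=community-demo
    #SBATCH --partition=cpu-share       # community partition, scheduled only on idle nodes
    #SBATCH --nodes=1
    #SBATCH --time=24:00:00             # maximum wall time on the share partitions
    #SBATCH --requeue                   # allow SLURM to requeue the job after preemption
    #SBATCH --signal=B:SIGTERM@300      # also signal the batch shell 5 minutes before the time limit

    # Save application state when the job is signalled (preemption or approaching time limit)
    trap 'my_checkpoint_save; exit 143' SIGTERM

    # Hypothetical application that resumes from its last checkpoint file
    srun my_app --resume-from=checkpoint.dat &
    wait

Running the application in the background and calling wait lets the shell trap react to SLURM's signal during the GraceTime, so state can be saved before the job is requeued.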