Cluster Resource Limits

Accounting

Cluster accounts are organized around the principal investigator (PI) or group leader of a research team. Each member of the team has an individual user account under the PI's group, which is used to access the cluster and run jobs on the partitions (queues) with SLURM. With this accounting scheme, the system can impose resource limits (usage quotas) on different partitions for different groups of users.

Resource Limits

Each compute node provides processors, memory, swap and local disk as resources. Our cluster resource allocation is based on CPU cores only; in particular, no core can run more than one job at a time. If a job needs one or more nodes exclusively, the exclusive option can be specified in the SLURM script, as shown in the sketch below. The resource limits on partitions are imposed on the PI group as a whole, which means that individual users in the same group share the quota.
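
For example, a minimal SLURM batch script that requests whole nodes exclusively might look like the following sketch; the job name, partition, node count, wall time and application command are placeholders to adapt to your own job:

    #!/bin/bash
    #SBATCH --job-name=exclusive_demo   # placeholder job name
    #SBATCH --partition=standard        # pick a partition your group can access
    #SBATCH --nodes=2                   # number of whole nodes requested
    #SBATCH --exclusive                 # do not share the allocated nodes with other jobs
    #SBATCH --time=1-00:00:00           # wall time (1 day), within the partition limit

    srun ./my_application               # replace with your own command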

Partitions

Ownership of the HPC2 compute nodes is diversified. The partitions and their resource limits are summarized in the tables below.

Hardware configuration of the partitions:

    Partition | No. of Nodes | CPU | Memory | Coprocessor
    standard | 20 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
    himem | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
    gpu | 5 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
    ssci | 15 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
    cbme | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
    ce | 2 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
    ch | 10 | 2 x Intel Xeon E5-2670 v3 (12-core) | 64G DDR4-2133 | --
    ch1 | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
    cse | 1 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | 2 x Nvidia Tesla K80
    ece | 1 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --
    ias | 6 | 2 x Intel Xeon E5-2670 v3 (12-core) | 256G DDR4-2133 | --
    lifs | 8 | 2 x Intel Xeon E5-2650 v4 (12-core) | 128G DDR4-2400 | --
    ph | 3 | 2 x Intel Xeon E5-2670 v3 (12-core) | 128G DDR4-2133 | --
    sbm | 4 | 2 x Intel Xeon E5-2683 v4 (16-core) | 128G DDR4-2400 | --

Group resource limits on the standard, himem, gpu and ssci partitions:

    Partition | No. of Nodes | Access (SSCI/SENG) | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
    standard | 20 | Both | 4 | 5 | 4 | 3 days
    himem | 15 | Both | 3 | 5 | 3 | 3 days
    gpu | 5 | GPU user | 2 | 2 | 2 | 3 days
    ssci | 15 | SSCI only | 3 | 5 | 3 | 3 days

Per-user resource limits on the remaining partitions:

    Partition | No. of Nodes | MaxCPUs/User | MaxJobs/User | MaxSubmit/User | MaxWallTime
    cbme | 1 | 24 | 10 | 10 | 60 days
    ce | 2 | 48 | 4 | 4 | 20 days
    ch | 10 | 96 | 3 | 4 | 7 days
    ch1 | 1 | 24 | 1 | 3 | 7 days
    cse | 1 | -- | -- | -- | --
    ece | 1 | 32 | 10 | 10 | 30 days
    ias | 6 | 144 | 10 | 50 | 15 days
    lifs | 8 | 192 | 8 | 10 | 5 days
    ph | 3 | 72 | 72 | 108 | 30 days
    sbm | 4 | 128 | 4 | 8 | 15 days
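
To check the current state of a partition and the limits applied to your own account, the standard SLURM commands below can be used; the standard partition is shown purely as an example, and the exact limit fields reported depend on how the site has configured them:

    sinfo -p standard                   # nodes and their state in the partition
    scontrol show partition standard    # detailed partition settings, including MaxTime
    sacctmgr show assoc user=$USER format=user,account,partition,grpjobs,grpsubmit,maxjobs,maxsubmit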

For the quota terminology, please refer here.

Job Scheduling

Currently, SLURM jobs are scheduled with basic priority, i.e. first in, first out (FIFO) according to the order of arrival.
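
To see where your jobs stand in the queue, and the scheduler's estimated start times for pending jobs where available, squeue can be used:

    squeue -u $USER                     # your running and pending jobs
    squeue -u $USER --start             # estimated start times of pending jobs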

Community Cluster

To maximize the usage of computational resources, ITSC has configured a community cluster strategy so that idle resources on the HPC2 cluster can be used by anybody. The community cluster can be accessed via the partition "general". Jobs submitted to this partition are scheduled ONLY when there are idle resources, and the maximum wall time is 12 hours. Usage of this community cluster is open to all users. The usage quota is summarized in the table below, followed by an example submission script.

    Partition | GrpJobs (Max) | GrpNodes (Max) | GrpSubmitJobs (Max) | MaxWallTime
    general | 2 | 6 | 2 | 12 hours
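
As an illustration, a submission script for the community cluster might look like the sketch below; the job name, core count and application command are placeholders, and the requested wall time must stay within the 12-hour limit:

    #!/bin/bash
    #SBATCH --job-name=community_demo   # placeholder job name
    #SBATCH --partition=general         # community cluster partition
    #SBATCH --nodes=1
    #SBATCH --ntasks=24                 # placeholder core count
    #SBATCH --time=12:00:00             # must not exceed the 12-hour limit

    srun ./my_application               # replace with your own command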

Disk Quota

The disk quota for each SSCI and SENG PI group on HPC2 is 2 TB; for other PI groups it is 500 GB. The quota is shared among all members of the group. If usage exceeds the quota, a 24-hour grace period is given to clean up the extra data. The total disk space available in the cluster is 340 TB.


To check the disk usage and quota of your group:

    lfs quota -h -g <your_group> /home
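
To see how much of the group quota your own files account for, the corresponding per-user query can be used, assuming per-user usage is tracked on the same /home file system:

    lfs quota -h -u $USER /home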

     

Group Share Directory

A share directory is assigned to each PI group. Users from the same group can access, create and modify files in the share directory.

To access the share directory:

    cd /home/share/<your_group>

or

    cd $PI_HOME

Note that the group disk quota also applies to the share directory.
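
If files you place in the share directory turn out not to be readable or writable by your group members, their group ownership and permissions can be adjusted as sketched below; the project directory name is a placeholder:

    chgrp -R <your_group> $PI_HOME/myproject   # assign the files to your PI group
    chmod -R g+rwX $PI_HOME/myproject          # grant the group read/write access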

     

Backup

There is NO backup service on the cluster; users are required to manage backups of their data themselves.
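
One simple approach is to pull important data from the cluster to your own machine or departmental storage with rsync; the login host name and paths below are placeholders:

    # Run this on your own machine, not on the cluster.
    rsync -avz <username>@<cluster_login_host>:/home/share/<your_group>/results/ ./hpc2_backup/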

Scratch Files

About 900 GB is available under /tmp on each compute node for local scratch files. Users are advised to make use of this space and to clean up their files as soon as their application finishes. Files in /tmp on all nodes are removed automatically by the system if they have not been accessed for more than 10 days.
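
A common pattern is to create a job-specific scratch directory under /tmp, run the application there, copy the results back to the shared file system, and clean up before the job ends. The sketch below assumes a single-node job, since /tmp is local to each compute node; the partition, core count and commands are placeholders:

    #!/bin/bash
    #SBATCH --job-name=scratch_demo     # placeholder job name
    #SBATCH --partition=standard        # placeholder partition
    #SBATCH --nodes=1                   # /tmp is node-local, so keep the job on one node
    #SBATCH --ntasks=24                 # placeholder core count
    #SBATCH --time=1-00:00:00

    SCRATCH=/tmp/$USER/$SLURM_JOB_ID    # job-specific scratch directory on the local disk
    mkdir -p $SCRATCH
    cd $SCRATCH

    srun $SLURM_SUBMIT_DIR/my_application       # replace with your own command

    cp -r $SCRATCH/output $SLURM_SUBMIT_DIR/    # copy results back to the shared file system
    rm -rf $SCRATCH                             # clean up the local scratch space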