System Overview and Hardware Configuration

The HPC3 cluster is an in-house designed high performance computing (HPC) facility at HKUST, set up in May 2020. As of September 2021, it has 165 CPU compute nodes and 25 GPU compute nodes, interconnected with InfiniBand (IB) at 100 Gbit/s, with 2PB of raw disk storage. In total the cluster provides 7412 CPU cores and 230 GPU cards.
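The headline totals can be reproduced from the per-node-type counts listed in the specifications below; the short sketch here simply sums CPU cores and GPU cards over the node types (no cluster access or external data is assumed).

```python
# Reproduce the aggregate core and GPU counts from the node inventory below.
# Each entry: (node type, number of nodes, CPU cores per node, GPU cards per node).
node_types = [
    ("CPU node",                160, 2 * 20, 0),
    ("Large memory node",         5, 2 * 20, 0),
    ("GPU node (RTX 2080 Ti)",   10, 2 * 8,  8),
    ("GPU node (RTX 2080 Ti)",    3, 2 * 10, 10),
    ("GPU node (RTX 6000)",       1, 2 * 10, 10),
    ("GPU node (RTX 3090)",      11, 2 * 26, 10),
]

total_cores = sum(nodes * cores for _, nodes, cores, _ in node_types)
total_gpus = sum(nodes * gpus for _, nodes, _, gpus in node_types)
print(total_cores, total_gpus)  # prints: 7412 230
```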

The HPC3 cluster design is based on the principle of maximizing performance and available computing resources within the allocated funding. As such, the design emphasizes performance, the number of CPU/GPU nodes, and maximum raw disk storage, while redundancy is provided only for essential equipment.

Details

The HPC3 cluster consists of the following equipment:

Master & login node: The master node has been setup with OpenHPC cluster management system for cluster management, job scheduling and monitoring. The login node is the entry point for user to login to compile and submit their job

Master node
  Number of nodes: 1
  CPU: 2x Intel Xeon Gold 6230 (20-core/2.1 GHz/28MB cache)
  RAM: 6x 16GB DDR4-2933
  Storage: 2x 2.4TB 12Gb/s 10K rpm hot-swap SAS HDD; 1x 4TB 12Gb/s 7.2K rpm SAS HDD; Inspur SAS3008 (IMR) 12Gb/s RAID adapter
  Network: dual-port EDR (100Gbps) InfiniBand (IB) network card; dual-port 10Gbps Ethernet network card with SR SFP+ connector; dual 1Gb Ethernet adapter

Login node
  Number of nodes: 1
  CPU: 2x Intel Xeon Gold 5217 (8-core/3.0 GHz/11MB cache)
  RAM: 6x 16GB DDR4-2933
  Storage: 2x 2.4TB 12Gb/s 10K rpm hot-swap SAS HDD; 1x 4TB 12Gb/s 7.2K rpm SAS HDD; Inspur SAS3008 (IMR) 12Gb/s RAID adapter
  Network: dual-port EDR (100Gbps) InfiniBand (IB) network card; dual-port 10Gbps Ethernet network card with SR SFP+ connector; dual 1Gb Ethernet adapter
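As a minimal sketch of how a job reaches the scheduler from the login node, the snippet below wraps the scheduler's command-line client from Python. It assumes a Slurm scheduler, which OpenHPC commonly deploys but which this page does not name explicitly; the batch script "my_job.sh" is a hypothetical placeholder.

```python
# Minimal sketch: submit a batch job from the login node by calling the
# scheduler's CLI. Assumes Slurm (commonly deployed with OpenHPC); the
# script name "my_job.sh" is a hypothetical placeholder.
import subprocess

result = subprocess.run(
    ["sbatch", "my_job.sh"],   # sbatch prints e.g. "Submitted batch job 12345"
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```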

 

Compute node: 160 CPU compute nodes, each with two 20-core Intel Xeon Gold 6230 processors and 192GB of physical memory

CPU node
  Number of nodes: 160
  CPU: 2x Intel Xeon Gold 6230 (20-core/2.1 GHz/28MB cache)
  RAM: 12x 16GB DDR4-2933
  Storage: 2x 2.4TB 12Gb/s 10K rpm hot-swap SAS HDD; Inspur SAS3008 (IMR) 12Gb/s RAID adapter
  Network: single-port EDR (100Gbps) InfiniBand (IB) network card; dual 1Gb Ethernet adapter

 

Large memory node: 5 compute nodes, each with the same CPUs as above and 1.5TB of physical memory

Large memory node
  Number of nodes: 5
  CPU: 2x Intel Xeon Gold 6230 (20-core/2.1 GHz/28MB cache)
  RAM: 24x 64GB DDR4-2933
  Storage: 2x 2.4TB 12Gb/s 10K rpm hot-swap SAS HDD; Inspur SAS3008 (IMR) 12Gb/s RAID adapter
  Network: single-port EDR (100Gbps) InfiniBand (IB) network card; dual 1Gb Ethernet adapter

 

GPU node: 25 GPU nodes with different CPU and GPU models, as summarized below; a sketch for checking the GPU model on an allocated node follows the specifications.

GPU node (RTX 2080 Ti)
  Number of nodes: 10
  CPU: 2x Intel Xeon Gold 6244 (8-core/3.6 GHz/25MB cache)
  RAM: 12x 32GB DDR4-2933
  GPU: 8x Nvidia GeForce RTX 2080 Ti
  Storage: 2x 960GB SATA 6Gb/s SSD; Inspur SAS3008 (IMR, no cache) 12Gb/s RAID adapter
  Network: 1x single-port EDR (100Gbps) InfiniBand (IB) network card; 1x dual 1Gb Ethernet adapter

GPU node (RTX 2080 Ti)
  Number of nodes: 3
  CPU: 2x Intel Xeon Silver 4210 (10-core/2.2 GHz/13.75MB cache)
  RAM: 8x 32GB DDR4-2933
  GPU: 10x Nvidia GeForce RTX 2080 Ti
  Storage: 2x 960GB SATA 6Gb/s SSD; Inspur SAS3008 (IMR, no cache) 12Gb/s RAID adapter
  Network: 1x single-port EDR (100Gbps) InfiniBand (IB) network card; 1x dual 1Gb Ethernet adapter

GPU node (RTX 6000)
  Number of nodes: 1
  CPU: 2x Intel Xeon Silver 4210 (10-core/2.2 GHz/13.75MB cache)
  RAM: 16x 32GB DDR4-2933
  GPU: 10x Nvidia Quadro RTX 6000
  Storage: 2x 960GB SATA 6Gb/s SSD; Inspur SAS3008 (IMR, no cache) 12Gb/s RAID adapter
  Network: 1x single-port EDR (100Gbps) InfiniBand (IB) network card; 1x dual 1Gb Ethernet adapter

GPU node (RTX 3090)
  Number of nodes: 11
  CPU: 2x Intel Xeon Gold 6230R (26-core/2.1 GHz/35.75MB cache)
  RAM: 8x 32GB DDR4-2933
  GPU: 10x Nvidia GeForce RTX 3090
  Storage: 2x 2TB SATA 6Gb/s 7.2K rpm HDD
  Network: AOC-MCX555A-ECAT ConnectX-5 100GbE single-port QSFP28 network card
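Since the GPU nodes carry different GPU models, it can be useful to confirm which card type a job actually received. The sketch below simply queries nvidia-smi from Python; it assumes the NVIDIA driver (and therefore nvidia-smi) is present on the allocated node, and the printed example value is illustrative only.

```python
# Minimal sketch: report the GPU model(s) and memory on the current node by
# querying nvidia-smi. Assumes the NVIDIA driver (and hence nvidia-smi) is
# installed on the allocated GPU node.
import subprocess

query = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True,
    text=True,
    check=True,
)
for line in query.stdout.strip().splitlines():
    print(line)  # e.g. "NVIDIA GeForce RTX 3090, 24576 MiB"
```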

 

File systems: (i) a parallel file system with 2PB of raw storage running the BeeGFS parallel cluster file system; (ii) an archive (NFS) file system with 2PB of storage serving as secondary storage for files
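As a minimal sketch of checking how full these file systems are, the snippet below reports usage for their mount points with Python's standard library. The mount point paths are hypothetical placeholders; the actual paths are not stated on this page.

```python
# Minimal sketch: report usage of the parallel (BeeGFS) and archive (NFS)
# file systems. The mount points below are hypothetical placeholders.
import shutil

mounts = {
    "BeeGFS parallel file system": "/scratch",   # hypothetical mount point
    "Archive (NFS) file system":   "/archive",   # hypothetical mount point
}

for label, path in mounts.items():
    usage = shutil.disk_usage(path)
    used_tb = usage.used / 1e12
    total_tb = usage.total / 1e12
    print(f"{label}: {used_tb:.1f} TB used of {total_tb:.1f} TB")
```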

 

Interconnect: All servers are interconnected with Mellanox EDR InfiniBand in a fat-tree topology with a blocking factor of 2, i.e. each leaf switch has half as much uplink bandwidth to the spine as downlink bandwidth to its nodes.