Submitting jobs to partitions with checkpoints
Normal project users can submit jobs to the “large-project” partition with the same resource limits as the “project” partition, but these jobs may be preempted by higher-priority jobs. A job can run for at least 2 hours before it becomes eligible for preemption.
You are advised to enable checkpointing in your code so that a preempted job can resume from its last saved state. Refer to the following webpages for details.
- TensorFlow: https://www.tensorflow.org/guide/checkpoint
- PyTorch: https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html
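The resume logic behind checkpointing can be sketched at the job-script level, independent of any framework. The example below is a minimal sketch, not the cluster's prescribed method: `CKPT_FILE`, `TOTAL_STEPS` and `run_steps` are hypothetical names, and it assumes that a preempted job is started again (manually or by the scheduler) with the same script. The framework-specific saving and loading of model state is covered by the pages linked above.

```shell
#!/bin/bash
# Sketch of a resumable loop: progress is saved after every step so that a
# job that was interrupted and started again continues where it stopped.
# CKPT_FILE and TOTAL_STEPS are hypothetical names chosen for this example.
CKPT_FILE="${CKPT_FILE:-./checkpoint.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

run_steps() {
    local start=0 step
    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$CKPT_FILE" ]; then
        start=$(cat "$CKPT_FILE")
    fi
    for ((step = start; step < TOTAL_STEPS; step++)); do
        echo "running step $((step + 1)) of $TOTAL_STEPS"
        # ... one unit of real work would go here ...
        # Write to a temporary file first, then rename it, so that an
        # interruption never leaves a half-written checkpoint behind.
        echo "$((step + 1))" > "$CKPT_FILE.tmp"
        mv "$CKPT_FILE.tmp" "$CKPT_FILE"
    done
}

run_steps
```

Running the script a second time after it has finished performs no further steps, because the checkpoint already records the last step as completed.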
Using "sbatch" for jobs with long execution time
Jobs with a long execution time should be submitted as SLURM batch scripts using sbatch.
The default wall time is 8 hours.
The maximum wall time is 3 days.
If you want to run a job for longer than 8 hours, please specify the wall time in your SLURM script, for example:
#SBATCH --time=72:00:00
or
#SBATCH --time=3-00:00:00
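For context, a complete batch script using this directive might look like the sketch below. The partition, account placeholder and GPU request are copied from the interactive srun sample on this page. The job name and the workload line are hypothetical and should be replaced with your own.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=long-job
# Partition and account placeholder as in the srun sample on this page.
#SBATCH --partition=normal
#SBATCH --account=xxx
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
# 3 days, the maximum wall time.
#SBATCH --time=3-00:00:00

# SLURM_JOB_ID is set by SLURM inside a job; the fallback value is used
# only when the script is run outside the scheduler.
start_msg="Job ${SLURM_JOB_ID:-unknown} started on $(hostname)"
echo "$start_msg"

# Replace the line below with the actual workload.
echo "workload goes here"
```

Saved as long_job.sh (a hypothetical file name), the script would be submitted with `sbatch long_job.sh`.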
Interactive jobs with a maximum wall time of 2 hours for the large and normal partitions
For interactive jobs started with the srun and salloc SLURM commands, the SLURM scheduler resets the wall time to 2 hours. For jobs that need a longer wall time, please use sbatch instead. Below is a sample interactive job that uses srun to request 2 GPUs.
srun --partition normal --account=xxx --nodes=1 --gpus-per-node=2 --pty bash
Interactive jobs are not allowed in the cpu partition.
Utilizing the right partition for your jobs
- The CPU partition is recommended for tasks such as data processing or data visualization, which require CPUs rather than GPUs. CPU machines are available in the HKUST SuperPOD cluster through the partition named “cpu”. However, CPU nodes cannot access scratch space. To run your jobs in the CPU partition, specify the parameter "--partition cpu" in your command.
- The debug partition is designed for interactive tasks such as debugging and brief testing. Use the parameter "--partition debug" to run your jobs in the debug partition. If you require multiple GPUs for debugging, you may use the project and large-project partitions. However, the wall time will be limited to 4 hours.
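The two partition parameters above can be written out as full commands. The sketch below only prints the commands rather than running them, so it can be tried away from the cluster. `partition_command` is a hypothetical helper, `xxx` is the account placeholder used elsewhere on this page, and `job.sh` is a hypothetical batch script. The cpu example uses sbatch because interactive jobs are not allowed in that partition.

```shell
#!/bin/bash
# Print (not run) an example submission command for a given partition.
# partition_command, job.sh and the account "xxx" are placeholders.
partition_command() {
    case "$1" in
        cpu)
            # Batch submission only: the cpu partition has no interactive jobs.
            echo "sbatch --partition cpu --account=xxx job.sh"
            ;;
        debug)
            # Interactive shell for debugging and brief testing.
            echo "srun --partition debug --account=xxx --nodes=1 --pty bash"
            ;;
        *)
            echo "unknown partition: $1" >&2
            return 1
            ;;
    esac
}

partition_command cpu
partition_command debug
```

Copying a printed line into a shell on the cluster runs the corresponding job.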