HKUST SuperPOD - A TensorFlow Example
Example: running TensorFlow with JupyterLab on SuperPOD:
1. In the first terminal:
- Run an interactive job on a GPU node using srun. Supply your project group name and the partition you are going to use (e.g. normal):
srun --partition normal --account=<yourgroupname> --gres=gpu:2 --pty $SHELL
netid@dgx-26:~$
- Note the DGX node name (dgx-26 in the above example); you will need it in step 6.
2. (Skip if not using a container) Create the TensorFlow image if it is not already available:
apptainer pull tensorflow:23.11-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3
3. (Skip if not using a container) Run the TensorFlow image and bind your preferred directory to a mount point in the container. In this example, we map our own scratch space to /scratch in the container (/scratch does not need to already exist in the container):
apptainer run -B /scratch/yournetid:/scratch --nv tensorflow:23.11-tf2-py3.sif
4. Start JupyterLab:
jupyter-lab --allow-root --ip='0.0.0.0'
5. Note the token shown in the JupyterLab output; you will need it in the browser later.
6. Open another terminal and log in to SuperPOD a second time. Set up port forwarding between the compute node and your local host, replacing dgx-xx with the node name noted in step 1 (dgx-26 in our case):
ssh yournetid@superpod -L 8888:dgx-xx:8888
7. Open a browser on your local workstation and go to "http://127.0.0.1:8888/?token=????", where ???? is the token from step 5.
8. Done. As a final sanity check, you can run the short TensorFlow snippet below in a notebook cell.
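The steps above prepare the environment but do not themselves contain any TensorFlow code. The following is a minimal sketch, assuming the TensorFlow 2.x build shipped in the 23.11-tf2-py3 image, that you could paste into a JupyterLab cell to confirm the two GPUs requested with --gres=gpu:2 are visible; the exact device list depends on your allocation.

import tensorflow as tf

# List the GPUs TensorFlow can see; with --gres=gpu:2 two devices are expected.
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

# Run a small matrix multiplication on the first GPU as a quick functional test.
if gpus:
    with tf.device('/GPU:0'):
        a = tf.random.normal([1024, 1024])
        b = tf.random.normal([1024, 1024])
        c = tf.matmul(a, b)
    print("Result computed on:", c.device)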