Example: TensorFlow
1. In the first terminal
- Start an interactive session on a GPU node using srun. Replace <pi_account_name> with your own project group account name
srun --partition=gpu --gres=gpu:a30:1 --account=<pi_account_name> --pty bash
- Check the node name shown in the new prompt. For example: gpu01
netid@gpu01:~$
- If you are not using a container, change to a virtual environment by "source activate xxx"
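The srun line in step 1 can be assembled from a few variables so that the account name is set in one place. This is only a sketch: ACCOUNT, GPU_TYPE, GPU_COUNT and the function name build_srun_cmd are hypothetical names chosen for illustration, and the script prints the command for review instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble the srun command from step 1.
# "my_pi_account" is a placeholder, not a real account name.
ACCOUNT="${ACCOUNT:-my_pi_account}"
GPU_TYPE="${GPU_TYPE:-a30}"
GPU_COUNT="${GPU_COUNT:-1}"

build_srun_cmd() {
    # Print the command rather than execute it, so it can be checked first.
    printf 'srun --partition=gpu --gres=gpu:%s:%s --account=%s --pty bash\n' \
        "$GPU_TYPE" "$GPU_COUNT" "$ACCOUNT"
}

build_srun_cmd
```

Once the printed command has been run and the session starts, running hostname on the compute node shows the node name needed for the port mapping later.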
2. (Skip if not using a container) Pull the TensorFlow image if it is not already available.
apptainer pull tensorflow:23.11-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3
3. (Skip if not using a container) Run the TensorFlow image and bind your preferred directory to a mount point in the container. In this example, we map our own home directory to /project in the container (/project does not need to exist in the container beforehand).
apptainer run -B /home/<username>:/project --nv tensorflow:23.11-tf2-py3.sif
Container started successfully:
[username@gpu01]$ apptainer run -B /home/kcalexlam:/project --nv tensorflow:23.11-tf2-py3.sif
================
== TensorFlow ==
================
NVIDIA Release 23.11-tf2 (build 75076557)
TensorFlow Version 2.14.0
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2023 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.183.01.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not detected.
Multi-node communication performance may be reduced.
bash: /opt/shared/spack/share/spack/setup-env.sh: No such file or directory
bash: /usr/share/lmod/lmod/libexec/lmod: No such file or directory
Apptainer>
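Steps 2 and 3 can be combined into a small check that pulls the image only when the .sif file is missing. The following is a sketch under assumptions: needs_pull is a hypothetical helper name, the image is assumed to live in the current directory, and the apptainer commands are only printed so the script is safe to try on any machine.

```shell
#!/usr/bin/env bash
IMAGE="tensorflow:23.11-tf2-py3.sif"
SOURCE="docker://nvcr.io/nvidia/tensorflow:23.11-tf2-py3"

# Hypothetical helper: succeeds (exit status 0) when the image file is absent.
needs_pull() {
    [ ! -f "$1" ]
}

if needs_pull "$IMAGE"; then
    echo "would run: apptainer pull $IMAGE $SOURCE"
else
    echo "image already present: $IMAGE"
fi

# Bind the home directory to /project and enable GPU support, as in step 3.
echo "would run: apptainer run -B $HOME:/project --nv $IMAGE"
```

Remove the echo wrappers to run the commands for real once the printed lines look right.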
4. Run jupyter-lab --allow-root --ip='0.0.0.0' to start JupyterLab from the container and accept connections from any IP address:
Apptainer> jupyter-lab --allow-root --ip='0.0.0.0'
5. Note the token shown after ?token= in the output. You will need it in step 7.
Apptainer> jupyter-lab --allow-root --ip='0.0.0.0'
[I 11:12:39.835 LabApp] jupyter_tensorboard extension loaded.
[I 11:12:40.032 LabApp] JupyterLab extension loaded from /usr/local/lib/python3.10/dist-packages/jupyterlab
[I 11:12:40.032 LabApp] JupyterLab application directory is /usr/local/share/jupyter/lab
[I 11:12:40.034 LabApp] [Jupytext Server Extension] NotebookApp.contents_manager_class is (a subclass of) jupytext.TextFileContentsManager already - OK
[I 11:12:40.037 LabApp] Serving notebooks from local directory: /home/username
[I 11:12:40.037 LabApp] Jupyter Notebook 6.4.10 is running at:
[I 11:12:40.037 LabApp] http://hostname:8888/?token=
[I 11:12:40.037 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 11:12:40.048 LabApp] To access the notebook, open this file in a browser:
file:///home/username/.local/share/jupyter/runtime/nbserver-37636-open.html
Or copy and paste this URL:
http://hostname:8888/?token=
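If you prefer not to copy the token by hand, it can be cut out of the URL line with sed. This is a minimal sketch: extract_token is a hypothetical helper name, and the token abc123def456 in the sample line is made up for illustration (the real token is a much longer string printed by your own session).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value that follows "token=" in a URL or log line.
extract_token() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]token=\([A-Za-z0-9]*\).*/\1/p'
}

# Sample log line with a made-up token; substitute the line from your own output.
line='[I 11:12:40.037 LabApp] http://hostname:8888/?token=abc123def456'
token="$(extract_token "$line")"

# URL to open locally once the port mapping is in place.
echo "http://127.0.0.1:8888/?token=${token}"
```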
6. Open another terminal for a second login. Forward port 8888 from the compute node to your local machine; replace xx in gpuxx with the number of the node from step 1.
ssh username@hpc4.ust.hk -L 8888:gpuxx:8888
The authenticity of host 'hpc4.ust.hk (143.89.184.3)' can't be established.
ED25519 key fingerprint is SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.
This host key is known by the following other names/addresses:
C:\Users\username/.ssh/known_hosts:18: xxxxxxxx
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'hpc4.ust.hk' (ED25519) to the list of known hosts.
Enter passphrase for key 'C:\Users\username/.ssh/id_rsa':
Last login: Tue Jun 25 10:42:45 2024 from xxxxxxxxxx
[kcalexlam@login2 ~]$
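The port mapping command depends on two values that change between sessions: the compute node name and the port Jupyter reports. A small sketch that builds the ssh line from those values is shown here; build_tunnel_cmd is a hypothetical helper name, and the command is printed, not executed. If port 8888 is already in use on the node, Jupyter normally picks another port, so use the port printed in its startup log.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the ssh port-forwarding command.
# $1 = login user name, $2 = compute node name, $3 = port (defaults to 8888)
build_tunnel_cmd() {
    local user="$1" node="$2" port="${3:-8888}"
    printf 'ssh %s@hpc4.ust.hk -L %s:%s:%s\n' "$user" "$port" "$node" "$port"
}

# Default port, node gpu01 as in the example prompt.
build_tunnel_cmd username gpu01
# Same node, but Jupyter reported port 8890 instead.
build_tunnel_cmd username gpu01 8890
```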
7. Open a browser and go to "http://127.0.0.1:8888/?token=????", replacing ???? with the token from step 5.
8. Done.
Reference: HPC Intro course: https://nptel.ac.in/courses/128/106/128106014/