Tutorial for getting set up on the ICME cluster. This is used in the Stanford Course CME 218, Applied Data Science, a hands-on project course for graduate students working on machine learning and data science projects.
If something in this guide is unclear, please contact Elliot Epstein at [email protected]!
The first step is to get access to the ICME cluster. All registered students should have been added to the cluster. Let me know if this is not the case.
If you are using Windows, it's recommended to first install windows subsystem for linux. This can be installed here: https://learn.microsoft.com/en-us/windows/wsl/install
To use ssh to connect to the cluster you will need to be on the Stanford VPN.
Follow the guide here to set up the VPN. You will need to authenticate with one duo mobile to get on to the VPN. Now check that you have access by ssh'ing into the cluster. You can do this by typing the following command into your terminal:
ssh <username>@icme-course-login.stanford.edu
where <username>
is your Stanford username. A message similar to "The authenticity of the host 'icme-course-login.stanford.edu can't be established", are you sure you want to continue..." will appear. Type "yes" and hit enter. You will then be prompted to enter your password. Enter your Stanford password and hit enter. You should now be logged into the cluster.
If you get an error where nothing happens when you try to ssh in to the cluster, you may not be connected to the VPN.
You are currently on the login-node of the cluster. To run computation, you will need to transfer to a compute-node. To do this, type the following command into your terminal:
srun -p gpu-pascal --gres=gpu:1 --pty bash
The partiton is which GPU to use. You must use either the gpu-pascal, gpu-volta, or gpu-turing partition. The gres is the number of GPUs you want to use. The pty bash is to open a bash terminal.
To get two GPUs use the gpu-volta or gpu-turing paritions:
srun -p gpu-volta --gres=gpu:2 --pty bash
You can check that you are on the gpu node by running
nvidia-smi
This will show you which GPUs are available. If you get an error, you are most likely still on the login node.
Exit to the login node by running
control-a d
If you don't need access to GPUs, you can submit to CPU nodes instead:
srun --pty bash
When running code on the cluster, you may have package dependencies (such as Numpy or Pytorch) that you need to install. To manager your dependencies, I recommend to use Anaconda. Anaconda is a package manager for Python. Using Anaconda allows us to for example have different packages installed in different environment, which can be very useful if you are working on multiple projects with conflicting package dependencies.
Conda is already installed on the cluster. It's needs to be initiated first by running
eval "$(/opt/ohpc/pub/compiler/anancoda3/2023.09-0/bin/conda shell.bash hook)"
conda init
If the installation works correctly, the command
conda
should give a menu of options.
My go-to cheat-sheet for anaconda is here
Following the cheat cheet, let's create an anaconda environment.
conda create --name cme_218 python=3.9
This will take a few min.
Now, activate the environment with
source activate cme_218
Packages you install will now be installed on the cme_218 environment.
You can now list all your environments with
conda env list
List the installed packages with
conda list
Let's make sure we are on the Pascal partition. We can now add pytorch and it's dependencies. Start by checking the cuda version on the cluster, this can be done with
nvidia-smi
From this, I can see that the cuda version on the Pascal partition is 12.2. The most recent stable the release of Pytorch use CUDA 11.8, so let's use that version. Now we can install pytorch with the following command (see https://pytorch.org/get-started/locally/)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
This will take more then 10 minutes. You can check that pytorch was successfully installed with GPU support by starting a python terminal
python3
and then importing torch and seeing if a GPU is available
import torch
torch.cuda.is_available()
This should output True. Exit the python termial with
control-d
If you want to modify the code, you can either use a text editor such as emacs (https://www.gnu.org/software/emacs/) or you can use VSCode.
I recommend using VSCode. VSCode can be downloaded here: (https://code.visualstudio.com/)
Once VSCode is installed, you will need to install a few extensions to connect to the cluster.
You need to install the Remote - SSH extension. Press the extensions button in the left sidebar and use the search bar.
You will also need to install the Remote - SSH: Editing Configuration Files extension and te Remote Explorer extension.
Also install the Pylance and Python extensions for linting and debugging tools.
I also recommend to install GitHub Copilot, which you can get for free as a student.
Now, to connect to the cluster, press command-shift-p and type "Remote-SSH: Connect to Host". Then type:
<username>@icme-course-login.stanford.edu
where <username>
is your Stanford username. You will then need to enter your password twice.
You can now open the code and make changes from VSCode.
In addition to editing code, VS Code allows you to view images, render markdown files, and connect to a terminal.
You can open a terminal inside vs code by clicking on the terminal bar on the top left and selecting new terminal.
Type the following command to clone the repository:
git clone https://github.com/Elliotepsteino/ICME-cluster-onboarding.git
This will create a folder called ICME-cluster-onboarding. Navigate into this folder by typing:
cd ICME-cluster-onboarding
To get data on the cluster, you can use the scp
command. This command allows you to transfer files between your local computer and the cluster.
Download the images from the ICME-cluster-onboarding folder in the files on the CME 218 Canvas page (in case you don't have access to the CME 218 canvas page, I have included the images in the /images folder on the repo)
Then upload the images to the cluster by typing the following commands into your local terminal:
scp /local_path_to_image/mnist_image.png <username>@icme-course-login.stanford.edu:~/ICME-cluster-onboarding/
scp /local_path_to_image/mnist_image_rotated.png <username>@icme-course-login.stanford.edu:~/ICME-cluster-onboarding/
where <username>
is your Stanford username. This will upload the image to the ICME-cluster-onboarding folder on the cluster. If you use the images in the /image folder. Move them to the ICME-cluster-onboarding folder with the follwing command from the ICME-cluster-onboarding directory:
mv ./images/mnist_image_rotated.png .
mv ./images/mnist_image.png .
You can use the -r option if you want to move a directory instead of a file onto the cluster, for example
scp -r ./directory_to_move/ <username>@icme-course-login.stanford.edu:~/ICME-cluster-onboarding/
To download a file from the cluster, type the following command into your local terminal:
scp <username>@icme-course-login.stanford.edu:/file_path/<filename> /local_path/
where <username>
is your Stanford username, local_path is the path to where you want to save the file on your local computer, and filename is the name of the file you want to download. You can download the file to the current path by typing .
as the /local_path/.
Screen is a linux program that allows you to run multiple programs at once. It also allows you to run programs in the background, so that you can close the terminal without stopping the program. This is really useful if you have jobs that take a long time to run.
You can create a screen session by typing:
screen -S <screen_name>
where <screen_name>
is the name of your screen session. You can name it anything you want.
Typing this will create a screen session and you will be in the screen session. You can now run programs in the screen session.
To detach from the screen session, type:
control-a d
This takes you back to the terminal. To see all the screen sessions, type:
screen -list
This will show you all the screen sessions. To reattach to an existing screen session, type:
screen -r <screen_name>
where <screen_name>
is the name of your screen session. For example, you can use different conda environments in different screen sessions.
To remove a screen session, first attach to it,
and then do:
control-a k
Sometimes, it's also useful to split your screen into multiple windows, for example if you want to monitor GPU usage while running a program. From within a screen session, you can split the screen vertically by typing:
control-a |
To switch between the different windows, type:
control-a tab
Once on the new tab, you can create a terminal by typing:
control-a c
We will now train a basic Neural Network on the MNIST dataset.
Have a quick look at the code in the mnist_pytorch_example.py code. This code trains a simple neural network on the MNIST dataset. The MNIST dataset contains handwritten digits and the goal is to classify the digits correctly.
First, log onto a GPU node
srun -p gpu-pascal --gres=gpu:1 --pty bash
Let's create a screen session to run the training script in. Type:
screen -S CME_218
Let's make sure the conda environment we created earlier is activated, as the script uses PyTorch. Type:
source activate cme_218
Let's split the screen so we can monitor the GPU usage while training the model. Type:
control-a |
followed by:
control-a tab
and
control-a c
To create a new terminal in the split window. Let's run
nvidia-smi -l 1
This will show the GPU usage, refreshed every second. Navigate back to the original window and run the training script by typing:
python mnist_pytorch_example.py --epochs 2 --save-model
This will download the mnist dataset and train the model for 2 epochs and save the trained model weights. You should see the loss decreasing and arond 1.7 GB of GPU memory being used. The MNIST data will be downloaded in the home directory.
After the training is done, you can run inference on the images you uploaded. You may need to change the image name in the inference_single_image.py file.
python inference_single_image.py
What is the most likely class for the up-side down seven? What about the correctly oriented seven?
You can change which GPU you are using to run a script by typing:
CUDA_VISIBLE_DEVICES=<gpu_id>
If you requested two GPUs, you can use <gpu_id> either 0,1 or 0 or 1.
To print all the environment variables, type:
printenv
This will show which GPUs are available, among other things.
Render markdown file in VSCode:
command-shift-v
Use
htop
To show which programs are currently running. Sometimes, the memory on GPUs is not released properly after a program is done running. You can then kill the relevant processes with F9 in htop. Exit htop with F10.
Scrolling in the terminal while in a screen session can be done with
control-a esc
Exit the scrolling by pressing esc again.
Debug in python by setting breakpoints, by the breakpoint() function, this allows you to step through the code and inspect variables more efficiently than using print statements.
To track performance of ML jobs as they run, it's useful to use logging packages like Weights and Biases (https://wandb.ai/site).
Bash scripts is a useful way to run multiple jobs at once. There is a batch script to run the mnist example on two different GPU nodes, with different learning rate. Run the bash script with
bash run_mnist.sh
This will run the mnist example on two different GPU nodes, with different learning rate. Remember that you must be on a compute node with two GPUs to run this script.
Elliot Epstein
Thanks to Steve Jones and Brian Tempero for helpful tips on the cluster management.