A script for combining the power of tmux with TPU VMs. The script currently handles TPU-v3, TPU-v4, and TPU-v5. The main idea is to use tmux to execute identical commands on multiple VMs.
# Download ttconnect
wget https://raw.githubusercontent.com/peregilk/ttconnect/main/ttconnect
# Make the program executable
chmod a+x ttconnect
# Optionally copy it to a place in your path (like /usr/local/bin/)

The script is made for handling TPU pods. It automatically opens a tmux window with a tile for each of the TPU VMs, allowing them to be controlled both in parallel and individually.
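To make the one-pane-per-VM idea concrete, here is a rough sketch of how such a window could be built with plain tmux commands. It is not the ttconnect source: the function name `open_tpu_panes`, the session naming, the explicit worker count, and the `TMUX_BIN` override are all assumptions made for illustration.

```shell
# Sketch only: this is NOT the ttconnect source.
# open_tpu_panes NAME WORKERS [ZONE]  (hypothetical helper)
# Set TMUX_BIN=echo to print the tmux calls instead of running them.
open_tpu_panes() (
  name="$1"
  workers="$2"
  zone="${3:-us-central2-b}"
  session="tpu-${name}"
  tmux_bin="${TMUX_BIN:-tmux}"
  ssh_base="gcloud alpha compute tpus tpu-vm ssh ${name} --zone=${zone}"

  # Worker 0 runs in the first pane of a new, detached session.
  "$tmux_bin" new-session -d -s "$session" "${ssh_base} --worker=0"

  # One extra pane per remaining worker; re-tile after each split
  # so that no pane becomes too small to split again.
  i=1
  while [ "$i" -lt "$workers" ]; do
    "$tmux_bin" split-window -t "$session" "${ssh_base} --worker=${i}"
    "$tmux_bin" select-layout -t "$session" tiled
    i=$((i + 1))
  done

  # Mirror keystrokes to every pane, then attach to the session.
  "$tmux_bin" set-window-option -t "$session" synchronize-panes on
  "$tmux_bin" attach-session -t "$session"
)
```

Running it with `TMUX_BIN=echo` first is a cheap way to check the generated commands before any connection is opened.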
# Open a connection to an existing TPU VM or set of TPU VMs.
# If no zone is provided, it defaults to us-central2-b
./ttconnect TPU-name [zone]
This command opens connections to all the workers in a tmux window with split panes. A typical workspace for a v4-32 looks like this:
Depending on how many panes are open, it might be beneficial to change the layout. You can cycle through the five layout modes with this command:
C-b <space>

The default setting is synchronized panes: whatever you type in one pane is typed in all the panes. However, if you want to make a change to only one of the TPUs, you can turn off this behaviour with:
C-b : setw synchronize-panes off

It might happen that one of the TPUs dies for some reason, and it might not be the one in focus. To target specific panes, there are a few tricks that I like to use. First, you can always move to another pane using C-b <arrow>. However, in many cases that pane is too small to work in. If you have multiple VMs running, the first step is to switch to the main-horizontal layout (see above). After that, use the following command to see the id of each pane:
C-b q

Once you know the id of the target pane, you can use the command below, replacing N with that id:
C-b : swap-pane -t N

You can detach from the window with:
C-b d

However, if you really want to zap the entire window, you will have to do:
C-b : kill-window

You can then use ttconnect to connect to the same pod again with a fresh login.
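The key bindings above also have command-line equivalents, which can be handy when driving the window from a second local terminal or from a script. The following is a minimal sketch using standard tmux subcommands. The helper names and the `TMUX_BIN` override are hypothetical, `WINDOW` is a tmux window target such as `mysession:0`, and pane 0 is assumed to be the large main pane (true with the default `pane-base-index`).

```shell
# Illustrative helpers, not part of ttconnect.
# Set TMUX_BIN=echo to print the tmux calls instead of running them.

# sync_panes WINDOW on|off : toggle typing into all panes at once.
sync_panes() {
  "${TMUX_BIN:-tmux}" set-window-option -t "$1" synchronize-panes "$2"
}

# list_panes WINDOW : show each pane id and the command it is running.
list_panes() {
  "${TMUX_BIN:-tmux}" list-panes -t "$1" -F '#{pane_index}: #{pane_current_command}'
}

# focus_pane WINDOW N : switch to main-horizontal, then swap pane N
# into the main pane (assumed to be pane 0).
focus_pane() {
  "${TMUX_BIN:-tmux}" select-layout -t "$1" main-horizontal
  "${TMUX_BIN:-tmux}" swap-pane -s "$1.$2" -t "$1.0"
}
```

For example, `sync_panes mysession:0 off` followed by `focus_pane mysession:0 3` would turn off synchronized typing and bring pane 3 into the large pane.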
In rare cases, a script crashes. If you don't want to recreate the TPUs/VMs, the following commands are really useful:
gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="sudo pkill -9 python"

In some very rare cases, I have experienced that stuck programs can still prevent the training scripts from restarting. This is my last trick:
gcloud alpha compute tpus tpu-vm ssh MyName --project=MyProject-11111 --zone=MyZone --worker=all --command="ps ax | grep python | grep -v grep | awk '{print \$1}' | xargs -r sudo kill -9"

For more advanced use, please refer to the tmux documentation.
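Since the two gcloud calls above differ only in the remote command, they can be wrapped in a small helper if you use them often. This is only a sketch: `tpu_all`, `tpu_kill_python`, and the `GCLOUD_BIN` override are hypothetical names, and the gcloud flags are the same ones used in the commands above.

```shell
# Hypothetical helpers, not part of ttconnect.
# Set GCLOUD_BIN=echo to preview the call without contacting any VM.

# tpu_all NAME PROJECT ZONE COMMAND : run COMMAND on every worker.
tpu_all() (
  name="$1"
  project="$2"
  zone="$3"
  remote_cmd="$4"
  "${GCLOUD_BIN:-gcloud}" alpha compute tpus tpu-vm ssh "$name" \
    --project="$project" --zone="$zone" --worker=all \
    --command="$remote_cmd"
)

# tpu_kill_python NAME PROJECT ZONE : same clean-up as the first command above.
tpu_kill_python() (
  tpu_all "$1" "$2" "$3" "sudo pkill -9 python"
)
```

Usage would look like `tpu_kill_python MyName MyProject-11111 MyZone`.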
This is really just a tmux tip, but it seems like a lot of tmux users simply are not aware of its most useful feature. To list all sessions:
C-b s

Feel free to modify the script and to add features. If you come up with improvements, I will be glad to add them to the script. Please send any comments to per@capia.no.
