Commit cf99ce5

docs: require single TPU process on dev TPUs (#3120)
## Summary

- Add explicit dev TPU concurrency guidance.
- State that only one process can run TPU code at a time per dev TPU VM.
- Clarify that remote SSH inner-loop TPU commands must run sequentially.

## Validation

- `./infra/pre-commit.py --all-files` (fails in the current workspace due to unrelated, pre-existing Black issues in `lib/levanter` files).
- Commit-level hooks passed for this docs-only commit.
1 parent e2ed175 commit cf99ce5

1 file changed: +7 -0 lines changed

docs/dev-guide/dev_tpu.md

Lines changed: 7 additions & 0 deletions
@@ -6,6 +6,12 @@ You will need to setup your SSH key in `gcloud` to get started.
 It is usually faster than wiring a full Ray job when you want quick iteration. It is less good if you want to run many different commands in parallel,
 or if you want to run a long experiment and not worry about the TPU going away.

+## Critical concurrency rule
+
+Run at most one TPU job at a time on a given dev TPU VM. Only one process can run TPU code at a time on the same dev TPU.
+Do not launch concurrent TPU commands (including in separate shells, tmux panes, or background jobs) against one dev TPU;
+queue them and run sequentially instead.
+
 ## What it does

 - `allocate`: reserves a TPU VM and keeps it alive while the command runs. It also creates an SSH alias for the TPU and writes config to `~/.ssh/config` so you can connect easily.
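
Note on the rule added above: one possible way to enforce "one TPU process at a time" across separate shells, tmux panes, and background jobs is an advisory file lock taken on the dev TPU VM itself. The sketch below is only an illustration and is not part of the dev TPU tooling. The lock path and the names `tpu_lock` and `run_exclusive` are hypothetical, and it assumes a Unix host where `fcntl.flock` is available.

```python
import fcntl
import os
import subprocess
import sys
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location; the dev TPU tooling does not define one.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "dev_tpu.lock")


@contextmanager
def tpu_lock(path=LOCK_PATH, blocking=True):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    With blocking=False, raises BlockingIOError if another holder exists.
    """
    with open(path, "w") as handle:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        fcntl.flock(handle, flags)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_exclusive(argv, path=LOCK_PATH):
    """Wait for the lock, run one command to completion, return its exit code."""
    with tpu_lock(path):
        return subprocess.run(argv).returncode
```

Because the lock is advisory, it only helps if every TPU command on the VM is launched through the same wrapper; a command started without it is not blocked.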
@@ -118,6 +124,7 @@ source ~/.local/bin/env
 ```

 Then run multiple commands directly on remote.
+Run them one at a time when they execute TPU code.

 ### 3) Pull profiles and traces
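
Note on "run them one at a time": a minimal sketch of queuing remote commands from the local machine so that each finishes before the next starts. The helper names `over_ssh` and `run_sequentially` are hypothetical, and the sketch assumes the SSH alias written by `allocate` is usable with a plain `ssh <alias> <command>` invocation.

```python
import subprocess
import sys


def over_ssh(alias, remote_command):
    """Build an argv that runs one shell command string on the remote host."""
    return ["ssh", alias, remote_command]


def run_sequentially(argvs):
    """Run each argv to completion before starting the next.

    Stops at the first non-zero exit code and returns the exit codes of
    the commands that were actually run.
    """
    codes = []
    for argv in argvs:
        code = subprocess.run(argv).returncode  # blocks until the command exits
        codes.append(code)
        if code != 0:
            break
    return codes
```

For example, `run_sequentially([over_ssh("my-dev-tpu", "python step_one.py"), over_ssh("my-dev-tpu", "python step_two.py")])` would start the second script only after the first exits successfully. The alias and script names here are placeholders.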
