Commit 921848a

Merge pull request #2768 from AI-Hypercomputer:jackyf/docs/fix-tpu-colab-setup
PiperOrigin-RevId: 841831128
2 parents c75f123 + f2cb1af commit 921848a

4 files changed: +157 -211 lines changed

docs/guides.md

Lines changed: 1 addition & 0 deletions
@@ -23,4 +23,5 @@ guides/optimization.md
2323
guides/data_input_pipeline.md
2424
guides/checkpointing_solutions.md
2525
guides/monitoring_and_debugging.md
26+
guides/run_python_notebook.md
2627
```

docs/guides/run_python_notebook.md

Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
1+
# Run MaxText Python Notebooks on TPUs
2+
3+
This guide gives step-by-step instructions for running MaxText Python notebooks on two platforms: Google Colab and a local JupyterLab environment.
4+
5+
## 📑 Table of Contents
6+
7+
- [Prerequisites](#prerequisites)
8+
- [Method 1: Google Colab with TPU](#method-1-google-colab-with-tpu)
9+
- [Method 2: Local Jupyter Lab with TPU](#method-2-local-jupyter-lab-with-tpu)
10+
- [Available Examples](#available-examples)
11+
- [Common Pitfalls & Debugging](#common-pitfalls--debugging)
12+
- [Support and Resources](#support-and-resources)
13+
- [Contributing](#contributing)
14+
15+
## Prerequisites
16+
17+
Before starting, make sure you have:
18+
19+
- ✅ Basic familiarity with Jupyter, Python, and Git
20+
21+
**For Method 2 (Local Jupyter Lab) only:**
22+
- ✅ A Google Cloud Platform (GCP) account with billing enabled
23+
- ✅ TPU quota available in your region (check under IAM & Admin → Quotas)
24+
- ✅ `tpu.nodes.create` permission to create a TPU VM
25+
- ✅ gcloud CLI installed locally
26+
- ✅ Firewall rules open for port 8888 (Jupyter) if accessing directly
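
The local tooling in the list above can be checked with a few lines of shell. This is a minimal sketch: the helper name `have_tool` is hypothetical, and checking for `ssh` assumes the gcloud CLI uses your system SSH client. It only confirms that the binaries are on your `PATH`; it says nothing about billing, quota, or IAM permissions.

```bash
# Hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report each local tool that Method 2 relies on.
for tool in gcloud ssh; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done
```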
27+
28+
## Method 1: Google Colab with TPU
29+
30+
This is the fastest way to run MaxText Python notebooks without managing infrastructure.
31+
32+
**⚠️ IMPORTANT NOTE ⚠️**
33+
The free tier of Google Colab provides access to a `v5e-1` TPU, but this access is not guaranteed and is subject to availability and usage limits.
34+
35+
Before proceeding, please verify that the specific notebook you are running works reliably on the free-tier TPU resources. If you encounter frequent disconnections or resource limitations, you may need to:
36+
37+
* Upgrade to a Colab Pro or Pro+ subscription for more stable and powerful TPU access.
38+
39+
* Switch to the local Jupyter Lab setup (Method 2), which gives you access to a more powerful TPU machine.
40+
41+
### Step 1: Choose an Example
42+
1.a. Visit the [MaxText examples directory](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/MaxText/examples) on GitHub.
43+
44+
1.b. Find the notebook you want to run (e.g., `sft_qwen3_demo.ipynb`) and copy its URL.
45+
46+
### Step 2: Import into Colab
47+
2.a. Go to [Google Colab](https://colab.research.google.com/) and sign in.
48+
49+
2.b. Select **File** → **Open Notebook**.
50+
51+
2.c. Select the **GitHub** tab.
52+
53+
2.d. Paste the target `.ipynb` link you copied in step 1.b and press Enter.
54+
55+
### Step 3: Enable TPU Runtime
56+
57+
3.a. **Runtime** → **Change runtime type**
58+
59+
3.b. Select your desired **TPU** under **Hardware accelerator**
60+
61+
3.c. Click **Save**
62+
63+
### Step 4: Run the Notebook
64+
Follow the instructions within the notebook cells to install dependencies and run the training/inference.
65+
66+
## Method 2: Local Jupyter Lab with TPU
67+
68+
You can run Python notebooks on a local JupyterLab environment, giving you full control over your computing resources.
69+
70+
### Step 1: Set Up TPU VM
71+
72+
In Google Cloud Console:
73+
74+
1.a. **Compute Engine** → **TPU** → **Create TPU**
75+
76+
1.b. Example config:
77+
- **Name:** `maxtext-tpu-node`
78+
- **TPU type:** Choose your desired TPU type
79+
- **Runtime Version:** `tpu-ubuntu2204-base` (or other compatible runtime)
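
If you prefer the command line to the Cloud Console, the same example config can be expressed with the gcloud CLI. This is a sketch under assumptions: `YOUR_ZONE` and `YOUR_TPU_TYPE` are placeholders that are not part of this guide's config, and you should confirm the flags against `gcloud compute tpus tpu-vm create --help` for your CLI version. The snippet only prints the command so you can review it before running it.

```bash
# Values from the example config; ZONE and ACCELERATOR_TYPE are placeholders.
TPU_NAME="maxtext-tpu-node"
ZONE="YOUR_ZONE"
ACCELERATOR_TYPE="YOUR_TPU_TYPE"
RUNTIME_VERSION="tpu-ubuntu2204-base"

# Assemble the create command as a string (dry run: nothing is created here).
create_cmd="gcloud compute tpus tpu-vm create ${TPU_NAME} --zone=${ZONE} --accelerator-type=${ACCELERATOR_TYPE} --version=${RUNTIME_VERSION}"

# Print it for review; copy and run it once the placeholders are filled in.
echo "${create_cmd}"
```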
80+
81+
### Step 2: Connect with Port Forwarding
82+
Run the following command on your local machine:
83+
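> **Note**: The `--` separator before the `-L` flag is required: everything after it is passed to the underlying `ssh` command. The `-L` flag forwards port 8888 on the TPU VM to port 8888 on your local machine over the encrypted SSH connection.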
84+
85+
```bash
gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L 8888:localhost:8888
```
88+
89+
> **Note**: If you get a "bind: Address already in use" error, port 8888 is busy on your local computer. Change the first number to a different port, e.g., `-L 9999:localhost:8888`. You will then access Jupyter at `localhost:9999`.
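
To make the port substitution less error-prone, the tunnel command can be generated from the local port you want. This is a minimal sketch; the helper name `tunnel_cmd` is hypothetical, and `YOUR_ZONE` remains a placeholder. Only the first number of the `-L` argument changes, because Jupyter still listens on 8888 on the TPU VM.

```bash
# Hypothetical helper: print the SSH tunnel command for a given local port.
# The remote side stays on 8888, where Jupyter listens on the TPU VM.
tunnel_cmd() {
  local_port="${1:-8888}"
  echo "gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L ${local_port}:localhost:8888"
}

tunnel_cmd        # default: local port 8888
tunnel_cmd 9999   # alternative when 8888 is busy locally
```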
90+
91+
### Step 3: Install Dependencies
92+
93+
Run the following commands on your TPU VM:
94+
95+
```bash
sudo apt update && sudo apt upgrade -y
sudo apt install python3-pip python3-dev git -y
pip3 install jupyterlab
```
100+
101+
### Step 4: Start Jupyter Lab
102+
103+
```bash
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
106+
107+
### Step 5: Access the Notebook
108+
5.a. Look at the terminal output for a URL that looks like: `http://127.0.0.1:8888/lab?token=...`
109+
110+
5.b. Copy that URL.
111+
112+
5.c. Paste it into your **local computer's browser**.
113+
* **Important:** If you changed the port in Step 2 (e.g., to `9999`), you must manually replace `8888` in the URL with `9999`.
114+
* *Example:* `http://127.0.0.1:9999/lab?token=...`
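
The port replacement described in step 5.c can also be done with `sed` rather than by hand. This is a sketch, assuming the URL has the `http://127.0.0.1:8888/lab?token=...` shape shown above; the variable names are hypothetical, and the token is left elided because it must come from your own terminal output.

```bash
# URL as printed by Jupyter on the TPU VM; the token is elided here and
# must be copied from your own terminal output.
jupyter_url="http://127.0.0.1:8888/lab?token=..."
local_port=9999   # the local port chosen in Step 2

# Replace only the port that follows the host, leaving the token untouched.
local_url=$(printf '%s\n' "$jupyter_url" | sed "s#^\(http://[^:/]*\):8888/#\1:${local_port}/#")
echo "$local_url"
```

Anchoring the pattern to the host portion keeps the substitution from touching a token that happens to contain `8888`.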
115+
116+
117+
## Available Examples
118+
119+
### Supervised Fine-Tuning (SFT)
120+
121+
- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
122+
- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
123+
124+
### Reinforcement Learning (GRPO/GSPO) Training
125+
126+
- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
127+
128+
## Common Pitfalls & Debugging
129+
130+
| Issue | Solution |
|-------|----------|
| ❌ TPU runtime mismatch | Check that the TPU runtime version matches the VM image |
| ❌ Colab disconnects | Save checkpoints to GCS or Drive regularly |
| ❌ "RESOURCE_EXHAUSTED" errors | Use a smaller batch size, or a v5e-8 instead of a v5e-1 |
| ❌ Firewall blocked | Ensure port 8888 is open, or use SSH tunneling (Method 2, Step 2) |
| ❌ Path confusion | In Colab use `/content/maxtext`; in TPU VM use `~/maxtext` |
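
For the path confusion row, a script that runs in both environments can pick the directory at runtime. This is a rough sketch: treating the presence of `/content` as a sign of Colab is a heuristic assumption, and `MAXTEXT_DIR` is a hypothetical variable name rather than something MaxText itself reads.

```bash
# Heuristic (assumption): Colab runtimes have a /content directory,
# while a TPU VM keeps the checkout under the user's home directory.
if [ -d /content ]; then
  MAXTEXT_DIR="/content/maxtext"
else
  MAXTEXT_DIR="${HOME}/maxtext"
fi

echo "Using MaxText checkout at: ${MAXTEXT_DIR}"
```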
137+
138+
## Support and Resources
139+
140+
- 📘 [MaxText Documentation](https://maxtext.readthedocs.io/)
141+
- 💻 [Google Colab](https://colab.research.google.com)
142+
- [Cloud TPU Docs](https://cloud.google.com/tpu/docs)
143+
- 🧩 [Jupyter Lab](https://jupyterlab.readthedocs.io)
144+
145+
## Contributing
146+
147+
If you encounter issues or have improvements for this guide, please:
148+
149+
1. Open an issue on the MaxText repository
150+
2. Submit a pull request with your improvements
151+
3. Share your experience in the discussions
152+
153+
---
154+
155+
**Happy Training! 🚀**

docs/tutorials/post_training_index.md

Lines changed: 1 addition & 2 deletions
@@ -49,14 +49,13 @@ Pathways supercharges RL with:
4949

5050
## Getting started
5151

52-
Start your Post-Training journey through quick experimentation with our [Google Colabs](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/how_to_run_colabs.html) or our Production level tutorials for [SFT](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html) and [RL](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
52+
Start your Post-Training journey through quick experimentation with [Python Notebooks](https://maxtext.readthedocs.io/en/latest/guides/run_python_notebook.html) or our Production level tutorials for [SFT](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/sft_on_multi_host.html) and [RL](https://maxtext.readthedocs.io/en/latest/tutorials/posttraining/rl_on_multi_host.html).
5353

5454
## More tutorials
5555

5656
```{toctree}
5757
:maxdepth: 1
5858
59-
posttraining/how_to_run_colabs.md
6059
posttraining/sft.md
6160
posttraining/sft_on_multi_host.md
6261
posttraining/rl.md

docs/tutorials/posttraining/how_to_run_colabs.md

Lines changed: 0 additions & 209 deletions
This file was deleted.
