| 1 | +# Run MaxText Python Notebooks on TPUs |
| 2 | + |
| 3 | +This guide gives step-by-step instructions for running MaxText Python notebooks on two platforms: Google Colab and a local JupyterLab environment. |
| 4 | + |
| 5 | +## 📑 Table of Contents |
| 6 | + |
| 7 | +- [Prerequisites](#prerequisites) |
| 8 | +- [Method 1: Google Colab with TPU](#method-1-google-colab-with-tpu) |
| 9 | +- [Method 2: Local Jupyter Lab with TPU](#method-2-local-jupyter-lab-with-tpu) |
| 10 | +- [Available Examples](#available-examples) |
| 11 | +- [Common Pitfalls & Debugging](#common-pitfalls--debugging) |
| 12 | +- [Support and Resources](#support-and-resources) |
| 13 | +- [Contributing](#contributing) |
| 14 | + |
| 15 | +## Prerequisites |
| 16 | + |
| 17 | +Before starting, make sure you have: |
| 18 | + |
| 19 | +- ✅ Basic familiarity with Jupyter, Python, and Git |
| 20 | + |
| 21 | +**For Method 2 (Local Jupyter Lab) only:** |
| 22 | +- ✅ A Google Cloud Platform (GCP) account with billing enabled |
| 23 | +- ✅ TPU quota available in your region (check under IAM & Admin → Quotas) |
| 24 | +- ✅ `tpu.nodes.create` permission to create a TPU VM |
| 25 | +- ✅ `gcloud` CLI installed locally |
| 26 | +- ✅ Firewall rules open for port 8888 (Jupyter), only if you access Jupyter directly instead of through the SSH tunnel in Step 2 |
| 27 | + |
| 28 | +## Method 1: Google Colab with TPU |
| 29 | + |
| 30 | +This is the fastest way to run MaxText Python notebooks without managing infrastructure. |
| 31 | + |
| 32 | +**⚠️ IMPORTANT NOTE ⚠️** |
| 33 | +The free tier of Google Colab provides access to a `v5e-1` TPU, but this access is not guaranteed and is subject to availability and usage limits. |
| 34 | + |
| 35 | +Before proceeding, please verify that the specific notebook you are running works reliably on the free-tier TPU resources. If you encounter frequent disconnections or resource limitations, you may need to: |
| 36 | + |
| 37 | +* Upgrade to a Colab Pro or Pro+ subscription for more stable and powerful TPU access. |
| 38 | + |
| 39 | +* Switch to the local Jupyter Lab setup (Method 2) on a more powerful TPU machine. |
| 40 | + |
| 41 | +### Step 1: Choose an Example |
| 42 | +1.a. Visit the [MaxText examples directory](https://github.com/AI-Hypercomputer/maxtext/tree/main/src/MaxText/examples) on GitHub. |
| 43 | + |
| 44 | +1.b. Find the notebook you want to run (e.g., `sft_qwen3_demo.ipynb`) and copy its URL. |
| 45 | + |
| 46 | +### Step 2: Import into Colab |
| 47 | +2.a. Go to [Google Colab](https://colab.research.google.com/) and sign in. |
| 48 | + |
| 49 | +2.b. Select **File** → **Open Notebook**. |
| 50 | + |
| 51 | +2.c. Select the **GitHub** tab. |
| 52 | + |
| 53 | +2.d. Paste the target `.ipynb` link you copied in step 1.b and press Enter. |
| 54 | + |
| 55 | +### Step 3: Enable TPU Runtime |
| 56 | + |
| 57 | +3.a. Select **Runtime** → **Change runtime type**. |
| 58 | + |
| 59 | +3.b. Select your desired **TPU** under **Hardware accelerator**. |
| 60 | + |
| 61 | +3.c. Click **Save**. |
| 62 | + |
| 63 | +### Step 4: Run the Notebook |
| 64 | +Follow the instructions within the notebook cells to install dependencies and run the training/inference. |
| 65 | + |
| 66 | +## Method 2: Local Jupyter Lab with TPU |
| 67 | + |
| 68 | +You can run the notebooks in a JupyterLab server on your own TPU VM and open it from your local browser over an SSH tunnel, giving you full control over your computing resources. |
| 69 | + |
| 70 | +### Step 1: Set Up TPU VM |
| 71 | + |
| 72 | +In Google Cloud Console: |
| 73 | + |
| 74 | +1.a. Go to **Compute Engine** → **TPU** → **Create TPU**. |
| 75 | + |
| 76 | +1.b. Example config: |
| 77 | + - **Name:** `maxtext-tpu-node` |
| 78 | + - **TPU type:** Choose your desired TPU type |
| 79 | + - **Runtime Version:** `tpu-ubuntu2204-base` (or other compatible runtime) |
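The console steps above can also be performed from the command line. The sketch below assembles an equivalent `gcloud compute tpus tpu-vm create` call and only prints it, so nothing is created until you run the printed command yourself. `YOUR_ZONE` and `YOUR_TPU_TYPE` are placeholders, not values from this guide; replace them with a zone and accelerator type for which you have quota. The variable names are illustrative.

```bash
# Dry-run sketch: build the TPU VM creation command without executing it.
# TPU_NAME and RUNTIME_VERSION come from the example config above;
# ZONE and ACCELERATOR_TYPE are placeholders you must replace.
TPU_NAME="maxtext-tpu-node"
ZONE="YOUR_ZONE"
ACCELERATOR_TYPE="YOUR_TPU_TYPE"
RUNTIME_VERSION="tpu-ubuntu2204-base"

CREATE_CMD="gcloud compute tpus tpu-vm create ${TPU_NAME} --zone=${ZONE} --accelerator-type=${ACCELERATOR_TYPE} --version=${RUNTIME_VERSION}"

# Print the command for review; run it once the placeholders are filled in.
echo "$CREATE_CMD"
```

Printing first makes it easy to check the zone and TPU type before any billable resource is created.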
| 80 | + |
| 81 | +### Step 2: Connect with Port Forwarding |
| 82 | +Run the following command on your local machine: |
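| 83 | +> **Note**: The `--` separator before the `-L` flag is required: everything after it is passed to `ssh`. The `-L 8888:localhost:8888` option forwards port 8888 on your local machine to port 8888 on the TPU VM through the encrypted SSH connection. |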
| 84 | + |
| 85 | +```bash |
| 86 | +gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L 8888:localhost:8888 |
| 87 | +``` |
| 88 | + |
| 89 | +> **Note**: A "bind: Address already in use" error means port 8888 is busy on your local computer. Change the first number to a different port, e.g., `-L 9999:localhost:8888`. You will then access Jupyter at `localhost:9999`. |
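To avoid editing the port in several places, the tunnel command can be parameterized. This is a minimal sketch: `LOCAL_PORT`, `REMOTE_PORT`, and `TUNNEL_CMD` are hypothetical variable names, and `YOUR_ZONE` is still a placeholder. It prints the command and the matching browser address rather than opening the connection.

```bash
# Local port to bind; change this if 8888 is already in use on your machine.
LOCAL_PORT=9999
# Port Jupyter listens on inside the TPU VM (see Step 4).
REMOTE_PORT=8888

TUNNEL_CMD="gcloud compute tpus tpu-vm ssh maxtext-tpu-node --zone=YOUR_ZONE -- -L ${LOCAL_PORT}:localhost:${REMOTE_PORT}"

echo "Run: $TUNNEL_CMD"
echo "Then open Jupyter at: http://localhost:${LOCAL_PORT}"
```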
| 90 | + |
| 91 | +### Step 3: Install Dependencies |
| 92 | + |
| 93 | +Run the following commands on your TPU VM: |
| 94 | + |
| 95 | +```bash |
| 96 | +sudo apt update && sudo apt upgrade -y |
| 97 | +sudo apt install python3-pip python3-dev git -y |
| 98 | +pip3 install jupyterlab |
| 99 | +``` |
| 100 | + |
| 101 | +### Step 4: Start Jupyter Lab |
| 102 | + |
| 103 | +```bash |
| 104 | +jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root |
| 105 | +``` |
| 106 | + |
| 107 | +### Step 5: Access the Notebook |
| 108 | +5.a. Look at the terminal output for a URL that looks like: `http://127.0.0.1:8888/lab?token=...` |
| 109 | + |
| 110 | +5.b. Copy that URL. |
| 111 | + |
| 112 | +5.c. Paste it into your **local computer's browser**. |
| 113 | + * **Important:** If you changed the port in Step 2 (e.g., to `9999`), you must manually replace `8888` in the URL with `9999`. |
| 114 | + * *Example:* `http://127.0.0.1:9999/lab?token=...` |
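If you forwarded a different local port, the URL printed by Jupyter can be rewritten mechanically instead of by hand. A small sketch, assuming the URL has the form shown in 5.a; `token=EXAMPLE` is a stand-in for the real token, and `JUPYTER_URL`, `LOCAL_PORT`, and `LOCAL_URL` are hypothetical variable names.

```bash
# URL as printed in the TPU VM terminal (token replaced by a stand-in).
JUPYTER_URL='http://127.0.0.1:8888/lab?token=EXAMPLE'
# Local port chosen in Step 2.
LOCAL_PORT=9999

# Replace only the port that directly follows the host name or address.
LOCAL_URL=$(printf '%s\n' "$JUPYTER_URL" | sed "s#^\(http://[^:/]*\):[0-9]*#\1:${LOCAL_PORT}#")

echo "$LOCAL_URL"
```

Anchoring the pattern to the start of the URL keeps any digits inside the token untouched.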
| 115 | + |
| 116 | + |
| 117 | +## Available Examples |
| 118 | + |
| 119 | +### Supervised Fine-Tuning (SFT) |
| 120 | + |
| 121 | +- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k) |
| 122 | +- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| 123 | + |
| 124 | +### Reinforcement Learning (GRPO/GSPO) Training |
| 125 | + |
| 126 | +- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k) |
| 127 | + |
| 128 | +## Common Pitfalls & Debugging |
| 129 | + |
| 130 | +| Issue | Solution | |
| 131 | +|-------|----------| |
| 132 | +| ❌ TPU runtime mismatch | Check that the TPU runtime version matches the VM image | |
| 133 | +| ❌ Colab disconnects | Save checkpoints to GCS or Drive regularly | |
| 134 | +| ❌ "RESOURCE_EXHAUSTED" errors | Use a smaller batch size, or a v5e-8 instead of a v5e-1 | |
| 135 | +| ❌ Firewall blocked | Ensure port 8888 is open, or use SSH tunneling as in Method 2, Step 2 | |
| 136 | +| ❌ Path confusion | In Colab use `/content/maxtext`; in TPU VM use `~/maxtext` | |
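The path difference in the last row can be resolved once at the top of a shell cell or script. This sketch assumes that the presence of a `/content` directory indicates a Colab runtime, which is a heuristic rather than an official check; `MAXTEXT_DIR` is a hypothetical variable name.

```bash
# Heuristic: Colab runtimes have a /content directory; a TPU VM normally does not.
if [ -d /content ]; then
  MAXTEXT_DIR="/content/maxtext"
else
  MAXTEXT_DIR="$HOME/maxtext"
fi

echo "Using MaxText checkout at: $MAXTEXT_DIR"
```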
| 137 | + |
| 138 | +## Support and Resources |
| 139 | + |
| 140 | +- 📘 [MaxText Documentation](https://maxtext.readthedocs.io/) |
| 141 | +- 💻 [Google Colab](https://colab.research.google.com) |
| 142 | +- ⚡ [Cloud TPU Docs](https://cloud.google.com/tpu/docs) |
| 143 | +- 🧩 [Jupyter Lab](https://jupyterlab.readthedocs.io) |
| 144 | + |
| 145 | +## Contributing |
| 146 | + |
| 147 | +If you encounter issues or have improvements for this guide, please: |
| 148 | + |
| 149 | +1. Open an issue on the MaxText repository |
| 150 | +2. Submit a pull request with your improvements |
| 151 | +3. Share your experience in the discussions |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +**Happy Training! 🚀** |