10 changes: 5 additions & 5 deletions docs/guides/run_python_notebook.md
@@ -69,9 +69,9 @@ You can run Python notebooks on a local JupyterLab environment, giving you full

### Step 1: Set Up TPU VM

-In Google Cloud Console:
+In Google Cloud Console, create a standalone TPU VM:

-1.a. **Compute Engine** → **TPU** → **Create TPU**
+1.a. **Compute Engine** → **TPUs** → **Create TPU**

1.b. Example config:
- **Name:** `maxtext-tpu-node`
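
As a CLI alternative to the console steps above, a minimal `gcloud` sketch — the zone, accelerator type, and runtime version are illustrative assumptions, not part of this change:

```bash
# Create the same standalone TPU VM from the command line.
# Zone, accelerator type, and runtime version are placeholders;
# pick values that match your project's quota and region.
gcloud compute tpus tpu-vm create maxtext-tpu-node \
  --zone=us-central2-b \
  --accelerator-type=v4-8 \
  --version=tpu-ubuntu2204-base
```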
@@ -118,12 +118,12 @@ jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

### Supervised Fine-Tuning (SFT)

-- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
-- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
+- **`sft_qwen3_demo.ipynb`** → Qwen3-0.6B SFT training and evaluation on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k). This notebook is beginner-friendly and runs successfully on Google Colab's free-tier v5e-1 TPU runtime.
+- **`sft_llama3_demo.ipynb`** → Llama3.1-8B SFT training on [Hugging Face ultrachat_200k dataset](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k). We recommend running this on a v5p-8 TPU VM using the port-forwarding method.

### Reinforcement Learning (GRPO/GSPO) Training

-- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k)
+- **`rl_llama3_demo.ipynb`** → GRPO/GSPO training on [OpenAI's GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k). We recommend running this on a v5p-8 TPU VM using the port-forwarding method.
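
The port-forwarding method referenced above tunnels JupyterLab's port from the TPU VM to your workstation. A minimal sketch, reusing the VM name from Step 1 (the zone is a placeholder):

```bash
# Forward local port 8888 to JupyterLab running on the TPU VM,
# so http://localhost:8888 on your workstation reaches the notebook.
gcloud compute tpus tpu-vm ssh maxtext-tpu-node \
  --zone=us-central2-b \
  -- -L 8888:localhost:8888
```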

## Common Pitfalls & Debugging

30 changes: 27 additions & 3 deletions docs/install_maxtext.md
@@ -122,7 +122,7 @@ seed-env \
--output-dir=generated_gpu_artifacts
```

-## 4. Update Project Files
+## Step 4: Update Project Files

After generating the new requirements, you need to update the files in the MaxText repository.

@@ -133,7 +133,7 @@ After generating the new requirements, you need to update the files in the MaxTe
2. **Update `extra_deps_from_github.txt` (if necessary):**
Currently, MaxText uses a few dependencies, such as `mlperf-logging` and `google-jetstream`, that are installed directly from GitHub source. These are defined in `base_requirements/requirements.txt`, and the `seed-env` tool will carry them over to the generated requirements files.

-## 5. Verify the New Dependencies
+## Step 5: Verify the New Dependencies

Finally, test that the new dependencies install correctly and that MaxText runs as expected.

@@ -155,4 +155,28 @@ uv pip install -e .[tpu] --resolution=lowest
install_maxtext_github_deps
```

-3. **Run tests:** Run MaxText tests to ensure there are no regressions.
+3. **Run tests:** Run MaxText tests to ensure there are no regressions.
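
For example, a smoke run of the test suite might look like the following; the test path and pytest invocation are assumptions about the repository layout, not part of this change:

```bash
# Run the unit tests from the repository root (test directory is assumed;
# adjust to the layout of the branch you are on).
python3 -m pytest tests/ -x -q
```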

+## Appendix: Install XPK for MaxText Multi-host Workloads
+
+> **_NOTE:_** XPK is only required for multi-host TPU configurations (e.g., v5p-128, v6e-256). For single-host training, XPK is not needed and you can run MaxText directly on your TPU VM.
+
+XPK (Accelerated Processing Kit) is a tool designed to simplify the orchestration and management of workloads on Google Kubernetes Engine (GKE) clusters with TPU or GPU accelerators. In MaxText, we use XPK to submit both pre-training and post-training jobs on multi-host TPU configurations.
+
+For your convenience, we provide a minimal installation path below:
+```bash
+# Directly install xpk using pip
+pip install xpk
+
+# Install kubectl
+sudo apt-get update
+sudo apt install snapd
+sudo snap install kubectl --classic
+
+# Install gke-gcloud-auth-plugin
+echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
+curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
+sudo apt update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
+```
+
+For detailed setup instructions and advanced features, please refer to the [official XPK documentation](https://github.com/AI-Hypercomputer/xpk).
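
As a quick illustration of how XPK is typically driven once installed, a hypothetical sketch — the cluster name, TPU type, zone, project, and training command below are placeholders, not part of this change:

```bash
# Provision a multi-host TPU cluster, then submit a MaxText job to it.
# All names and values are illustrative; see the XPK docs for full flags.
xpk cluster create --cluster=my-maxtext-cluster \
  --tpu-type=v5p-128 --num-slices=1 \
  --zone=us-east5-a --project=my-gcp-project

xpk workload create --cluster=my-maxtext-cluster \
  --workload=my-maxtext-job \
  --tpu-type=v5p-128 --num-slices=1 \
  --zone=us-east5-a --project=my-gcp-project \
  --command="python3 -m src.MaxText.train src/MaxText/configs/base.yml ..."
```
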
16 changes: 8 additions & 8 deletions docs/tutorials/posttraining/rl.md
@@ -29,7 +29,7 @@ For efficient model inference and response generation during this process, we re
Let's get started!

## Create virtual environment and Install MaxText dependencies
-If you have already completed the [MaxText installation](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), you can skip to the next section for post-training dependencies installations. Otherwise, please install `MaxText` using the following commands before proceeding.
+If you have already completed the [MaxText installation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html), you can skip to the next section for the post-training dependency installation. Otherwise, please install `MaxText` using the following commands before proceeding.
```bash
# 1. Clone the repository
git clone https://github.com/AI-Hypercomputer/maxtext.git
@@ -117,7 +117,7 @@ Run the following command for GRPO:
python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
model_name=${MODEL} \
tokenizer_path=${TOKENIZER} \
-load_parameters_path=${MAXTEXT_CKPT_PATH} \
+load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
run_name=${RUN_NAME} \
base_output_directory=${BASE_OUTPUT_DIRECTORY} \
hf_access_token=${HF_TOKEN}
@@ -136,12 +136,12 @@ Run the following command for GSPO:

```
python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
-model_name=llama3.1-8b \
-tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
-load_parameters_path=gs://path/to/checkpoint/0/items \
-run_name=$WORKLOAD \
-base_output_directory=$OUTPUT_PATH \
-hf_access_token=$HF_TOKEN \
+model_name=${MODEL} \
+tokenizer_path=${TOKENIZER} \
+load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
+run_name=${RUN_NAME} \
+base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+hf_access_token=${HF_TOKEN} \
loss_algo=gspo-token
```
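
For reference, the environment variables consumed by the GRPO and GSPO commands above could be set as follows. The model and tokenizer values mirror the previously hardcoded Llama3.1-8B ones that this change replaces; the paths, run name, and token are placeholders:

```bash
# Illustrative values only; adjust paths, run name, and token to your setup.
export MODEL=llama3.1-8b
export TOKENIZER=meta-llama/Llama-3.1-8B-Instruct
export MAXTEXT_CKPT_PATH=gs://path/to/checkpoint   # /0/items is appended by the command
export RUN_NAME=my-rl-run
export BASE_OUTPUT_DIRECTORY=gs://path/to/output
export HF_TOKEN=<your Hugging Face access token>
```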

5 changes: 3 additions & 2 deletions docs/tutorials/posttraining/rl_on_multi_host.md
@@ -29,7 +29,7 @@ For efficient model inference and response generation during this process, we re
Let's get started!

## Create virtual environment and Install MaxText dependencies
-Follow instructions in [Install MaxText](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), but
+Follow instructions in [Install MaxText](https://maxtext.readthedocs.io/en/latest/install_maxtext.html), but
recommend creating the virtual environment outside the `maxtext` directory.
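
A minimal sketch of that recommendation (the paths are placeholders):

```bash
# Keep the venv outside the repo checkout so packaging tools don't pick it up.
cd ~
python3 -m venv maxtext-venv
source maxtext-venv/bin/activate
```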


@@ -93,7 +93,7 @@ You can install the required dependencies using either of the following two opti
### Option 1: Installing stable releases of tunix and vllm-tpu
Run the following bash script to create a docker image with all the dependencies of MaxText, Tunix, vLLM and tpu-inference installed.

-In addition to MaxText dependencies, primarily, it installs `vllm-tpu` which is [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) and thereby providing TPU inference for vLLM, with unified JAX and PyTorch support.
+In addition to the MaxText dependencies, it primarily installs `vllm-tpu`, which bundles [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference), thereby providing TPU inference for vLLM with unified JAX and PyTorch support. This build process takes approximately 10 to 15 minutes.

```
bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training
@@ -109,6 +109,7 @@ bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training PO
```

### Upload the dependency docker image along with MaxText code
+> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access".
```
bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}
```
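
If you prefer the CLI to the Console flow described in the note above, a project administrator could grant the role along these lines — the project ID and account are placeholders:

```bash
# Grant the Artifact Registry Writer role to a user (illustrative values).
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="user:your-name@example.com" \
  --role="roles/artifactregistry.writer"
```
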
3 changes: 2 additions & 1 deletion docs/tutorials/posttraining/sft_on_multi_host.md
@@ -43,12 +43,13 @@ gcloud auth application-default login
gcloud auth configure-docker
docker run hello-world
```
-Then run the following command to create a local Docker image named `maxtext_base_image`.
+Then run the following command to create a local Docker image named `maxtext_base_image`. This build process takes approximately 10 to 15 minutes.
```bash
bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training
```

### 1.3. Upload the Docker image to Artifact Registry
+> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access".
```bash
# Replace `$USER_runner` with your desired image name
export DOCKER_IMAGE_NAME=${USER}_runner