Commit c836743 ("amend")

1 parent 8c32897

4 files changed: 7 additions, 31 deletions


docs/install_maxtext.md

Lines changed: 0 additions & 24 deletions

````diff
@@ -156,27 +156,3 @@ install_maxtext_github_deps
 ```
 
 3. **Run tests:** Run MaxText tests to ensure there are no regressions.
-
-## Appendix: Install XPK for MaxText Multi-host Workloads
-
-> **_NOTE:_** XPK is only required for multi-host TPU configurations (e.g., v5p-128, v6e-256). For single-host training, XPK is not needed and you can run MaxText directly on your TPU VM.
-
-XPK (Accelerated Processing Kit) is a tool designed to simplify the orchestration and management of workloads on Google Kubernetes Engine (GKE) clusters with TPU or GPU accelerators. In MaxText, we use XPK to submit both pre-training and post-training jobs on multi-host TPU configurations.
-
-For your convenience, we provide a minimal installation path below:
-```bash
-# Directly install xpk using pip
-pip install xpk
-
-# Install kubectl
-sudo apt-get update
-sudo apt install snapd
-sudo snap install kubectl --classic
-
-# Install gke-gcloud-auth-plugin
-echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
-curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
-sudo apt update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
-```
-
-For detailed setup instructions and advanced features, please refer to the [official XPK documentation](https://github.com/AI-Hypercomputer/xpk).
````
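The removed appendix installs three tools (`xpk`, `kubectl`, and the GKE auth plugin). A quick sanity check that they all landed on PATH can be sketched as follows; this is a minimal illustration that only reports, it does not install anything:

```shell
# Report whether each tool from the appendix is on PATH.
# `command -v` is POSIX-portable; nothing is executed beyond the lookup.
status=""
for tool in xpk kubectl gke-gcloud-auth-plugin; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="${status} ${tool}=found"
  else
    status="${status} ${tool}=missing"
  fi
done
echo "${status}"
```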

docs/tutorials/posttraining/rl.md

Lines changed: 5 additions & 5 deletions

````diff
@@ -78,7 +78,7 @@ export HF_TOKEN=<Hugging Face access token>
 export BASE_OUTPUT_DIRECTORY=<output directory to store run logs> # e.g., gs://my-bucket/my-output-directory
 
 export RUN_NAME=<name for this run> # e.g., $(date +%Y-%m-%d-%H-%M-%S)
-export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items
+export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items # Actual checkpoint saved with an extra /0/items path suffix
 ```
 
 ## Get your model checkpoint
@@ -93,7 +93,7 @@ python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
 python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml \
 model_name=${HF_MODEL} \
 hf_access_token=${HF_TOKEN} \
-base_output_directory=${MAXTEXT_CKPT_PATH} \
+base_output_directory=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME} \
 scan_layers=True hardware=cpu skip_jax_distributed_system=true
 
 # Example of converting Llama3.1-70B using --lazy_load_tensor=true which uses around 86GB of RAM
@@ -117,7 +117,7 @@ Run the following command for GRPO:
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 model_name=${MODEL} \
 tokenizer_path=${TOKENIZER} \
-load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
+load_parameters_path=${MAXTEXT_CKPT_PATH} \
 run_name=${RUN_NAME} \
 base_output_directory=${BASE_OUTPUT_DIRECTORY} \
 hf_access_token=${HF_TOKEN}
@@ -138,7 +138,7 @@ Run the following command for GSPO:
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 model_name=${MODEL} \
 tokenizer_path=${TOKENIZER} \
-load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
+load_parameters_path=${MAXTEXT_CKPT_PATH} \
 run_name=${RUN_NAME} \
 base_output_directory=${BASE_OUTPUT_DIRECTORY} \
 hf_access_token=${HF_TOKEN} \
@@ -147,7 +147,7 @@ python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 
 The overview of what this run will do is as follows:
 
-1. We load a policy model and a reference model. Both are copies of `Llama3.1-8b-Instruct`.
+1. We load a policy model and a reference model. Both are copies of the model checkpoint you specified (e.g., `Llama3.1-8b-Instruct`).
 2. Evaluate the policy model's performance on GSM8K math reasoning benchmark.
 3. Train the policy model using GSPO.
 4. Evaluate the policy model's performance on GSM8K math reasoning benchmark after the post-training with GSPO.
````
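The path fix above follows one convention: the conversion step is pointed at the run directory, while the converted checkpoint itself lands under an extra `/0/items` suffix, so that suffix must appear exactly once in `load_parameters_path`. A minimal sketch (the bucket and run name below are example assumptions):

```shell
# Example values only; real runs set these to your own bucket and run name.
BASE_OUTPUT_DIRECTORY=gs://my-bucket/my-output-directory
RUN_NAME=2024-01-01-00-00-00

# to_maxtext writes into the run directory...
CONVERSION_OUTPUT_DIR=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}
# ...and the checkpoint itself is saved under an extra /0/items suffix,
# which is why MAXTEXT_CKPT_PATH now bakes the suffix in once.
MAXTEXT_CKPT_PATH=${CONVERSION_OUTPUT_DIR}/0/items

echo "${MAXTEXT_CKPT_PATH}"
```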

docs/tutorials/posttraining/rl_on_multi_host.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -116,7 +116,7 @@ bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE
 
 ## Submit your RL workload via Pathways
 
-Please create a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster), and you can submit the `train_rl.py` script via [XPK](https://github.com/AI-Hypercomputer/xpk).
+Please create a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster), and you can submit the `train_rl.py` script via [XPK](https://github.com/AI-Hypercomputer/xpk). We also provide a quick guide for XPK installation and usage [here](https://maxtext.readthedocs.io/en/latest/run_maxtext/run_maxtext_via_xpk.html).
 
 ### Submit GRPO workload
 ```
````
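A submission along the lines described above can be sketched as follows. This only composes and prints the `xpk workload create` command rather than executing it; the cluster, TPU type, and workload names are placeholder assumptions, and the flag names follow the XPK README (Pathways clusters may need additional Pathways-specific options; see the XPK docs):

```shell
# Placeholder values; substitute your own cluster and image.
CLUSTER=my-pathways-cluster                  # assumption: Pathways-ready GKE cluster
TPU_TYPE=v5p-128                             # assumption: a multi-host TPU topology
DOCKER_IMAGE=${CLOUD_IMAGE_NAME:-my-runner-image}
WORKLOAD=rl-grpo-run

# Build the command so it can be reviewed before running.
CMD="xpk workload create --workload ${WORKLOAD} --cluster ${CLUSTER} \
--tpu-type=${TPU_TYPE} --num-slices=1 --docker-image=${DOCKER_IMAGE} \
--command 'python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml'"

echo "${CMD}"
```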

docs/tutorials/posttraining/sft_on_multi_host.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -58,7 +58,7 @@ bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=$DOCKER_IMAGE
 The `docker_upload_runner.sh` script uploads your Docker image to Artifact Registry.
 
 ## 2. Install XPK
-Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation-via-pip).
+Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk?tab=readme-ov-file#installation-via-pip). We also provide a quick guide for XPK installation and usage [here](https://maxtext.readthedocs.io/en/latest/run_maxtext/run_maxtext_via_xpk.html).
 
 ## 3. Create GKE cluster
 Use a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster).
````
