Commit 8629e8b

More UXR fixes
XPK quick guides fix flags clarification amend fix
1 parent 2226c7d commit 8629e8b

4 files changed: +36 −12 lines


docs/install_maxtext.md (27 additions, 3 deletions)

@@ -122,7 +122,7 @@ seed-env \
 --output-dir=generated_gpu_artifacts
 ```
 
-## 4. Update Project Files
+## Step 4: Update Project Files
 
 After generating the new requirements, you need to update the files in the MaxText repository.
 
@@ -133,7 +133,7 @@ After generating the new requirements, you need to update the files in the MaxTe
 2. **Update `extra_deps_from_github.txt` (if necessary):**
 Currently, MaxText uses a few dependencies, such as `mlperf-logging` and `google-jetstream`, that are installed directly from GitHub source. These are defined in `base_requirements/requirements.txt`, and the `seed-env` tool will carry them over to the generated requirements files.
 
-## 5. Verify the New Dependencies
+## Step 5: Verify the New Dependencies
 
 Finally, test that the new dependencies install correctly and that MaxText runs as expected.
 
@@ -155,4 +155,28 @@ uv pip install -e .[tpu] --resolution=lowest
 install_maxtext_github_deps
 ```
 
-3. **Run tests:** Run MaxText tests to ensure there are no regressions.
+3. **Run tests:** Run MaxText tests to ensure there are no regressions.
+
+## Appendix: Install XPK for MaxText Multi-host Workloads
+
+> **_NOTE:_** XPK is only required for multi-host TPU configurations (e.g., v5p-128, v6e-256). For single-host training, XPK is not needed and you can run MaxText directly on your TPU VM.
+
+XPK (Accelerated Processing Kit) is a tool designed to simplify the orchestration and management of workloads on Google Kubernetes Engine (GKE) clusters with TPU or GPU accelerators. In MaxText, we use XPK to submit both pre-training and post-training jobs on multi-host TPU configurations.
+
+For your convenience, we provide a minimal installation path below:
+```bash
+# Directly install xpk using pip
+pip install xpk
+
+# Install kubectl
+sudo apt-get update
+sudo apt-get install snapd
+sudo snap install kubectl --classic
+
+# Install gke-gcloud-auth-plugin
+echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
+curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
+sudo apt update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
+```
+
+For detailed setup instructions and advanced features, please refer to the [official XPK documentation](https://github.com/AI-Hypercomputer/xpk).
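Once XPK is installed, multi-host jobs are submitted with `xpk workload create`. The sketch below assembles (and prints, rather than executes) such a command; the cluster name, workload name, and training command are hypothetical placeholders, and the exact flag set should be verified against the XPK documentation linked above.

```shell
# Hypothetical cluster and workload settings; replace with your own.
CLUSTER=my-maxtext-cluster
TPU_TYPE=v5p-128
TRAIN_CMD='python3 -m MaxText.train MaxText/configs/base.yml run_name=demo'

# Assemble an XPK workload submission command (printed here, not executed).
XPK_CMD="xpk workload create --workload maxtext-demo --cluster ${CLUSTER} --tpu-type=${TPU_TYPE} --num-slices=1 --command \"${TRAIN_CMD}\""
echo "${XPK_CMD}"
```

Running the printed command on a machine with `gcloud` credentials schedules the training job on the named GKE cluster.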

docs/tutorials/posttraining/rl.md (7 additions, 7 deletions)

@@ -117,7 +117,7 @@ Run the following command for GRPO:
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
 model_name=${MODEL} \
 tokenizer_path=${TOKENIZER} \
-load_parameters_path=${MAXTEXT_CKPT_PATH} \
+load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
 run_name=${RUN_NAME} \
 base_output_directory=${BASE_OUTPUT_DIRECTORY} \
 hf_access_token=${HF_TOKEN}
@@ -136,12 +136,12 @@ Run the following command for GSPO:
 
 ```
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
-model_name=llama3.1-8b \
-tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
-load_parameters_path=gs://path/to/checkpoint/0/items \
-run_name=$WORKLOAD \
-base_output_directory=$OUTPUT_PATH \
-hf_access_token=$HF_TOKEN \
+model_name=${MODEL} \
+tokenizer_path=${TOKENIZER} \
+load_parameters_path=${MAXTEXT_CKPT_PATH}/0/items \
+run_name=${RUN_NAME} \
+base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+hf_access_token=${HF_TOKEN} \
 loss_algo=gspo-token
 ```
 
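The GRPO/GSPO commands in this diff reference several shell variables. A minimal sketch of setting them before running either command follows; the model name and tokenizer path come from the removed GSPO lines above, while the bucket paths, run name, and token are hypothetical placeholders to substitute with your own values.

```shell
# Example values for the variables used by the GRPO/GSPO training commands.
# The GCS paths, run name, and HF_TOKEN below are placeholders.
export MODEL=llama3.1-8b
export TOKENIZER=meta-llama/Llama-3.1-8B-Instruct
export MAXTEXT_CKPT_PATH=gs://my-bucket/llama3.1-8b/checkpoints
export RUN_NAME=rl-demo-run
export BASE_OUTPUT_DIRECTORY=gs://my-bucket/rl-output
export HF_TOKEN=hf_your_token_here  # your Hugging Face access token
```

Note that `load_parameters_path` appends `/0/items` to `MAXTEXT_CKPT_PATH`, so the variable should point at the checkpoint directory itself, not the `items` subdirectory.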
docs/tutorials/posttraining/rl_on_multi_host.md (1 addition, 1 deletion)

@@ -93,7 +93,7 @@ You can install the required dependencies using either of the following two opti
 ### Option 1: Installing stable releases of tunix and vllm-tpu
 Run the following bash script to create a docker image with all the dependencies of MaxText, Tunix, vLLM and tpu-inference installed.
 
-In addition to MaxText dependencies, primarily, it installs `vllm-tpu` which is [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) and thereby providing TPU inference for vLLM, with unified JAX and PyTorch support.
+In addition to the MaxText dependencies, it primarily installs `vllm-tpu`, which combines [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) to provide TPU inference for vLLM with unified JAX and PyTorch support. This build process takes approximately 10 to 15 minutes.
 
 ```
 bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training

docs/tutorials/posttraining/sft_on_multi_host.md (1 addition, 1 deletion)

@@ -43,7 +43,7 @@ gcloud auth application-default login
 gcloud auth configure-docker
 docker run hello-world
 ```
-Then run the following command to create a local Docker image named `maxtext_base_image`.
+Then run the following command to create a local Docker image named `maxtext_base_image`. This build process takes approximately 10 to 15 minutes.
 ```bash
 bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training
 ```
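After the dependency-image build finishes, it can be useful to confirm that `maxtext_base_image` actually exists locally. A minimal, hypothetical check (not part of the documented workflow) that degrades gracefully when Docker is unavailable:

```shell
# Hypothetical post-build check: confirm the local maxtext_base_image exists.
if command -v docker >/dev/null 2>&1; then
  if [ -n "$(docker images -q maxtext_base_image 2>/dev/null)" ]; then
    CHECK_MSG="maxtext_base_image is present"
  else
    CHECK_MSG="maxtext_base_image not found; re-run the build script"
  fi
else
  CHECK_MSG="docker is not installed on this machine"
fi
echo "${CHECK_MSG}"
```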
