
Commit ed517cf

Merge pull request #2681 from AI-Hypercomputer:mohit/grpo_doc
PiperOrigin-RevId: 836820005
2 parents 6e9eb9d + 2b28c74 commit ed517cf

File tree: 5 files changed, +185 −55 lines changed


docs/tutorials/grpo.md

Lines changed: 52 additions & 6 deletions
@@ -46,18 +46,64 @@ Primarily, it installs `vllm-tpu` which is [vllm](https://github.com/vllm-projec
 
 You can also locally git clone [tunix](https://github.com/google/tunix) and install using the instructions [here](https://github.com/google/tunix?tab=readme-ov-file#installation). Similarly install [vllm](https://github.com/vllm-project/vllm) and [tpu-inference](https://github.com/vllm-project/tpu-inference) from source following the instructions [here](https://docs.vllm.ai/projects/tpu/en/latest/getting_started/installation/#install-from-source)
 
+## Set up environment variables
+
+Set up the following environment variables before running GRPO:
+
+```bash
+# -- Model configuration --
+export HF_MODEL='llama3.1-8b-Instruct'
+export MODEL='llama3.1-8b'
+export TOKENIZER='meta-llama/Llama-3.1-8B-Instruct'
+export HF_TOKEN=<Hugging Face access token>
+
+# -- MaxText configuration --
+export BASE_OUTPUT_DIRECTORY=<output directory to store run logs> # e.g., gs://my-bucket/my-output-directory
+
+export RUN_NAME=<name for this run> # e.g., $(date +%Y-%m-%d-%H-%M-%S)
+export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items
+```
+
+## Get your model checkpoint
+
+You can convert a Hugging Face checkpoint to MaxText format using the `src/MaxText/utils/ckpt_conversion/to_maxtext.py` script. This is useful if you have a pre-trained model from Hugging Face that you want to use with MaxText.
+
+First, ensure you have the necessary dependencies installed. Then, run the conversion script on a CPU machine. For large models, it is recommended to use the `--lazy_load_tensors` flag to reduce memory usage during conversion. This command will download the Hugging Face model and convert it to the MaxText format, saving it to the specified GCS bucket.
+
+```bash
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
+
+python3 -m MaxText.utils.ckpt_conversion.to_maxtext src/MaxText/configs/base.yml \
+  model_name=${HF_MODEL} \
+  hf_access_token=${HF_TOKEN} \
+  base_output_directory=${MAXTEXT_CKPT_PATH} \
+  scan_layers=True hardware=cpu skip_jax_distributed_system=true
+
+# Example: converting Llama3.1-70B with --lazy_load_tensors=true, which uses around 86 GB of RAM
+
+python3 -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
+  model_name=llama3.1-70b \
+  hf_access_token=${HF_TOKEN} \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME} \
+  scan_layers=True \
+  hardware=cpu skip_jax_distributed_system=true \
+  --lazy_load_tensors=true
+```
+
+
+
 ## Run GRPO
 
 Finally, run the command
 
 ```
 python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
-  model_name=llama3.1-8b \
-  tokenizer_path=meta-llama/Llama-3.1-8B-Instruct \
-  load_parameters_path=gs://path/to/checkpoint/0/items \
-  run_name=$WORKLOAD \
-  base_output_directory=$OUTPUT_PATH \
-  hf_access_token=$HF_TOKEN
+  model_name=${MODEL} \
+  tokenizer_path=${TOKENIZER} \
+  load_parameters_path=${MAXTEXT_CKPT_PATH} \
+  run_name=${RUN_NAME} \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
+  hf_access_token=${HF_TOKEN}
 ```
 
 The overview of what this run will do is as follows:
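
Before launching the GRPO run, it can be worth confirming that the conversion step actually wrote a checkpoint. A minimal sanity check, assuming `gsutil` is installed and `BASE_OUTPUT_DIRECTORY` points at a `gs://` bucket (the exact subdirectory layout, e.g. the `0/items` step directory, is whatever the converter produced):

```bash
# Optional check: list what the conversion step wrote before pointing
# load_parameters_path at it. Assumes gsutil and a gs:// output directory.
gsutil ls -r "${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}" | head -n 20
```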

docs/tutorials/grpo_with_pathways.md

Lines changed: 55 additions & 9 deletions
@@ -29,6 +29,50 @@ Furthermore, we use Pathways for [orchestration](https://cloud.google.com/ai-hyp
 Follow instructions in [Install MaxText](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/install_maxtext.md), but
 we recommend creating the virtual environment outside the `maxtext` directory.
 
+
+## Set up environment variables
+
+Set up the following environment variables before running GRPO:
+
+```bash
+# -- Model configuration --
+export HF_MODEL='llama3.1-70b-Instruct'
+export MODEL='llama3.1-70b'
+export TOKENIZER='meta-llama/Llama-3.1-70B-Instruct'
+export HF_TOKEN=<Hugging Face access token>
+
+# -- MaxText configuration --
+export BASE_OUTPUT_DIRECTORY=<output directory to store run logs> # e.g., gs://my-bucket/my-output-directory
+export RUN_NAME=llama-3-70b-grpo
+export MAXTEXT_CKPT_PATH=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/0/items
+
+# -- Workload configuration --
+export WORKLOAD=${RUN_NAME}
+export TPU_TYPE='v5p-128'
+export TPU_CLUSTER=<cluster name>
+export PROJECT_ID=<GCP project ID>
+export ZONE=<zone name>
+```
+
+## Get your model checkpoint
+
+You can convert a Hugging Face checkpoint to MaxText format using the `src/MaxText/utils/ckpt_conversion/to_maxtext.py` script. This is useful if you have a pre-trained model from Hugging Face that you want to use with MaxText.
+
+First, ensure you have the necessary dependencies installed. Then, run the conversion script on a CPU machine. For large models, it is recommended to use the `--lazy_load_tensors` flag to reduce memory usage during conversion. \
+For example, converting a Llama3.1-70B scanned checkpoint using `--lazy_load_tensors=true` uses around 200 GB of RAM and completes in about 10 minutes. This command will download the Hugging Face model and convert it to the MaxText format, saving it to the specified GCS bucket.
+
+```bash
+python3 -m pip install torch --index-url https://download.pytorch.org/whl/cpu
+
+# Using --lazy_load_tensors=true here reduces memory usage, e.g. Llama3.1-70B conversion takes around 86 GB of RAM
+python3 -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
+  model_name=${HF_MODEL} \
+  hf_access_token=${HF_TOKEN} \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY}/${RUN_NAME} \
+  scan_layers=true checkpoint_storage_use_ocdbt=false checkpoint_storage_use_zarr3=false \
+  skip_jax_distributed_system=true --lazy_load_tensors=true
+```
+
 ## Build and Upload MaxText Docker Image with Tunix, vLLM, tpu-inference dependencies
 
 ### Installing stable releases of tunix and vllm-tpu
@@ -45,28 +89,30 @@ You can also use `bash dependencies/scripts/docker_build_dependency_image.sh MOD
 ### Install from locally git cloned repos
 
 You can also locally git clone [tunix](https://github.com/google/tunix), [tpu-inference](https://github.com/vllm-project/tpu-inference), [vllm](https://github.com/vllm-project/vllm.git) and then use the following command to build a docker image using them:
-`bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training POST_TRAINING_SOURCE=local`
+```
+bash dependencies/scripts/docker_build_dependency_image.sh MODE=post-training POST_TRAINING_SOURCE=local
+```
 
 ### Upload the dependency docker image along with MaxText code
 ```
-bash docker_upload_runner.sh CLOUD_IMAGE_NAME=path/to/gcr.io
+bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME}
 ```
 
 ### Submit your jobs
 
 Please create a Pathways-ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster), and you can submit the `train_rl.py` script via [XPK](https://github.com/AI-Hypercomputer/xpk)
 ```
 xpk workload create-pathways --workload $WORKLOAD \
-  --docker-image path/to/gcr.io:latest --cluster $TPU_CLUSTER \
+  --docker-image <path/to/gcr.io> --cluster $TPU_CLUSTER \
   --tpu-type=$TPU_TYPE --num-slices=1 --zone=$ZONE \
   --project=$PROJECT_ID --priority=high \
-  --command "HF_TOKEN=$HF_TOKEN TF_CPP_MIN_LOG_LEVEL=0 JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE='1' # Llama3.1-70B-Instruct
+  --command "TF_CPP_MIN_LOG_LEVEL=0 JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 ENABLE_PATHWAYS_PERSISTENCE='1' \
   python3 -m src.MaxText.rl.train_rl src/MaxText/configs/rl.yml \
-  model_name=llama3.1-70b \
-  tokenizer_path=meta-llama/Llama-3.1-70B-Instruct \
-  load_parameters_path=gs://path/to/checkpoint/0/items \
-  run_name=$WORKLOAD \
-  base_output_directory=$OUTPUT_PATH \
+  model_name=${MODEL} \
+  tokenizer_path=${TOKENIZER} \
+  load_parameters_path=${MAXTEXT_CKPT_PATH} \
+  run_name=${RUN_NAME} \
+  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
   hf_access_token=$HF_TOKEN"
 ```
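
Once the workload is submitted, it can be monitored and torn down with the usual XPK subcommands. A brief sketch, assuming a recent xpk version that provides `workload list` and `workload delete` and the same environment variables defined above:

```bash
# Check the status of the submitted workload.
xpk workload list --cluster ${TPU_CLUSTER} --project ${PROJECT_ID} --zone ${ZONE}

# Delete the workload once the run has finished or if it needs to be resubmitted.
xpk workload delete --workload ${WORKLOAD} --cluster ${TPU_CLUSTER} \
  --project ${PROJECT_ID} --zone ${ZONE}
```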

src/MaxText/pyconfig.py

Lines changed: 8 additions & 0 deletions
@@ -18,6 +18,7 @@
 import os
 import sys
 from typing import Any
+import copy
 
 import jax
 import jax.numpy as jnp
@@ -151,6 +152,13 @@ def __init__(self, pydantic_config: types.MaxTextConfig):
 
     object.__setattr__(self, "_flat_config", final_dict)
 
+  def __deepcopy__(self, memo):
+    new_pydantic_config = copy.deepcopy(self._pydantic_config, memo)
+    return HyperParameters(new_pydantic_config)
+
+  def tree_flatten(self):
+    return (), self
+
   def __getattr__(self, attr: str) -> Any:
     """Provides attribute-style access to the final configuration dictionary."""
     if attr in self._flat_config:
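
For context on the two additions: `__deepcopy__` rebuilds the wrapper from a deep copy of the underlying pydantic config, and a `tree_flatten` that returns `((), self)` matches the shape JAX's pytree protocol expects when an object should be carried as static auxiliary data with no array leaves. The sketch below illustrates that convention with a toy config class; it shows the pattern only and is not MaxText's actual registration code.

```python
import copy
import dataclasses

import jax


# Toy stand-in for a config object whose tree_flatten() returns ((), self):
# no array leaves, and the object itself rides along as static aux_data.
@dataclasses.dataclass(frozen=True)
class TinyConfig:
  learning_rate: float = 3e-4

  def tree_flatten(self):
    return (), self


jax.tree_util.register_pytree_node(
    TinyConfig,
    lambda cfg: cfg.tree_flatten(),       # children = (), aux_data = the config
    lambda aux_data, children: aux_data,  # unflatten: just return the stored config
)

cfg = TinyConfig()
assert jax.tree_util.tree_leaves(cfg) == []  # the config contributes no traced leaves
cfg_copy = copy.deepcopy(cfg)                # mirrors the new __deepcopy__ behaviour
```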

src/MaxText/utils/ckpt_conversion/to_maxtext.py

Lines changed: 35 additions & 7 deletions
@@ -279,19 +279,34 @@ def __repr__(self):
 
 
 class LazyTensorHandler(type_handlers.NumpyHandler):
-  """Custom Orbax handler for LazyTensor to avoid typestr collision with np.ndarray."""
+  """
+  Custom Orbax handler for LazyTensor.
+
+  It masquerades as a standard NumpyHandler so that the resulting checkpoint
+  has the standard 'array_metadatas' structure and can be loaded by
+  standard MaxText instances.
+  """
 
-  def typestr(self):
-    return "LazyTensor"
+  async def serialize(self, value, *args, **kwargs):
+    # MATERIALIZE: Trigger the lazy load (__array__) explicitly before saving.
+    # This ensures the parent NumpyHandler receives a real np.ndarray.
+    if hasattr(value, "__array__"):
+      value = np.array(value)
+
+    return await super().serialize(value, *args, **kwargs)
 
 
 # Register LazyTensor with the custom handler.
 # It's safe to register this globally even if eager loading is used.
-type_handlers.register_type_handler(LazyTensor, LazyTensorHandler())
+type_handlers.register_type_handler(LazyTensor, LazyTensorHandler(), override=True)
 
 
 def _build_multi_axis_stacked_tensor(
-    hf_source_keys: List[List[str]], tensor_getter_fn: Callable[[str], np.ndarray], hook_fns: Any
+    hf_source_keys: List[List[str]],
+    tensor_getter_fn: Callable[[str], np.ndarray],
+    hook_fns: Any,
+    target_shape: tuple,
+    config,
 ) -> np.ndarray:
   """Builds a MaxText tensor by stacking HF weights along two axes (experts and layers).
 
@@ -303,18 +318,24 @@ def _build_multi_axis_stacked_tensor(
       Outer list iterates experts, inner list iterates layers.
     tensor_getter_fn: A callable that takes a HF key and returns the tensor (as numpy array).
     hook_fns: The hook function(s) to apply to each individual weight.
+    target_shape: The final shape of the target MaxText tensor.
+    config: The MaxText pyconfig object.
 
   Returns:
     The final, assembled NumPy array for the MaxText parameter.
   """
   all_expert_tensors = []
+  # The hook function needs the shape of an individual slice, not the full stacked tensor.
+  # For multi-axis stacking (experts, layers, ...), the slice shape is target_shape[2:]
+  mt_slice_shape = target_shape[2:]
+
   # Outer loop iterates through experts
   for layer_keys_for_expert in hf_source_keys:
     layer_tensors_for_expert = []
     # Inner loop iterates through layers for the current expert
     for hf_key_single in layer_keys_for_expert:
      hf_tensor_numpy = tensor_getter_fn(hf_key_single)
-      processed_hf_tensor = apply_hook_fns(hf_tensor_numpy, None, hook_fns)
+      processed_hf_tensor = apply_hook_fns(hf_tensor_numpy, mt_slice_shape, hook_fns)
       layer_tensors_for_expert.append(processed_hf_tensor)
     all_expert_tensors.append(np.stack(layer_tensors_for_expert, axis=0))
   return np.stack(all_expert_tensors, axis=0)
@@ -514,7 +535,14 @@ def _loader(getter, key, shape, hook):
   # Stacked mapping
   if isinstance(hf_source_keys_or_key[0], list):
     # Case 2: Multi-Axis Stacked
-    load_fn = partial(_build_multi_axis_stacked_tensor, hf_source_keys_or_key, tensor_getter, hook_fn)
+    load_fn = partial(
+        _build_multi_axis_stacked_tensor,
+        hf_source_keys_or_key,
+        tensor_getter,
+        hook_fn,
+        mt_target_shape_final,
+        config,
+    )
   else:
     # Case 3: Single-Axis Stacked
     load_fn = partial(
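
For readers unfamiliar with the contract that `LazyTensorHandler.serialize` relies on: an object implementing numpy's `__array__` protocol stays unmaterialized until `np.array(value)` is called on it. The toy class below is an illustration of that deferred-load behaviour only, not the MaxText `LazyTensor` class.

```python
import numpy as np


class TinyLazyTensor:
  """Toy lazy wrapper: defers loading until something asks for a real ndarray."""

  def __init__(self, loader):
    self._loader = loader  # zero-argument callable that actually reads the weight

  def __array__(self, dtype=None, copy=None):
    data = np.asarray(self._loader())  # materialization happens here
    return data.astype(dtype) if dtype is not None else data


lazy = TinyLazyTensor(lambda: np.ones((2, 2), dtype=np.float32))
dense = np.array(lazy)  # triggers __array__, just as the serialize() override does
print(dense.shape)      # (2, 2)
```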

tests/train_using_ragged_dot_smoke_test.py

Lines changed: 35 additions & 33 deletions
@@ -37,39 +37,41 @@ class Train(parameterized.TestCase):
   def test_tiny_config(self, quantization: str):
     test_tmpdir = os.environ.get("TEST_TMPDIR", gettempdir())
     outputs_dir = os.environ.get("TEST_UNDECLARED_OUTPUTS_DIR", test_tmpdir)
-    train_main([
-        None,
-        os.path.join(MAXTEXT_PKG_DIR, "configs", "base.yml"),
-        f"base_output_directory={test_tmpdir}",
-        "run_name=ragged_dot_smoke_test",
-        "base_emb_dim=128",
-        "base_num_query_heads=4",
-        "base_num_kv_heads=4",
-        "base_mlp_dim=128",
-        "base_moe_mlp_dim=128",
-        "base_num_decoder_layers=8",
-        "head_dim=128",
-        # TODO(b/441100085): When changing the decoder_block we might
-        # need to adjust the tiling.
-        "decoder_block=deepseek",
-        "attention_type=mla",
-        "num_experts=2",
-        # Enable sparse_matmul.
-        "sparse_matmul=True",
-        # Enable ragged_dot.
-        "megablox=False",
-        f'quantization="{quantization}"',
-        "use_qwix_quantization=True",
-        "per_device_batch_size=2",
-        "max_target_length=1024",
-        "dataset_type=synthetic",
-        "steps=10",
-        "enable_checkpointing=False",
-        "enable_goodput_recording=False",
-        "enable_checkpoint_cloud_logger=False",
-        "monitor_goodput=False",
-        f"metrics_file={os.path.join(outputs_dir, 'metrics.json')}",
-    ])
+    train_main(
+        [
+            None,
+            os.path.join(MAXTEXT_PKG_DIR, "configs", "base.yml"),
+            f"base_output_directory={test_tmpdir}",
+            "run_name=ragged_dot_smoke_test",
+            "base_emb_dim=128",
+            "base_num_query_heads=4",
+            "base_num_kv_heads=4",
+            "base_mlp_dim=128",
+            "base_moe_mlp_dim=128",
+            "base_num_decoder_layers=8",
+            "head_dim=128",
+            # TODO(b/441100085): When changing the decoder_block we might
+            # need to adjust the tiling.
+            "decoder_block=deepseek",
+            "attention_type=mla",
+            "num_experts=2",
+            # Enable sparse_matmul.
+            "sparse_matmul=True",
+            # Enable ragged_dot.
+            "megablox=False",
+            f'quantization="{quantization}"',
+            "use_qwix_quantization=True",
+            "per_device_batch_size=2",
+            "max_target_length=1024",
+            "dataset_type=synthetic",
+            "steps=10",
+            "enable_checkpointing=False",
+            "enable_goodput_recording=False",
+            "enable_checkpoint_cloud_logger=False",
+            "monitor_goodput=False",
+            f"metrics_file={os.path.join(outputs_dir, 'metrics.json')}",
+        ]
+    )
 
 
 if __name__ == "__main__":
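
To exercise the reformatted smoke test locally, either invocation below should work, assuming MaxText and its test dependencies are installed; the test runs 10 steps on synthetic data with a tiny DeepSeek-style MoE config, so it is intended as a quick check.

```bash
# Run through pytest, which collects the parameterized quantization cases...
python3 -m pytest tests/train_using_ragged_dot_smoke_test.py -q

# ...or run the module directly via its own test main.
python3 tests/train_using_ragged_dot_smoke_test.py
```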
