
Commit 3d51c99

Merge pull request #2493 from AI-Hypercomputer:hengtaoguo-doc
PiperOrigin-RevId: 825585112
2 parents 64d6d9b + bcd2f17 commit 3d51c99

File tree: 5 files changed, +353 −0 lines changed
docs/_static/multimodal_overview.png (binary image, 189 KB)

docs/guides.md

Lines changed: 1 addition & 0 deletions

@@ -40,4 +40,5 @@ guides/checkpointing_solutions/multi_tier_checkpointing.md
  guides/jax_ai_libraries_chosen.md
  guides/xprof_user_guide.md
  guides/megascale_hang_playbook.md
+ guides/multimodal.md
  ```

docs/guides/multimodal.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# Multimodal Support on MaxText

This document provides a guide to using the multimodal functionality in MaxText, including:
- **Checkpoint Conversion**: Convert a HuggingFace checkpoint into a MaxText-compatible Orbax checkpoint.
- **Multimodal Decode**: Run inference with text and images as input.
- **Supervised Fine-Tuning (SFT)**: Apply SFT to the model using a visual-question-answering dataset.

The following table lists the models and modalities we currently support:

| Models | Input Modalities | Output Modalities |
| :---- | :---- | :---- |
| - Gemma3-4B/12B/27B<br>- Llama4-Scout/Maverick | Text, images | Text |

## Introduction

Multimodal Large Language Models (LLMs) extend traditional text-only models by incorporating additional input modalities such as images, audio, and video. For each non-text modality, the architecture typically follows a three-stage pipeline:
- **Data Preprocessing**: Modality-specific preprocessing steps prepare the raw input data (e.g., image resizing and normalization), transforming it into a format the neural network can consume.
- **Modality-Specific Encoders**: Modality-specific encoders transform the preprocessed data into high-dimensional representations (e.g., vision transformers for images).
- **Projection and Merge**: Projection layers map these modality-specific embeddings into the shared embedding space of the language model, usually aligned with the dimension of the text embeddings. The projected embeddings are then merged with the text token embeddings, allowing the unified model to process and reason over multiple modalities within a single coherent framework (see the sketch after this list).

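To make the projection-and-merge step concrete, here is a minimal NumPy sketch. It is not MaxText code: the dimensions, the `W_proj` matrix, and the placeholder token id are illustrative assumptions. It shows image embeddings being projected to the text embedding width and swapped in at the placeholder positions before the sequence reaches the decoder:

```python
import numpy as np

D_VISION, D_TEXT = 1152, 2560          # hypothetical encoder / decoder widths
IMAGE_PLACEHOLDER_ID = 262144          # hypothetical image-placeholder token id

rng = np.random.default_rng(0)
W_proj = rng.normal(size=(D_VISION, D_TEXT)) * 0.02   # stand-in projection layer

def merge_embeddings(token_ids, text_embeds, image_embeds):
  """Replace placeholder-token embeddings with projected image embeddings."""
  projected = image_embeds @ W_proj                    # [num_image_tokens, D_TEXT]
  merged = text_embeds.copy()
  positions = np.where(token_ids == IMAGE_PLACEHOLDER_ID)[0]
  merged[positions] = projected[: len(positions)]
  return merged                                        # sequence fed to the LLM decoder

token_ids = np.array([2, 1037, IMAGE_PLACEHOLDER_ID, 2064])
text_embeds = rng.normal(size=(len(token_ids), D_TEXT))
image_embeds = rng.normal(size=(1, D_VISION))          # one image token, for brevity
print(merge_embeddings(token_ids, text_embeds, image_embeds).shape)   # (4, 2560)
```

In the real models, one image expands to many placeholder tokens (see the token counts in the recommendations at the end of this guide); the single token here is only for brevity.
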
<img src="../_static/multimodal_overview.png" alt="Illustration of multimodal MaxText." width="60%">
*Figure 1: Overview of multimodal dataflow in MaxText.*

## Checkpoint Conversion

We recently onboarded a centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace (see the tool's README). This tool is used for the Gemma3 model family. Use the following command to convert an unscanned checkpoint from HuggingFace to MaxText and save it to `MAXTEXT_CKPT_GCS_PATH`:

```shell
export HF_ACCESS_TOKEN=hf_...
export MAXTEXT_CKPT_GCS_PATH=gs://...
python -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
    model_name=gemma3-4b \
    hf_access_token=$HF_ACCESS_TOKEN \
    base_output_directory=$MAXTEXT_CKPT_GCS_PATH \
    use_multimodal=true \
    scan_layers=false
```

For the Llama4 model family, we currently use a separate checkpoint conversion script (note: we will gradually migrate all checkpoint conversion scripts to the consolidated tool above):

```shell
export LOCAL_HF_MODEL_PATH=...  # Pre-download the safetensors from HuggingFace
export MAXTEXT_CKPT_GCS_PATH=gs://...
python -m MaxText.llama4_ckpt_unscanned \
    --model-size=llama4-17b-16e \
    --huggingface-checkpoint=True \
    --base-model-path=$LOCAL_HF_MODEL_PATH \
    --maxtext-model-path=$MAXTEXT_CKPT_GCS_PATH
```

## Multimodal Decode

MaxText supports multimodal decoding, allowing you to input text along with one or more images and receive text output. To use this feature, you need three main settings:
- `use_multimodal=True`: Initializes the multimodal preprocessing steps and network components.
- `prompt`: Specifies the position of image placeholder tokens in your input. If you don't place them manually, MaxText automatically appends the required placeholder (e.g., `<start_of_image>` for Gemma3, `<|image|>` for Llama4). The exact placeholder is listed under the `image_placeholder` field in each model's configuration file.
- `image_path`: The path(s) to the image file(s) MaxText will load and process.

Because each model uses its own native chat template from pretraining, we implement these templates in `multimodal_utils.py` and apply them directly to your prompt.

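As a rough illustration of what happens to the prompt, here is a small Python sketch. The helper is hypothetical, not the `multimodal_utils.py` API; the placeholder and template strings match the Gemma3 decode output shown below. It appends a missing placeholder and wraps the prompt in the Gemma3-style chat template:

```python
# Illustrative only: a simplified stand-in for the template handling in MaxText.
GEMMA3_IMAGE_PLACEHOLDER = "<start_of_image>"

def build_gemma3_prompt(prompt: str, num_images: int) -> str:
  # Append one placeholder per image if the user did not position them manually.
  missing = num_images - prompt.count(GEMMA3_IMAGE_PLACEHOLDER)
  if missing > 0:
    prompt = prompt + " " + " ".join([GEMMA3_IMAGE_PLACEHOLDER] * missing)
  # Wrap in the Gemma3 chat template (compare with the decode output below).
  return f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"

print(build_gemma3_prompt("Describe image", num_images=1))
```
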
To run a forward pass and verify the model's output, use the following command:

```shell
# Gemma3 decode
python -m MaxText.decode \
    MaxText/configs/base.yml \
    model_name=gemma3-4b \
    hf_access_token=$HF_ACCESS_TOKEN \
    tokenizer_path=assets/tokenizer.gemma3 \
    load_parameters_path=$MAXTEXT_CKPT_GCS_PATH/0/items \
    per_device_batch_size=1 \
    run_name=ht_test \
    max_prefill_predict_length=272 \
    max_target_length=300 \
    steps=1 \
    async_checkpointing=false \
    scan_layers=false \
    use_multimodal=true \
    prompt='Describe image <start_of_image>' \
    image_path='MaxText/test_assets/test_image.jpg' \
    attention='dot_product'
```

The decoding results will look like this:
```
Input `<start_of_turn>user
Describe image <start_of_image><end_of_turn>
<start_of_turn>model
` -> `Here's a description of the image:

**Overall Impression:** The image is a bright, expansive cityscape view of Seattle, Washington, with`
```

To decode with multiple images at once, you can provide multiple image paths like this:

```
python -m MaxText.decode \
    MaxText/configs/base.yml \
    model_name=gemma3-4b \
    ... \
    image_path=/path/to/image1.jpg,/path/to/image2.jpg \
    prompt="Describe each image in a short sentence."  # <start_of_image> will be added to the prompt if not provided
    # or prompt="Describe each image in a short sentence: <start_of_image> and <start_of_image>"
```

For larger models such as Llama4-Scout/Maverick, we suggest running decoding on a TPU cluster such as v5p-16.

## Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) of multimodal LLMs in MaxText targets post-training optimization; pre-training from scratch is currently not supported. The SFT process typically trains on Visual Question Answering (VQA) datasets, where the model learns to generate accurate text responses from both visual and textual inputs. During this fine-tuning phase, we recommend freezing the pre-trained encoder layers (such as vision transformers) to preserve their learned visual representations, while keeping the projection layers and LLM decoder components trainable. This selective training strategy lets the model adapt its cross-modal alignment and text generation capabilities without disrupting the robust feature extraction of the encoders, improving performance on multimodal understanding and reasoning tasks while maintaining computational efficiency. It is enabled by setting `freeze_vision_encoder_params=True` in [sft-vision-chartqa.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/configs/sft-vision-chartqa.yml).

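In MaxText this behavior is controlled by the config flag above. Purely as a conceptual illustration, the Optax pattern below shows one common way to freeze a subset of parameters by routing them to a zero-update transform; the parameter layout and the "vision" name check are hypothetical assumptions, not MaxText internals:

```python
import jax.numpy as jnp
import optax

def label_params(params):
  # Hypothetical layout: any top-level key containing "vision" is frozen.
  return {k: ("frozen" if "vision" in k else "trainable") for k in params}

optimizer = optax.multi_transform(
    {"trainable": optax.adamw(1e-5), "frozen": optax.set_to_zero()},
    label_params,
)

params = {"vision_encoder": jnp.ones(4), "decoder": jnp.ones(4)}
grads = {"vision_encoder": jnp.ones(4), "decoder": jnp.ones(4)}
state = optimizer.init(params)
updates, state = optimizer.update(grads, state, params)
# updates["vision_encoder"] is all zeros, so the frozen encoder weights never change.
```
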
Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as an example to demonstrate SFT functionality:

```shell
python -m MaxText.sft_trainer MaxText/configs/sft-vision-chartqa.yml \
    run_name=$idx \
    model_name=gemma3-4b \
    tokenizer_path="google/gemma-3-4b-pt" \
    per_device_batch_size=1 \
    max_prefill_predict_length=1024 \
    max_target_length=2048 \
    steps=200 \
    scan_layers=false \
    async_checkpointing=False \
    attention=dot_product \
    dataset_type=hf hf_path=parquet hf_access_token=$HF_ACCESS_TOKEN \
    hf_train_files=gs://aireenmei-multipod/dataset/hf/chartqa/train-* \
    base_output_directory=$BASE_OUTPUT_DIRECTORY \
    load_parameters_path=$UNSCANNED_CKPT_PATH \
    dtype=bfloat16 weight_dtype=bfloat16 sharding_tolerance=0.05
```

## Other Recommendations

- **Setting an appropriate prefill length**: To prevent truncation and ensure your full input (text + images) is processed, the prefill length should be longer than the combined length of your text tokens and image tokens; this combined length makes up the final sequence fed to the decoder. We recommend estimating the combined sequence length from your full input and then adding a buffer when setting `max_prefill_predict_length` for decoding (see the sketch after this list). Token estimation rules:
  - For text tokens, a good estimate is $\text{Text Tokens} \approx 1.3 \times \text{Number of Words in Prompt}$.
  - For Gemma3, each image is resized to 896×896 and contributes 256 tokens: $\text{Total Tokens} \approx \text{Text Tokens} + \text{Number of Images} \times 256$.
  - For Llama4 models, each image is dynamically tiled based on its size, with each resulting tile contributing 144 tokens: $\text{Total Tokens} \approx \text{Text Tokens} + \text{Number of Tiles of Image}_1 \times 144 + \dots + \text{Number of Tiles of Image}_N \times 144$.
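The following Python sketch applies these estimation rules. It is a rough heuristic: the 1.3 words-to-tokens ratio and the buffer size are approximations, so verify against the real tokenizer when in doubt.

```python
def estimate_prefill_length(num_words, model, num_images=0,
                            tiles_per_image=None, buffer=32):
  """Rough estimate for max_prefill_predict_length; not a tokenizer."""
  text_tokens = int(1.3 * num_words)
  if model.startswith("gemma3"):
    image_tokens = 256 * num_images                 # each image -> 256 tokens
  elif model.startswith("llama4"):
    tiles = tiles_per_image or [1] * num_images
    image_tokens = 144 * sum(tiles)                 # each tile -> 144 tokens
  else:
    raise ValueError(f"Unknown model family: {model}")
  return text_tokens + image_tokens + buffer

# Example: a 10-word Gemma3 prompt with one image -> 13 + 256 + 32 = 301
print(estimate_prefill_length(10, "gemma3-4b", num_images=1))
```
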
src/MaxText/examples/multimodal_gemma3_demo.ipynb

Lines changed: 212 additions & 0 deletions

@@ -0,0 +1,212 @@
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI-Hypercomputer/maxtext/blob/main/src/MaxText/examples/multimodal_gemma3_demo.ipynb)\n",
        "\n",
        "# Gemma3 Multimodal Inference/Training Demo"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Overview\n",
        "\n",
        "This notebook demonstrates MaxText's multimodal features, using Gemma3-4B as an example:\n",
        "- Convert an Orbax checkpoint from HuggingFace.\n",
        "- Run decoding on a single image input.\n",
        "- Apply SFT to the converted checkpoint on the ChartQA dataset.\n",
        "\n",
        "Given the relatively small size of Gemma3-4B, you can run this Colab on a v4-8, v5p-8, or v6e-4 TPU VM. However, we recommend using [XPK](https://github.com/AI-Hypercomputer/maxtext/blob/64d6d9b425e78dde94c37a82bb13ba5606e74b1b/docs/guides/run_maxtext_via_xpk.md) to schedule a training workload on a TPU cluster for better performance."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Get Your Hugging Face Token\n",
        "\n",
        "To access model checkpoints from the Hugging Face Hub, you need to authenticate with a personal access token.\n",
        "\n",
        "**Follow these steps to get your token:**\n",
        "\n",
        "1. **Navigate to the Access Tokens page** in your Hugging Face account settings. You can go there directly by visiting this URL:\n",
        "   * [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)\n",
        "\n",
        "2. **Create a new token** by clicking the **\"+ Create new token\"** button.\n",
        "\n",
        "3. **Give your token a name** and assign it a **`read` role**. The `read` role is sufficient for downloading models.\n",
        "\n",
        "4. **Copy the generated token**. You will need to paste it into `HF_TOKEN`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5KPyOE8e9WbO"
      },
      "outputs": [],
      "source": [
        "# Install MaxText and dependencies\n",
        "# 1. Install uv, a fast Python package installer\n",
        "!pip install uv\n",
        "\n",
        "# 2. Install MaxText and its dependencies\n",
        "!uv pip install maxtext --resolution=lowest\n",
        "!install_maxtext_github_deps"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "import MaxText\n",
        "\n",
        "# Get the root directory of the MaxText package\n",
        "MAXTEXT_REPO_ROOT = os.path.dirname(MaxText.__file__)\n",
        "\n",
        "# Define the model name\n",
        "MODEL_NAME = \"gemma3-4b\"\n",
        "\n",
        "# Use either a GCS path or a local path for the model checkpoint\n",
        "MODEL_CHECKPOINT_PATH = f\"gs://your-gcs-bucket/{MODEL_NAME}\"\n",
        "\n",
        "# Replace with your actual Hugging Face token\n",
        "HF_TOKEN = \"your_huggingface_token_here\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Convert Checkpoint from HuggingFace"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!python3 -m MaxText.utils.ckpt_conversion.to_maxtext \\\n",
        "  $MAXTEXT_REPO_ROOT/configs/base.yml \\\n",
        "  model_name=$MODEL_NAME \\\n",
        "  hf_access_token=$HF_TOKEN \\\n",
        "  base_output_directory=$MODEL_CHECKPOINT_PATH \\\n",
        "  use_multimodal=true \\\n",
        "  scan_layers=false"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Decode on One Image"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "!python -m MaxText.decode \\\n",
        "  $MAXTEXT_REPO_ROOT/configs/base.yml \\\n",
        "  model_name=$MODEL_NAME \\\n",
        "  tokenizer_path=assets/tokenizer.gemma3 \\\n",
        "  load_parameters_path=$MODEL_CHECKPOINT_PATH/0/items \\\n",
        "  per_device_batch_size=1 \\\n",
        "  run_name=ht_test \\\n",
        "  max_prefill_predict_length=272 \\\n",
        "  max_target_length=300 \\\n",
        "  steps=1 \\\n",
        "  async_checkpointing=false \\\n",
        "  scan_layers=false \\\n",
        "  use_multimodal=true \\\n",
        "  prompt='Describe image <start_of_image>' \\\n",
        "  image_path=$MAXTEXT_REPO_ROOT/test_assets/test_image.jpg \\\n",
        "  attention='dot_product'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Supervised Finetuning (SFT)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Running the cell below will trigger a 10-step SFT run on your TPU VM (v4-8, v5p-8, or v6e-4). However, we recommend using [XPK](https://github.com/AI-Hypercomputer/maxtext/blob/64d6d9b425e78dde94c37a82bb13ba5606e74b1b/docs/guides/run_maxtext_via_xpk.md) to schedule a training workload on a TPU cluster for better performance. After SFT, the resulting checkpoint will be saved to `BASE_OUTPUT_DIRECTORY`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# Define the SFT output directory and run settings\n",
        "BASE_OUTPUT_DIRECTORY = f\"gs://your-gcs-bucket/{MODEL_NAME}-sft\"\n",
        "PRE_TRAINED_MODEL_TOKENIZER = \"google/gemma-3-4b-it\"\n",
        "WORKLOAD_NAME = f\"{MODEL_NAME}-chartqa-sft\"\n",
        "STEPS = 10\n",
        "PER_DEVICE_BATCH_SIZE = 1\n",
        "\n",
        "!python -m MaxText.sft_trainer \\\n",
        "  $MAXTEXT_REPO_ROOT/configs/sft-vision-chartqa.yml \\\n",
        "  run_name=$WORKLOAD_NAME \\\n",
        "  model_name=$MODEL_NAME \\\n",
        "  tokenizer_path=$PRE_TRAINED_MODEL_TOKENIZER \\\n",
        "  hf_access_token=$HF_TOKEN \\\n",
        "  load_parameters_path=$MODEL_CHECKPOINT_PATH/0/items \\\n",
        "  base_output_directory=$BASE_OUTPUT_DIRECTORY \\\n",
        "  per_device_batch_size=$PER_DEVICE_BATCH_SIZE \\\n",
        "  steps=$STEPS \\\n",
        "  max_prefill_predict_length=1024 \\\n",
        "  max_target_length=2048 \\\n",
        "  checkpoint_period=1000 \\\n",
        "  scan_layers=False \\\n",
        "  async_checkpointing=True \\\n",
        "  enable_checkpointing=True \\\n",
        "  attention=dot_product \\\n",
        "  max_num_images_per_example=1 \\\n",
        "  dataset_type=hf profiler=xplane"
      ]
    }
  ],
  "metadata": {
    "accelerator": "TPU",
    "colab": {
      "gpuType": "V5E1",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "python3.12",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.7"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}

src/MaxText/input_pipeline/_hf_data_processing.py

Lines changed: 1 addition & 0 deletions

@@ -57,6 +57,7 @@ def vision_sft_preprocessing_pipeline(
          },
          remove_columns=image_column,  # Drop the original image columns
      )
+     image_column = "images"

      dataset = dataset.select_columns(text_columns + [image_column])
      if image_column != "images":
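
For context, the added line keeps the `image_column` variable in sync after the preceding `.map()` writes processed images into a new `images` column. A toy sketch of the same pattern with HuggingFace `datasets` (toy data and a trivial map function, not the MaxText pipeline) looks like this:

```python
from datasets import Dataset

ds = Dataset.from_dict({"question": ["What is shown?"], "image": [[0, 1, 2]]})
image_column = "image"
# The map writes processed images into "images" and drops the original column.
ds = ds.map(lambda ex: {"images": ex[image_column]}, remove_columns=image_column)
image_column = "images"   # keep the variable pointing at the new column name
ds = ds.select_columns(["question", image_column])
print(ds.column_names)    # ['question', 'images']
```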
