Commit 610d7fb

Doc updates for multimodal and python3.12
clarify fix
1 parent 0b24223 commit 610d7fb

File tree

3 files changed: 18 additions & 10 deletions

- docs/guides/multimodal.md
- docs/index.md
- src/MaxText/examples/multimodal_gemma3_demo.ipynb

docs/guides/multimodal.md

Lines changed: 14 additions & 8 deletions
@@ -19,13 +19,13 @@ Multimodal Large Language Models (LLMs) extend traditional text-only models by i
 - **Modality-Specific Encoders**: Modality-specific encoders will transform the preprocessed data into high-dimensional representations (e.g., vision transformers for images).
 - **Projection and Merge**: Projection layers will map these modality-specific embeddings into the shared embedding space of the language model, usually aligned with the dimension of text embeddings. These projected embeddings are then merged with text token embeddings, allowing the unified model to process and reason over multiple modalities simultaneously within a single coherent framework.

-
-<img src="../_static/multimodal_overview.png" alt="Illustration of multimodal MaxText." width="60%">
+![Illustration of multimodal MaxText.](../_static/multimodal_overview.png)
 *Figure 1: Overview of multimodal dataflow in MaxText.*

+
 ## Checkpoint Conversion

-Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace (README). This tool is used for the Gemma3 model family. Use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:
+Recently we have onboarded a new centralized tool for bidirectional checkpoint conversion between MaxText and HuggingFace ([README](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/utils/ckpt_conversion/README.md)). This tool is used for the Gemma3 model family. Use this command to convert an unscanned checkpoint from HuggingFace to MaxText, and save it to `MAXTEXT_CKPT_GCS_PATH`:

 ```shell
 export HF_ACCESS_TOKEN=hf_...
@@ -38,7 +38,7 @@ python -m MaxText.utils.ckpt_conversion.to_maxtext MaxText/configs/base.yml \
 scan_layers=false
 ```

-For the Llama4 model family, we are using a separate checkpoint conversion script (of note,we will gradually migrate all checkpoint conversion scripts to the above consolidated tool soon):
+For the Llama4 model family, we are using a separate checkpoint conversion script (of note, we will gradually migrate all checkpoint conversion scripts to the above consolidated tool soon):

 ```shell
 export LOCAL_HF_MODEL_PATH=... # Need to pre-download the safetensors from HuggingFace
@@ -107,7 +107,7 @@ For larger models such as Llama4-Scout/Maverick, we suggest to run the decoding

 ## Supervised Fine-Tuning

-Supervised Fine-Tuning (SFT) of multimodal LLMs in MaxText focuses specifically on post-training optimization rather than pre-training from scratch, which is currently not supported. The SFT process typically involves training on Visual Question Answering (VQA) datasets where the model learns to generate accurate text responses based on both visual and textual inputs. During this fine-tuning phase, we recommend to freeze the pre-trained encoder layers (such as vision transformers) to preserve their learned visual representations, while the projection layers and LLM decoder components remain trainable. This selective training strategy allows the model to adapt the cross-modal alignment and text generation capabilities without disrupting the robust feature extraction abilities of the encoders, ultimately leading to improved performance on multimodal understanding and reasoning tasks while maintaining computational efficiency. This is achieved by setting `freeze_vision_encoder_params=True` in [sft-vision-chartqa.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/configs/sft-vision-chartqa.yml).
+Supervised Fine-Tuning (SFT) of multimodal LLMs in MaxText focuses specifically on post-training; we don't yet support pre-training multimodal models from scratch. The SFT process typically involves training on Visual Question Answering (VQA) datasets, where the model learns to generate accurate text responses based on both visual and textual inputs. During this fine-tuning phase, we recommend freezing the pre-trained encoder layers (such as vision transformers) to preserve their learned visual representations, while the projection layers and LLM decoder components remain trainable. This selective training strategy allows the model to adapt the cross-modal alignment and text generation capabilities without disrupting the robust feature extraction abilities of the encoders, ultimately leading to improved performance on multimodal understanding and reasoning tasks while maintaining computational efficiency. This is achieved by setting `freeze_vision_encoder_params=True` in [sft-vision-chartqa.yml](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/configs/sft-vision-chartqa.yml).

 Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as an example to demonstrate SFT functionality:

@@ -136,7 +136,13 @@ python -m MaxText.sft_trainer \

 ## Other Recommendations
 - **Setting appropriate prefill length**: To prevent truncation and ensure your full input (text + image) is processed, the prefill length should be set longer than the total combined length of your text tokens and image tokens. This combined length makes up the final sequence fed to the decoder. We recommend estimating the combined sequence length from your full input and then adding a buffer when setting your `max_prefill_predict_length` for decoding. Token estimation rules:
-  - For text tokens, a good estimate is $\text{Text Tokens} \approx 1.3 \times \text{Number of Words in Prompt}$.
-  - For Gemma3, each image is resized to 896*896 and contributes 256 tokens. $\text{Total Tokens} \approx \text{Text Tokens} + \text{Number of Images} * 256$.
-  - For Llama4 models, each image is dynamically tiled based on its size, with each resulting tile contributing 144 tokens. $\text{Total Tokens} \approx \text{Text Tokens} + \text{Number of Tiles of Image1} * 144 + ... + \text{Number of Tiles of ImageN} * 144$.
+  - For text tokens, a good estimate is:
+
+    $\text{Text Tokens} \approx 1.3 \times \text{Number of Words in Prompt}$.
+  - For Gemma3, each image is resized to 896×896 and contributes 256 tokens:
+
+    $\text{Total Tokens} \approx \text{Text Tokens} + \text{Number of Images} \times 256$.
+  - For Llama4 models, each image is dynamically tiled based on its size, with each resulting tile contributing 144 tokens:
+
+    $\text{Total Tokens} \approx \text{Text Tokens} + 144 \times \sum_{i=1}^{N} \text{Number of Tiles of Image}_i$.
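
For a quick sanity check of the estimates above, take an illustrative input of a 100-word prompt with two images. On Gemma3, $\text{Total Tokens} \approx 1.3 \times 100 + 2 \times 256 = 642$; on Llama4, if each image were split into 4 tiles (the actual tile count depends on image size), $\text{Total Tokens} \approx 130 + 144 \times 8 = 1282$. Rounding up with some buffer, a `max_prefill_predict_length` of 1024 or 2048, respectively, would be comfortable.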

docs/index.md

Lines changed: 3 additions & 1 deletion
@@ -16,6 +16,8 @@

 # MaxText

+> **_NOTE:_** We recommend running MaxText with Python 3.12, as it is our primary supported version. Other Python versions may encounter compatibility issues.
+
 MaxText is a high performance, highly scalable, open-source LLM library and reference implementation written in pure Python/[JAX](https://docs.jax.dev/en/latest/jax-101.html) and targeting Google Cloud TPUs and GPUs for training.

 MaxText provides a library of high performance models to choose from, including Gemma, Llama, DeepSeek, Qwen, and Mistral. For each of these models, MaxText supports pre-training (up to tens of thousands of chips) and scalable post-training, with popular techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO, a type of Reinforcement Learning).
@@ -28,7 +30,7 @@ Check out our [Read The Docs site](https://maxtext.readthedocs.io/en/latest/) or

 ## 🔥 Latest news 🔥

-* [September 5, 2025] MaxText has moved to an `src` layout as part of [RESTRUCTURE.md]
+* [September 5, 2025] MaxText has moved to an `src` layout as part of [RESTRUCTURE.md](https://github.com/AI-Hypercomputer/maxtext/blob/main/RESTRUCTURE.md).
 * [August 13, 2025] The Qwen3 2507 MoE family of models is now supported: MoEs: 235B Thinking & 280B Coder as well as existing dense models: 0.6B, 4B, 8B, 14B, and 32B.
 * [July 27, 2025] Updated TFLOPS/s calculation ([PR](https://github.com/AI-Hypercomputer/maxtext/pull/1988)) to account for causal attention, dividing the attention flops in half. Accounted for sliding window and chunked attention reduced attention flops in [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2009) and [PR](https://github.com/AI-Hypercomputer/maxtext/pull/2030). Changes impact large sequence configs, as explained in this [doc](https://github.com/AI-Hypercomputer/maxtext/blob/main/docs/guides/performance_metrics.md)
 * [July 16, 2025] We will be restructuring the MaxText repository for improved organization and clarity. Please review the [proposed structure](https://github.com/AI-Hypercomputer/maxtext/blob/main/RESTRUCTURE.md) and provide feedback.
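
The new Python 3.12 note is easy to act on with a dedicated virtual environment. A minimal sketch, assuming `python3.12` is already installed and using a placeholder venv path:

```shell
# Create and activate a Python 3.12 environment for MaxText (venv location is a placeholder).
python3.12 -m venv ~/venvs/maxtext
source ~/venvs/maxtext/bin/activate
python --version   # expect: Python 3.12.x
```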

src/MaxText/examples/multimodal_gemma3_demo.ipynb

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
 "- Apply decoding on a single image input.\n",
 "- Apply SFT to the converted checkpoint on ChartQA dataset.\n",
 "\n",
-"Given the relative small size of Gemma3-4B, you can run this colab on a v4-8, v5p-8 or v6e-4 TPU VM. However, we recommend using [XPK](https://github.com/AI-Hypercomputer/maxtext/blob/64d6d9b425e78dde94c37a82bb13ba5606e74b1b/docs/guides/run_maxtext_via_xpk.md) to schedule a training workload on a TPU cluster for better performance."
+"Given the relatively small size of Gemma3-4B, you can run this colab on a v4-8, v5p-8 or v6e-4 TPU VM. You can also use [XPK](https://github.com/AI-Hypercomputer/maxtext/blob/64d6d9b425e78dde94c37a82bb13ba5606e74b1b/docs/guides/run_maxtext_via_xpk.md) to run training workloads on a TPU cluster."
 ]
 },
 {
