docs/guides/multimodal.md: 15 additions & 12 deletions
@@ -7,7 +7,7 @@ This document provides a guide to use the multimodal functionalities in MaxText
-**Multimodal Decode**: Inference with text+images as input.
-**Supervised Fine-Tuning (SFT)**: Apply SFT to the model using a visual-question-answering dataset.

- The following table provides a list of models and modalities we currently support:
+ We also provide a [colab](https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/examples/multimodal_gemma3_demo.ipynb) demonstrating the multimodal features. The following table provides a list of models and modalities we currently support:
| Models | Input Modalities | Output Modalities |
| :---- | :---- | :---- |
| - Gemma3-4B/12B/27B<br>- Llama4-Scout/Maverick | Text, images | Text |
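As a rough illustration of what "text+images as input" means at the embedding level, the sketch below concatenates projected image-patch embeddings with text token embeddings before they reach the decoder. This is plain JAX, not MaxText's actual API; all shapes, sizes, and parameter names here are illustrative assumptions.

```python
# Conceptual sketch of a text+image prompt at the embedding level.
# NOT MaxText's API; shapes and names are illustrative assumptions.
import jax
import jax.numpy as jnp

d_model = 256          # decoder embedding width (assumed)
vocab_size = 1000      # toy vocabulary size (assumed)
patch_dim = 128        # vision-encoder output width (assumed)
num_patches = 16       # patches produced for one image (assumed)

key = jax.random.PRNGKey(0)
k_embed, k_proj, k_img = jax.random.split(key, 3)

# Toy parameters standing in for the trained model's weights.
token_embedding = jax.random.normal(k_embed, (vocab_size, d_model))
image_projection = jax.random.normal(k_proj, (patch_dim, d_model))

# Text side: token ids -> embeddings.
text_ids = jnp.array([5, 17, 42, 7])            # e.g. a short question
text_emb = token_embedding[text_ids]            # (4, d_model)

# Image side: vision-encoder patch features projected into d_model.
patch_features = jax.random.normal(k_img, (num_patches, patch_dim))
image_emb = patch_features @ image_projection   # (num_patches, d_model)

# Multimodal prompt: image embeddings prepended to text embeddings,
# fed to the decoder as one sequence.
prompt_emb = jnp.concatenate([image_emb, text_emb], axis=0)
print(prompt_emb.shape)                          # (num_patches + 4, d_model)
```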
@@ -113,22 +113,25 @@ Here, we use [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as
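The hunk above refers to [ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA) as the visual-question-answering dataset for the SFT example; the remaining changed lines are collapsed in this view. As a minimal sketch of inspecting that dataset with the Hugging Face `datasets` library, one could do the following; the record field names (`query`, `label`) are assumptions about ChartQA's schema rather than something stated in the diff.

```python
# Minimal sketch: peek at the ChartQA dataset referenced for SFT.
# Field names ("query", "label") are assumed, not taken from the diff.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/ChartQA", split="train")
example = ds[0]

# A visual-question-answering record: a chart image, a question, and answers.
print(example.keys())
print(example.get("query"))   # question about the chart (assumed field name)
print(example.get("label"))   # expected answer(s) (assumed field name)
```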