…to better handle documents that contain images, such as financial reports with charts.
With this feature enabled, the data ingestion process will extract images from your documents
using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.

During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o`, `gpt-4o-mini`, `gpt-5`, or `gpt-5-mini`.

With this feature enabled, the following changes are made:

* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to the LLM, and ask it to answer the question based on both kinds of sources.
* **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.

## Prerequisites

* A chat completion model that supports multimodal inputs. The default model for the repository is currently `gpt-4.1-mini`, which does support multimodal inputs. The `gpt-4o-mini` model technically supports multimodal inputs, but due to how its image tokens are calculated, you need a much higher deployment capacity to use it effectively. Try `gpt-4.1-mini` first, and experiment with other models later (see the sketch below).
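
A minimal sketch for switching models via the azd environment, assuming the repository uses an `AZURE_OPENAI_CHATGPT_MODEL` variable (an assumed name; check this repository's deployment docs for the exact variable):

```shell
# Assumed variable name -- verify against this repository's deployment docs
azd env set AZURE_OPENAI_CHATGPT_MODEL gpt-4.1-mini
```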

## Deployment

1. **Enable multimodal capabilities**
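
    A minimal sketch of this step, assuming the feature is gated behind a `USE_MULTIMODAL` azd environment flag (the flag name is an assumption; confirm it in this repository's docs):

    ```shell
    # Assumed flag name -- confirm in this repository's docs before running
    azd env set USE_MULTIMODAL true
    ```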

* Check the "Supporting content" tab to see the text and images that were used to generate the answer.
* Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.

## Customize the multimodal approach

You can customize the RAG flow approach with a few additional environment variables.
You can also modify those settings in the "Developer Settings" in the chat UI,
to experiment with different options before committing to them.
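
To see how these settings are currently configured, you can list the azd environment values and filter on the shared `RAG_` prefix:

```shell
# Show the current values of the RAG flow settings (azd env get-values prints KEY="value" pairs)
azd env get-values | grep RAG_
```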

### Control vector retrieval

Set these variables to control whether Azure AI Search retrieves using the text embeddings, the image embeddings, or both.
By default, it retrieves using both text and image embeddings.

To disable retrieval with text embeddings, run:

```shell
azd env set RAG_SEARCH_TEXT_EMBEDDINGS false
```

To disable retrieval with image embeddings, run:

```shell
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS false
```

Many developers may find that they can turn off image embeddings and still get high-quality retrieval, since the text embeddings are based on text chunks that include figure descriptions.
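
If you want image embeddings back later, set the variable to `true` again. Depending on how your deployment reads these settings, you may need to re-run provisioning (for example, `azd up`) for the change to take effect:

```shell
# Re-enable retrieval with image embeddings
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS true
```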

### Control LLM input sources

Set variables to control whether the chat completion model will use text inputs, image inputs, or both:

To disable text inputs, run:

```shell
azd env set RAG_SEND_TEXT_SOURCES false
```

To disable image inputs, run:

```shell
azd env set RAG_SEND_IMAGE_SOURCES false
```

It is unlikely that you would want to turn off text sources, unless your RAG data consists entirely of image-based documents.
However, you may want to turn off image inputs to save on token costs and improve performance;
you may still see good results with text inputs alone, since those inputs include the figure descriptions.
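
For example, one plausible cost-saving combination, using only the variables documented above, keeps multimodal retrieval but sends only text sources to the model:

```shell
# Retrieve with both embedding types, but send only text sources to the LLM
azd env set RAG_SEARCH_TEXT_EMBEDDINGS true
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS true
azd env set RAG_SEND_TEXT_SOURCES true
azd env set RAG_SEND_IMAGE_SOURCES false
```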

## Compatibility

* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we have not integrated them in this project. Instead, we are working on building a custom skill based on the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
* This feature is **not** fully compatible with the [agentic retrieval](./agentic_retrieval.md) feature.
The agent *will* perform the multimodal vector embedding search, but it will not return images in the response,
so we cannot send the images to the chat completion model.