
Commit d52de82

Revise multimodal doc to be clearer
1 parent a0c3b41 commit d52de82

File tree

1 file changed: +40 −26 lines


docs/multimodal.md

Lines changed: 40 additions & 26 deletions
@@ -6,7 +6,7 @@ to better handle documents that contain images, such as financial reports with c
 With this feature enabled, the data ingestion process will extract images from your documents
 using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.
 
-During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o` or `gpt-4o-mini`.
+During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o`, `gpt-4o-mini`, `gpt-5`, or `gpt-5-mini`.
 
 With this feature enabled, the following changes are made:

@@ -15,13 +15,11 @@ With this feature enabled, the following changes are made:
 * **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to the LLM, and ask it to answer the question based on both kinds of sources.
 * **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.
 
-## Using the feature
-
-### Prerequisites
+## Prerequisites
 
 * The use of a chat completion model that supports multimodal inputs. The default model for the repository is currently `gpt-4.1-mini`, which does support multimodal inputs. The `gpt-4o-mini` technically supports multimodal inputs, but due to how image tokens are calculated, you need a much higher deployment capacity to use it effectively. Please try `gpt-4.1-mini` first, and experiment with other models later.
 
-### Deployment
+## Deployment
 
 1. **Enable multimodal capabilities**

@@ -67,38 +65,54 @@ With this feature enabled, the following changes are made:
 * Check the "Supporting content" tab to see the text and images that were used to generate the answer.
 * Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.
 
-5. **Customize the multimodal approach:**
+## Customize the multimodal approach
 
-You can customize the RAG flow approach with a few additional environment variables.
+You can customize the RAG flow approach with a few additional environment variables.
+You can also modify those settings in the "Developer Settings" in the chat UI,
+to experiment with different options before committing to them.
 
-The following variables can be set to either true or false,
-to control whether Azure AI Search will use text embeddings, image embeddings, or both:
+### Control vector retrieval
 
-```shell
-azd env set RAG_SEARCH_TEXT_EMBEDDINGS true
-```
+Set variables to control whether Azure AI Search will do retrieval using the text embeddings, image embeddings, or both.
+By default, it will retrieve using both text and image embeddings.
 
-```shell
-azd env set RAG_SEARCH_IMAGE_EMBEDDINGS true
-```
+To disable retrieval with text embeddings, run:
 
-The following variable can be set to either true or false,
-to control whether the chat completion model will use text inputs, image inputs, or both:
+```shell
+azd env set RAG_SEARCH_TEXT_EMBEDDINGS false
+```
 
-```shell
-azd env set RAG_SEND_TEXT_SOURCES true
-```
+To disable retrieval with image embeddings, run:
 
-```shell
-azd env set RAG_SEND_IMAGE_SOURCES true
-```
+```shell
+azd env set RAG_SEARCH_IMAGE_EMBEDDINGS false
+```
+
+Many developers may find that they can turn off image embeddings and still have high quality retrieval, since the text embeddings are based off text chunks that include figure descriptions.
+
+### Control LLM input sources
+
+Set variables to control whether the chat completion model will use text inputs, image inputs, or both:
+
+To disable text inputs, run:
+
+```shell
+azd env set RAG_SEND_TEXT_SOURCES false
+```
+
+To disable image inputs, run:
+
+```shell
+azd env set RAG_SEND_IMAGE_SOURCES false
+```
 
-You can also modify those settings in the "Developer Settings" in the chat UI,
-to experiment with different options before committing to them.
+It is unlikely that you would want to turn off text sources, unless your RAG is based on documents that are 100% image-based.
+However, you may want to turn off image inputs to save on token costs and improve performance,
+and you may still see good results with just text inputs, since the inputs contain the figure descriptions.
 
 ## Compatibility
 
-* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we were not able to customize them enough to meet the requirements of this feature. Instead, we are working on making a custom skill based off the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
+* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we have not integrated them in this project. Instead, we are working on making a custom skill based off the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
 * This feature is **not** fully compatible with the [agentic retrieval](./agentic_retrieval.md) feature.
 The agent *will* perform the multimodal vector embedding search, but it will not return images in the response,
 so we cannot send the images to the chat completion model.
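
The four flags introduced in this revision combine into simple profiles. A minimal sketch of a text-only setup, using the same `azd` environment variables documented above (the combination shown here is an illustration, not a recommendation from the doc):

```shell
# Text-only profile (sketch): retrieve with text embeddings only,
# and send only text sources to the chat completion model.
# Flag names are taken from the documentation above.
azd env set RAG_SEARCH_TEXT_EMBEDDINGS true
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS false
azd env set RAG_SEND_TEXT_SOURCES true
azd env set RAG_SEND_IMAGE_SOURCES false
```

As the doc notes, this profile can still perform well when text chunks carry figure descriptions, while avoiding image token costs.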
