…to better handle documents that contain images, such as financial reports with charts.
With this feature enabled, the data ingestion process will extract images from your documents
using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.

During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o`, `gpt-4o-mini`, `gpt-5`, or `gpt-5-mini`.

With this feature enabled, the following changes are made:

* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to the LLM, and ask it to answer the question based on both kinds of sources.
* **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.

## Prerequisites

* A chat completion model that supports multimodal inputs. The default model for the repository is currently `gpt-4.1-mini`, which does support multimodal inputs. The `gpt-4o-mini` model technically supports multimodal inputs, but due to how its image tokens are calculated, you need a much higher deployment capacity to use it effectively. Try `gpt-4.1-mini` first, and experiment with other models later (see the sketch below).
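
A minimal sketch for switching models via the azd environment, assuming the repository uses an `AZURE_OPENAI_CHATGPT_MODEL` variable (an assumed name; check this repository's deployment docs for the exact variable):

```shell
# Assumed variable name -- verify against this repository's deployment docs
azd env set AZURE_OPENAI_CHATGPT_MODEL gpt-4.1-mini
```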

## Deployment

1. **Enable multimodal capabilities**
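
    A minimal sketch of this step, assuming the feature is gated behind a `USE_MULTIMODAL` azd environment flag (the flag name is an assumption; confirm it in this repository's docs):

    ```shell
    # Assumed flag name -- confirm in this repository's docs before running
    azd env set USE_MULTIMODAL true
    ```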

* Check the "Supporting content" tab to see the text and images that were used to generate the answer.
* Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.

## Customize the multimodal approach

You can customize the RAG flow approach with a few additional environment variables.
You can also modify those settings in the "Developer Settings" in the chat UI,
to experiment with different options before committing to them.
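
To see how these settings are currently configured, you can list the azd environment values and filter on the shared `RAG_` prefix:

```shell
# Show the current values of the RAG flow settings (azd env get-values prints KEY="value" pairs)
azd env get-values | grep RAG_
```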

### Control vector retrieval

Set these variables to control whether Azure AI Search retrieves using the text embeddings, the image embeddings, or both.
By default, it retrieves using both text and image embeddings.

To disable retrieval with text embeddings, run:

```shell
azd env set RAG_SEARCH_TEXT_EMBEDDINGS false
```

To disable retrieval with image embeddings, run:

```shell
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS false
```

Many developers may find that they can turn off image embeddings and still get high-quality retrieval, since the text embeddings are based on text chunks that include figure descriptions.
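
If you want image embeddings back later, set the variable to `true` again. Depending on how your deployment reads these settings, you may need to re-run provisioning (for example, `azd up`) for the change to take effect:

```shell
# Re-enable retrieval with image embeddings
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS true
```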

### Control LLM input sources

Set variables to control whether the chat completion model will use text inputs, image inputs, or both:

To disable text inputs, run:

```shell
azd env set RAG_SEND_TEXT_SOURCES false
```

To disable image inputs, run:

```shell
azd env set RAG_SEND_IMAGE_SOURCES false
```

It is unlikely that you would want to turn off text sources, unless your RAG data consists entirely of image-based documents.
However, you may want to turn off image inputs to save on token costs and improve performance;
you may still see good results with text inputs alone, since those inputs include the figure descriptions.
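
For example, one plausible cost-saving combination, using only the variables documented above, keeps multimodal retrieval but sends only text sources to the model:

```shell
# Retrieve with both embedding types, but send only text sources to the LLM
azd env set RAG_SEARCH_TEXT_EMBEDDINGS true
azd env set RAG_SEARCH_IMAGE_EMBEDDINGS true
azd env set RAG_SEND_TEXT_SOURCES true
azd env set RAG_SEND_IMAGE_SOURCES false
```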

## Compatibility

* This feature is **not** compatible with [integrated vectorization](./deploy_features.md#enabling-integrated-vectorization), as the currently configured built-in skills do not process images or store image embeddings. Azure AI Search does now offer built-in skills for multimodal support, as demonstrated in [azure-ai-search-multimodal-sample](https://github.com/Azure-Samples/azure-ai-search-multimodal-sample), but we have not integrated them in this project. Instead, we are working on building a custom skill based on the data ingestion code in this repository, and hosting that skill on Azure Functions. Stay tuned to the releases to find out when that's available.
* This feature is **not** fully compatible with the [agentic retrieval](./agentic_retrieval.md) feature.
The agent *will* perform the multimodal vector embedding search, but it will not return images in the response,
so we cannot send the images to the chat completion model.