**docs/customization.md** (16 additions, 21 deletions)

The chat tab uses the approach programmed in [chatreadretrieveread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/chatreadretrieveread.py).
1. **Query rewriting**: It calls the OpenAI ChatCompletion API to turn the user question into a good search query, using the prompt and tools from [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty).
2. **Search**: It queries Azure AI Search for search results for that query (optionally using the vector embeddings for that query).
3. **Answering**: It then calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty). That call includes the past message history as well (or as many messages as fit inside the model's token limit). A minimal sketch of this three-step flow follows the list.
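
To make the flow concrete, here is a minimal, hypothetical sketch of the three steps. It is not the actual code in `chatreadretrieveread.py`; the deployment name, index field names (`content`, `sourcepage`), prompts, and keys are placeholders/assumptions, and the real approach also streams responses and trims history to the token limit.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

# Placeholder endpoints, keys, and names -- replace with your own values.
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai>.openai.azure.com",
    api_key="<openai-key>",
    api_version="2024-06-01",
)
search_client = SearchClient(
    "https://<your-search>.search.windows.net",
    "<index-name>",
    AzureKeyCredential("<search-key>"),
)


def chat_turn(history: list[dict], user_question: str) -> str:
    # 1. Query rewriting: turn the question (plus history) into a search query.
    rewrite = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a short search query."},
            *history,
            {"role": "user", "content": user_question},
        ],
    )
    query = rewrite.choices[0].message.content

    # 2. Search: retrieve candidate chunks from Azure AI Search.
    results = search_client.search(search_text=query, top=3)
    sources = "\n".join(f"{doc['sourcepage']}: {doc['content']}" for doc in results)

    # 3. Answering: answer from the sources, including as much history as fits.
    answer = openai_client.chat.completions.create(
        model="<chat-deployment>",
        messages=[
            {"role": "system", "content": "Answer ONLY using the sources below.\n\nSources:\n" + sources},
            *history,
            {"role": "user", "content": user_question},
        ],
    )
    return answer.choices[0].message.content
```
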
The prompts are currently tailored to the sample data since they start with "Assistant helps the company employees with their healthcare plan questions, and questions about the employee handbook." Modify the [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty) and [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty) prompts to match your data.
##### Chat with multimodal feature
If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG, there are several differences in the chat approach:
1. **Query rewriting**: Unchanged.
2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base64 encoding.
3. **Answering**: When it combines the search results and user question, it includes the base64-encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked. A short sketch of these multimodal additions follows the list.
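
To illustrate steps 2 and 3, the hypothetical sketch below shows a call to the Azure AI Vision vectorize-text API and how base64 images can be attached to a chat message. The endpoint path follows the linked documentation, but the API/model versions, helper names, and image format here are assumptions, not the sample's exact implementation.

```python
import requests

AI_VISION_ENDPOINT = "https://<your-vision-account>.cognitiveservices.azure.com"


def vectorize_text(query: str, key: str) -> list[float]:
    # Azure AI Vision "vectorize text" API: returns an embedding in the same
    # space as the image embeddings stored in the search index.
    # The api-version and model-version values below are assumptions; check the linked docs.
    response = requests.post(
        f"{AI_VISION_ENDPOINT}/computervision/retrieval:vectorizeText",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={"Ocp-Apim-Subscription-Key": key},
        json={"text": query},
    )
    response.raise_for_status()
    return response.json()["vector"]


def build_multimodal_user_message(question: str, sources_text: str, images_base64: list[str]) -> dict:
    # Text sources and base64-encoded images travel together in a single user
    # message to a chat completion model that accepts image inputs.
    content = [{"type": "text", "text": f"{question}\n\nSources:\n{sources_text}"}]
    for image in images_base64:
        content.append({"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image}"}})
    return {"role": "user", "content": content}
```
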
#### Ask tab
The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).
1. **Search**: It queries Azure AI Search for search results for the user question (optionally using the vector embeddings for that question; see the sketch after this list).
2. **Answering**: It then combines the search results and user question, and calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty).
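
As a rough sketch of the optional vector search in step 1 (not the sample's exact code), a hybrid query can combine the question text with an OpenAI embedding of the question; the embedding deployment name and the `embedding` field name are assumptions.

```python
from azure.search.documents.models import VectorizedQuery


def hybrid_search(openai_client, search_client, question: str):
    # Embed the question with the OpenAI embeddings API (deployment name is a placeholder).
    embedding = openai_client.embeddings.create(
        model="<embedding-deployment>", input=question
    ).data[0].embedding
    # Combine keyword search over the question text with a vector query
    # against the (assumed) "embedding" field.
    return search_client.search(
        search_text=question,
        vector_queries=[VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="embedding")],
        top=3,
    )
```
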
The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions." Modify [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty) to match your data.
#### Ask with multimodal feature
If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG, there are several differences in the ask approach:
1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base64 encoding (see the sketch after this list).
2. **Answering**: When it combines the search results and user question, it includes the base64-encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.
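
The image download and encoding mentioned in step 1 could look roughly like the sketch below; the storage account URL, container, and blob names are placeholders, and the real approach presumably resolves them from the image URLs stored in the search index.

```python
import base64

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Placeholder storage account -- the sample stores the real image URL in the index.
blob_service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)


def fetch_image_as_base64(container: str, blob_name: str) -> str:
    # Download the image bytes and return a base64 string suitable for a
    # data: URL in a multimodal chat message.
    blob = blob_service.get_blob_client(container=container, blob=blob_name)
    image_bytes = blob.download_blob().readall()
    return base64.b64encode(image_bytes).decode("ascii")
```
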

**docs/deploy_features.md** (3 additions, 5 deletions)

## Enabling multimodal embeddings and answering
⚠️ This feature is not currently compatible with [agentic retrieval](./agentic_retrieval.md).
When your documents include images, you can optionally enable this feature that can use image embeddings when searching and also use images when answering questions.
Learn more in the [multimodal guide](./multimodal.md).
## Enabling media description with Azure Content Understanding
⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization).
It is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities, so you may want to enable the multimodal feature instead or as well.
By default, if your documents contain image-like figures, the data ingestion process will ignore those figures, so users will not be able to ask questions about them.
## Enabling Integrated Vectorization
Azure AI Search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/blog/azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in-azure-ai-search/3960809). This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.
To enable integrated vectorization with this sample:

**docs/multimodal.md** (18 additions, 29 deletions)

# RAG chat: Support for multimodal documents
This repository includes an optional feature that uses multimodal embedding models and multimodal chat completion models to better handle documents that contain images, such as financial reports with charts and graphs.
With this feature enabled, the data ingestion process will extract images from your documents using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.
During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o` or `gpt-4o-mini`.
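
As a rough illustration of such a multi-vector query (not the sample's exact code), both embeddings can be passed as separate vector clauses in a single Azure AI Search call; the field names `embedding` and `images/embedding` are assumptions.

```python
from azure.search.documents.models import VectorizedQuery


def multivector_search(search_client, question: str, text_vector: list[float], image_vector: list[float]):
    # One query, two vector clauses: the text embedding searches the (assumed)
    # "embedding" field and the image embedding searches the (assumed)
    # "images/embedding" field, alongside keyword search over the question text.
    return search_client.search(
        search_text=question,
        vector_queries=[
            VectorizedQuery(vector=text_vector, k_nearest_neighbors=50, fields="embedding"),
            VectorizedQuery(vector=image_vector, k_nearest_neighbors=50, fields="images/embedding"),
        ],
        top=5,
    )
```
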
With this feature enabled, the following changes are made:
* **Search index**: We add a new field `images` to the Azure AI Search index to store information about the images associated with a chunk. The field is a complex field that contains the embedding returned by the multimodal Azure AI Vision API, the bounding box, and the URL of the image in Azure Blob Storage. (A rough sketch of such a field follows this list.)
* **Data ingestion**: In addition to the usual data ingestion flow, the document extraction process will extract images from the documents using Document Intelligence, store the images in Azure Blob Storage with a citation at the top border, and vectorize the images using the Azure AI Vision service.
* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the images to the LLM, and ask it to answer the question based on both kinds of sources.
* **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.
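
For illustration only, a complex `images` collection like the one described above might be declared with the Azure AI Search Python SDK roughly as follows; the sub-field names, vector dimensions, and vector profile name are assumptions, not the sample's actual schema.

```python
from azure.search.documents.indexes.models import (
    ComplexField,
    SearchField,
    SearchFieldDataType,
    SimpleField,
)

# A complex collection holding, per image: its AI Vision embedding, a bounding
# box, and the blob URL. Names, dimensions, and the vector profile are assumed.
images_field = ComplexField(
    name="images",
    collection=True,
    fields=[
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            vector_search_dimensions=1024,
            vector_search_profile_name="images-vector-profile",
        ),
        SimpleField(name="boundingbox", type=SearchFieldDataType.String),
        SimpleField(name="url", type=SearchFieldDataType.String),
    ],
)
```
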
## Using the feature
### Prerequisites
### Deployment
1. **Enable multimodal capabilities**
    Set the azd environment variable to enable the multimodal feature:

    ```shell
    azd env set USE_MULTIMODAL true
    ```
2. **Provision the multimodal resources**
    Either run `azd up` if you haven't run it before, or run `azd provision` to provision the multimodal resources. This will create a new Azure AI Vision account and update the Azure AI Search index to include the new image embedding field.

* If you're using the sample data, try one of the sample questions about the financial documents.
* Check the "Thought process" tab to see how the multimodal approach was used
* Check the "Supporting content" tab to see the text and images that were used to generate the answer.
* Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.
5. **Customize the multimodal approach:**
The agent *will* perform the multimodal vector embedding search, but it will not return images in the response, so we cannot send the images to the chat completion model.
* This feature *is* compatible with the [reasoning models](./reasoning.md) feature, as long as you use a model that [supports image inputs](https://learn.microsoft.com/azure/ai-services/openai/how-to/reasoning?tabs=python-secure%2Cpy#api--feature-support).
* This feature is *mostly* compatible with [integrated vectorization](./integrated_vectorization.md). The extraction process will not be exactly the same, so the chunks will not be identical, and the extracted images will not contain citations.