
Commit 659d401

Address more TODOs
1 parent 4592837 commit 659d401

File tree

7 files changed: +39 -58 lines


app/backend/prepdocslib/pdfparser.py

Lines changed: 1 addition & 1 deletion
@@ -299,7 +299,7 @@ def crop_image_from_pdf_page(
     # Scale the bounding box to 72 DPI
     bbox_dpi = 72
     # We multiply using unpacking to ensure the resulting tuple has the correct number of elements
-    x0, y0, x1, y1 = (x * bbox_dpi for x in bbox_inches)
+    x0, y0, x1, y1 = (round(x * bbox_dpi, 2) for x in bbox_inches)
     bbox_pixels = (x0, y0, x1, y1)
     rect = pymupdf.Rect(bbox_pixels)
     # Assume that the PDF has 300 DPI,
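For readers skimming the diff, here is a minimal sketch of what the rounding change does, assuming `bbox_inches` is an `(x0, y0, x1, y1)` tuple measured in inches, as the surrounding code suggests:

```python
# Minimal sketch of the scaling-and-rounding step above; assumes bbox_inches is an
# (x0, y0, x1, y1) tuple measured in inches, as in crop_image_from_pdf_page.
import pymupdf  # PyMuPDF

def bbox_inches_to_rect(bbox_inches: tuple[float, float, float, float]) -> pymupdf.Rect:
    bbox_dpi = 72  # PDF coordinates are expressed in points, 72 per inch
    # Rounding to 2 decimals keeps the resulting coordinates stable,
    # which makes them easy to assert on in tests.
    x0, y0, x1, y1 = (round(x * bbox_dpi, 2) for x in bbox_inches)
    return pymupdf.Rect(x0, y0, x1, y1)
```

The updated test in `tests/test_pdfparser.py` (further down in this commit) asserts the rounded coordinates directly.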
File renamed without changes.

docs/customization.md

Lines changed: 16 additions & 21 deletions
@@ -40,42 +40,37 @@ Typically, the primary backend code you'll want to customize is the `app/backend

 The chat tab uses the approach programmed in [chatreadretrieveread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/chatreadretrieveread.py).

-1. It calls the OpenAI ChatCompletion API to turn the user question into a good search query, using the prompt and tools from [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty).
-2. It queries Azure AI Search for search results for that query (optionally using the vector embeddings for that query).
-3. It then calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty). That call includes the past message history as well (or as many messages fit inside the model's token limit).
+1. **Query rewriting**: It calls the OpenAI ChatCompletion API to turn the user question into a good search query, using the prompt and tools from [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty).
+2. **Search**: It queries Azure AI Search for search results for that query (optionally using the vector embeddings for that query).
+3. **Answering**: It then calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty). That call includes the past message history as well (or as many messages fit inside the model's token limit).

 The prompts are currently tailored to the sample data since they start with "Assistant helps the company employees with their healthcare plan questions, and questions about the employee handbook." Modify the [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty) and [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty) prompts to match your data.

-##### Chat with vision
+##### Chat with multimodal feature

-TODO FIX THIS!
+If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG,
+there are several differences in the chat approach:

-If you followed the instructions in [the multimodal guide](multimodal.md) to enable the vision approach and the "Use GPT vision model" option is selected, then the chat tab will use the `chatreadretrievereadvision.py` approach instead. This approach is similar to the `chatreadretrieveread.py` approach, with a few differences:
-
-1. Step 1 is the same as before, except it uses the GPT-4 Vision model instead of the default GPT-3.5 model.
-2. For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the `imageEmbeddings` fields in the indexed documents. For each matching document, it downloads the image blob and converts it to a base 64 encoding.
-3. When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the GPT4 Vision model (similar to this [documentation example](https://platform.openai.com/docs/guides/vision/quick-start)). The model generates a response that includes citations to the images, and the UI renders the base64 encoded images when a citation is clicked.
-
-The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
+1. **Query rewriting**: Unchanged.
+2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

 #### Ask tab

 The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).

-1. It queries Azure AI Search for search results for the user question (optionally using the vector embeddings for that question).
-2. It then combines the search results and user question, and calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty).
+1. **Search**: It queries Azure AI Search for search results for the user question (optionally using the vector embeddings for that question).
+2. **Answering**: It then combines the search results and user question, and calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty).

 The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions." Modify [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty) to match your data.

-#### Ask with vision
-
-TODO FIX THIS!
-If you followed the instructions in [the multimodal guide](multimodal.md) to enable the vision approach and the "Use GPT vision model" option is selected, then the ask tab will use the `retrievethenreadvision.py` approach instead. This approach is similar to the `retrievethenread.py` approach, with a few differences:
+#### Ask with multimodal feature

-1. For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the `imageEmbeddings` fields in the indexed documents. For each matching document, it downloads the image blob and converts it to a base 64 encoding.
-2. When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the GPT4 Vision model (similar to this [documentation example](https://platform.openai.com/docs/guides/vision/quick-start)). The model generates a response that includes citations to the images, and the UI renders the base64 encoded images when a citation is clicked.
+If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG,
+there are several differences in the ask approach:

-The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd". Modify the [ask_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question_vision.prompty) prompt to match your data.
+1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+2. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

 #### Making settings overrides permanent

docs/deploy_features.md

Lines changed: 3 additions & 5 deletions
@@ -219,7 +219,7 @@ If you have already deployed:

 ## Enabling multimodal embeddings and answering

-⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization). TODO:
+⚠️ This feature is not currently compatible with [agentic retrieval](./agentic_retrieval.md).

 When your documents include images, you can optionally enable this feature that can
 use image embeddings when searching and also use images when answering questions.

@@ -229,8 +229,8 @@ Learn more in the [multimodal guide](./multimodal.md).
 ## Enabling media description with Azure Content Understanding

 ⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization).
-
-It is compatible with the [multimodal feature](./multimodal.md), but the features provide similar functionality. TODO: UPDATE
+It is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
+so you may want to enable the multimodal feature instead or as well.

 By default, if your documents contain image-like figures, the data ingestion process will ignore those figures,
 so users will not be able to ask questions about them.

@@ -318,8 +318,6 @@ azd env set USE_SPEECH_OUTPUT_BROWSER true

 ## Enabling Integrated Vectorization

-⚠️ This feature is not currently compatible with the [multimodal feature](./multimodal.md). TODO: UPDATE
-
 Azure AI search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/blog/azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in-azure-ai-search/3960809). This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.

 To enable integrated vectorization with this sample:

docs/images/multimodal.png

295 KB

docs/multimodal.md

Lines changed: 18 additions & 29 deletions
@@ -1,28 +1,20 @@
-# RAG chat: Using GPT vision model with RAG approach
+# RAG chat: Support for multimodal documents

-TODO: UPDATE THIS!
-
-[📺 Watch: (RAG Deep Dive series) Multimedia data ingestion](https://www.youtube.com/watch?v=5FfIy7G2WW0)
-
-This repository includes an optional feature that uses the GPT vision model to generate responses based on retrieved content. This feature is useful for answering questions based on the visual content of documents, such as photos and charts.
-
-## How it works
+This repository includes an optional feature that uses multimodal embedding models and multimodal chat completion models
+to better handle documents that contain images, such as financial reports with charts and graphs.

 With this feature enabled, the data ingestion process will extract images from your documents
 using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.

 During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o` or `gpt-4o-mini`.

-----OLD----
-When this feature is enabled, the following changes are made to the application:
+With this feature enabled, the following changes are made:

-* **Search index**: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings).
-* **Data ingestion**: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image with the filename rendered on top, and add the embedding to the index.
-* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources.
+* **Search index**: We add a new field "images" to the Azure AI Search index to store information about the images associated with a chunk. The field is a complex field that contains the embedding returned by the multimodal Azure AI Vision API, the bounding box, and the URL of the image in Azure Blob Storage.
+* **Data ingestion**: In addition to the usual data ingestion flow, the document extraction process will extract images from the documents using Document Intelligence, store the images in Azure Blob Storage with a citation at the top border, and vectorize the images using the Azure AI Vision service.
+* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to the LLM, and ask it to answer the question based on both kinds of sources.
 * **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.

-For more details on how this feature works, read [this blog post](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/integrating-vision-into-rag-applications/4239460) or watch [this video](https://www.youtube.com/live/C3Zq3z4UQm4?si=SSPowBBJoTBKZ9WW&t=89).
-
 ## Using the feature

 ### Prerequisites
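As an aside on the ingestion flow described above, the image vectorization step could look roughly like the sketch below; the `api-version` and `model-version` values are assumptions and may differ from what the sample uses:

```python
# Hedged sketch of vectorizing an extracted image with the Azure AI Vision
# multimodal embeddings API; api-version/model-version values are assumptions.
import requests

def vectorize_image_with_ai_vision(endpoint: str, key: str, image_bytes: bytes) -> list[float]:
    # Returns the embedding that would be stored in the search index alongside
    # the image's blob URL and bounding box.
    resp = requests.post(
        f"{endpoint}/computervision/retrieval:vectorizeImage",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    resp.raise_for_status()
    return resp.json()["vector"]
```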
@@ -31,21 +23,15 @@ For more details on how this feature works, read [this blog post](https://techco

 ### Deployment

-1. **Enable multimodal capabilities:**
-
-    First, make sure you do *not* have integrated vectorization enabled, since that is currently incompatible: (TODO!)
+1. **Enable multimodal capabilities**

-    ```shell
-    azd env set USE_FEATURE_INT_VECTORIZATION false
-    ```
-
-    Then set the azd environment variable to enable the multimodal feature:
+    Set the azd environment variable to enable the multimodal feature:

     ```shell
     azd env set USE_MULTIMODAL true
     ```

-2. **Provision the multimodal resources:**
+2. **Provision the multimodal resources**

     Either run `azd up` if you haven't run it before, or run `azd provision` to provision the multimodal resources. This will create a new Azure AI Vision account and update the Azure AI Search index to include the new image embedding field.

@@ -73,11 +59,13 @@ For more details on how this feature works, read [this blog post](https://techco
     ```

 4. **Try out the feature:**
-    ![GPT4V configuration screenshot](./images/gpt4v.png)
-    * Access the developer options in the web app and select "Use GPT vision model".
-    * New sample questions will show up in the UI that are based on the sample financial document.
-    * Try out a question and see the answer generated by the GPT vision model.
-    * Check the 'Thought process' and 'Supporting content' tabs.
+
+    ![Screenshot of app with Developer Settings open, showing multimodal settings highlighted](./images/multimodal.png)
+
+    * If you're using the sample data, try one of the sample questions about the financial documents.
+    * Check the "Thought process" tab to see how the multimodal approach was used
+    * Check the "Supporting content" tab to see the text and images that were used to generate the answer.
+    * Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.

 5. **Customize the multimodal approach:**

@@ -114,3 +102,4 @@ For more details on how this feature works, read [this blog post](https://techco
   The agent *will* perform the multimodal vector embedding search, but it will not return images in the response,
   so we cannot send the images to the chat completion model.
 * This feature *is* compatible with the [reasoning models](./reasoning.md) feature, as long as you use a model that [supports image inputs](https://learn.microsoft.com/azure/ai-services/openai/how-to/reasoning?tabs=python-secure%2Cpy#api--feature-support).
+* This feature is *mostly* compatible with [integrated vectorization](./integrated_vectorization.md). The extraction process will not be exactly the same, so the chunks will not be identical, and the extracted images will not contain citations.
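And as a rough illustration of the multi-vector query described in this document, here is a sketch under assumptions (not this repo's code): the index field names `embedding`, `images/embedding`, and `sourcepage` are guesses.

```python
# Hedged sketch of a multi-vector query over text and image embeddings.
# Field names ("embedding", "images/embedding", "sourcepage") are assumptions,
# not necessarily what this sample's index uses.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

question = "What was the revenue trend shown in the bar chart?"
text_vector: list[float] = []   # compute with your OpenAI text embedding model
image_vector: list[float] = []  # compute with the Azure AI Vision vectorizeText API

search_client = SearchClient(
    "https://<search-service>.search.windows.net",
    "<index-name>",
    AzureKeyCredential("<search-key>"),
)
results = search_client.search(
    search_text=question,  # hybrid: keyword search plus both vector queries
    vector_queries=[
        VectorizedQuery(vector=text_vector, k_nearest_neighbors=50, fields="embedding"),
        VectorizedQuery(vector=image_vector, k_nearest_neighbors=50, fields="images/embedding"),
    ],
    top=5,
)
for doc in results:
    print(doc["sourcepage"], [img["url"] for img in doc.get("images", [])])
```

The retrieved image URLs would then be downloaded from Blob Storage, base64-encoded, and passed to the chat completion model as described earlier.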

tests/test_pdfparser.py

Lines changed: 1 addition & 2 deletions
@@ -52,6 +52,7 @@ def test_crop_image_from_pdf_page():
     assert len(cropped_image_bytes) > 0
     assert bbox_pixels is not None
     assert len(bbox_pixels) == 4
+    assert bbox_pixels == (105.86, 204.27, 398.74, 475.36)  # Coordinates in pixels

     # Verify the output is a valid image
     cropped_image = Image.open(io.BytesIO(cropped_image_bytes))
@@ -62,8 +63,6 @@ def test_crop_image_from_pdf_page():
     expected_image = Image.open(TEST_DATA_DIR / "Financial Market Analysis Report 2023_page2_figure.png")
     assert_image_equal(cropped_image, expected_image)

-    # TODO: assert bbox pixels too
-

 def test_table_to_html():
     table = DocumentTable(
