
Commit 659d401

Address more TODOs
1 parent 4592837 commit 659d401

File tree

7 files changed: +39 -58 lines


app/backend/prepdocslib/pdfparser.py

Lines changed: 1 addition & 1 deletion
@@ -299,7 +299,7 @@ def crop_image_from_pdf_page(
     # Scale the bounding box to 72 DPI
     bbox_dpi = 72
     # We multiply using unpacking to ensure the resulting tuple has the correct number of elements
-    x0, y0, x1, y1 = (x * bbox_dpi for x in bbox_inches)
+    x0, y0, x1, y1 = (round(x * bbox_dpi, 2) for x in bbox_inches)
     bbox_pixels = (x0, y0, x1, y1)
     rect = pymupdf.Rect(bbox_pixels)
     # Assume that the PDF has 300 DPI,
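For readers skimming the diff, here is a minimal sketch of what the rounding change does, assuming `bbox_inches` is an `(x0, y0, x1, y1)` tuple measured in inches, as the surrounding code suggests:

```python
# Minimal sketch of the scaling-and-rounding step above; assumes bbox_inches is an
# (x0, y0, x1, y1) tuple measured in inches, as in crop_image_from_pdf_page.
import pymupdf  # PyMuPDF

def bbox_inches_to_rect(bbox_inches: tuple[float, float, float, float]) -> pymupdf.Rect:
    bbox_dpi = 72  # PDF coordinates are expressed in points, 72 per inch
    # Rounding to 2 decimals keeps the resulting coordinates stable,
    # which makes them easy to assert on in tests.
    x0, y0, x1, y1 = (round(x * bbox_dpi, 2) for x in bbox_inches)
    return pymupdf.Rect(x0, y0, x1, y1)
```

The updated test in `tests/test_pdfparser.py` (further down in this commit) asserts the rounded coordinates directly.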
File renamed without changes.

docs/customization.md

Lines changed: 16 additions & 21 deletions
@@ -40,42 +40,37 @@ Typically, the primary backend code you'll want to customize is the `app/backend

 The chat tab uses the approach programmed in [chatreadretrieveread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/chatreadretrieveread.py).

-1. It calls the OpenAI ChatCompletion API to turn the user question into a good search query, using the prompt and tools from [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty).
-2. It queries Azure AI Search for search results for that query (optionally using the vector embeddings for that query).
-3. It then calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty). That call includes the past message history as well (or as many messages fit inside the model's token limit).
+1. **Query rewriting**: It calls the OpenAI ChatCompletion API to turn the user question into a good search query, using the prompt and tools from [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty).
+2. **Search**: It queries Azure AI Search for search results for that query (optionally using the vector embeddings for that query).
+3. **Answering**: It then calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty). That call includes the past message history as well (or as many messages fit inside the model's token limit).

 The prompts are currently tailored to the sample data since they start with "Assistant helps the company employees with their healthcare plan questions, and questions about the employee handbook." Modify the [chat_query_rewrite.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_query_rewrite.prompty) and [chat_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question.prompty) prompts to match your data.

-##### Chat with vision
+##### Chat with multimodal feature

-TODO FIX THIS!
+If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG,
+there are several differences in the chat approach:

-If you followed the instructions in [the multimodal guide](multimodal.md) to enable the vision approach and the "Use GPT vision model" option is selected, then the chat tab will use the `chatreadretrievereadvision.py` approach instead. This approach is similar to the `chatreadretrieveread.py` approach, with a few differences:
-
-1. Step 1 is the same as before, except it uses the GPT-4 Vision model instead of the default GPT-3.5 model.
-2. For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the `imageEmbeddings` fields in the indexed documents. For each matching document, it downloads the image blob and converts it to a base 64 encoding.
-3. When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the GPT4 Vision model (similar to this [documentation example](https://platform.openai.com/docs/guides/vision/quick-start)). The model generates a response that includes citations to the images, and the UI renders the base64 encoded images when a citation is clicked.
-
-The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd.". Modify the [chat_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/chat_answer_question_vision.prompty) prompt to match your data.
+1. **Query rewriting**: Unchanged.
+2. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+3. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

 #### Ask tab

 The ask tab uses the approach programmed in [retrievethenread.py](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/retrievethenread.py).

-1. It queries Azure AI Search for search results for the user question (optionally using the vector embeddings for that question).
-2. It then combines the search results and user question, and calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty).
+1. **Search**: It queries Azure AI Search for search results for the user question (optionally using the vector embeddings for that question).
+2. **Answering**: It then combines the search results and user question, and calls the OpenAI ChatCompletion API to answer the question based on the sources, using the prompt from [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty).

 The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping Contoso Inc employees with their healthcare plan questions and employee handbook questions." Modify [ask_answer_question.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question.prompty) to match your data.

-#### Ask with vision
-
-TODO FIX THIS!
-If you followed the instructions in [the multimodal guide](multimodal.md) to enable the vision approach and the "Use GPT vision model" option is selected, then the ask tab will use the `retrievethenreadvision.py` approach instead. This approach is similar to the `retrievethenread.py` approach, with a few differences:
+#### Ask with multimodal feature

-1. For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the `imageEmbeddings` fields in the indexed documents. For each matching document, it downloads the image blob and converts it to a base 64 encoding.
-2. When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the GPT4 Vision model (similar to this [documentation example](https://platform.openai.com/docs/guides/vision/quick-start)). The model generates a response that includes citations to the images, and the UI renders the base64 encoded images when a citation is clicked.
+If you followed the instructions in [the multimodal guide](multimodal.md) to enable multimodal RAG,
+there are several differences in the ask approach:

-The prompt for step 2 is currently tailored to the sample data since it starts with "You are an intelligent assistant helping analyze the Annual Financial Report of Contoso Ltd". Modify the [ask_answer_question_vision.prompty](https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/app/backend/approaches/prompts/ask_answer_question_vision.prompty) prompt to match your data.
+1. **Search**: For this step, it also calculates a vector embedding for the user question using [the Azure AI Vision vectorize text API](https://learn.microsoft.com/azure/ai-services/computer-vision/how-to/image-retrieval#call-the-vectorize-text-api), and passes that to the Azure AI Search to compare against the image embedding fields in the indexed documents. For each matching document, it downloads each associated image from Azure Blob Storage and converts it to a base 64 encoding.
+2. **Answering**: When it combines the search results and user question, it includes the base 64 encoded images, and sends along both the text and images to the multimodal LLM. The model generates a response that includes citations to the images, and the UI renders the images when a citation is clicked.

 #### Making settings overrides permanent

docs/deploy_features.md

Lines changed: 3 additions & 5 deletions
@@ -219,7 +219,7 @@ If you have already deployed:

 ## Enabling multimodal embeddings and answering

-⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization). TODO:
+⚠️ This feature is not currently compatible with [agentic retrieval](./agentic_retrieval.md).

 When your documents include images, you can optionally enable this feature that can
 use image embeddings when searching and also use images when answering questions.

@@ -229,8 +229,8 @@ Learn more in the [multimodal guide](./multimodal.md).
 ## Enabling media description with Azure Content Understanding

 ⚠️ This feature is not currently compatible with [integrated vectorization](#enabling-integrated-vectorization).
-
-It is compatible with the [multimodal feature](./multimodal.md), but the features provide similar functionality. TODO: UPDATE
+It is compatible with the [multimodal feature](./multimodal.md), but this feature enables only a subset of multimodal capabilities,
+so you may want to enable the multimodal feature instead or as well.

 By default, if your documents contain image-like figures, the data ingestion process will ignore those figures,
 so users will not be able to ask questions about them.

@@ -318,8 +318,6 @@ azd env set USE_SPEECH_OUTPUT_BROWSER true

 ## Enabling Integrated Vectorization

-⚠️ This feature is not currently compatible with the [multimodal feature](./multimodal.md). TODO: UPDATE
-
 Azure AI search recently introduced an [integrated vectorization feature in preview mode](https://techcommunity.microsoft.com/blog/azure-ai-services-blog/announcing-the-public-preview-of-integrated-vectorization-in-azure-ai-search/3960809). This feature is a cloud-based approach to data ingestion, which takes care of document format cracking, data extraction, chunking, vectorization, and indexing, all with Azure technologies.

 To enable integrated vectorization with this sample:

docs/images/multimodal.png

295 KB

docs/multimodal.md

Lines changed: 18 additions & 29 deletions
@@ -1,28 +1,20 @@
-# RAG chat: Using GPT vision model with RAG approach
+# RAG chat: Support for multimodal documents

-TODO: UPDATE THIS!
-
-[📺 Watch: (RAG Deep Dive series) Multimedia data ingestion](https://www.youtube.com/watch?v=5FfIy7G2WW0)
-
-This repository includes an optional feature that uses the GPT vision model to generate responses based on retrieved content. This feature is useful for answering questions based on the visual content of documents, such as photos and charts.
-
-## How it works
+This repository includes an optional feature that uses multimodal embedding models and multimodal chat completion models
+to better handle documents that contain images, such as financial reports with charts and graphs.

 With this feature enabled, the data ingestion process will extract images from your documents
 using Document Intelligence, store the images in Azure Blob Storage, vectorize the images using the Azure AI Vision service, and store the image embeddings in the Azure AI Search index.

 During the RAG flow, the app will perform a multi-vector query using both text and image embeddings, and then send any images associated with the retrieved document chunks to the chat completion model for answering questions. This feature assumes that your chat completion model supports multimodal inputs, such as `gpt-4o` or `gpt-4o-mini`.

-----OLD----
-When this feature is enabled, the following changes are made to the application:
+With this feature enabled, the following changes are made:

-* **Search index**: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings).
-* **Data ingestion**: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image with the filename rendered on top, and add the embedding to the index.
-* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources.
+* **Search index**: We add a new field "images" to the Azure AI Search index to store information about the images associated with a chunk. The field is a complex field that contains the embedding returned by the multimodal Azure AI Vision API, the bounding box, and the URL of the image in Azure Blob Storage.
+* **Data ingestion**: In addition to the usual data ingestion flow, the document extraction process will extract images from the documents using Document Intelligence, store the images in Azure Blob Storage with a citation at the top border, and vectorize the images using the Azure AI Vision service.
+* **Question answering**: We search the index using both the text and multimodal embeddings. We send both the text and the image to the LLM, and ask it to answer the question based on both kinds of sources.
 * **Citations**: The frontend displays both image sources and text sources, to help users understand how the answer was generated.

-For more details on how this feature works, read [this blog post](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/integrating-vision-into-rag-applications/4239460) or watch [this video](https://www.youtube.com/live/C3Zq3z4UQm4?si=SSPowBBJoTBKZ9WW&t=89).
-
 ## Using the feature

 ### Prerequisites
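As an aside on the ingestion flow described above, the image vectorization step could look roughly like the sketch below; the `api-version` and `model-version` values are assumptions and may differ from what the sample uses:

```python
# Hedged sketch of vectorizing an extracted image with the Azure AI Vision
# multimodal embeddings API; api-version/model-version values are assumptions.
import requests

def vectorize_image_with_ai_vision(endpoint: str, key: str, image_bytes: bytes) -> list[float]:
    # Returns the embedding that would be stored in the search index alongside
    # the image's blob URL and bounding box.
    resp = requests.post(
        f"{endpoint}/computervision/retrieval:vectorizeImage",
        params={"api-version": "2024-02-01", "model-version": "2023-04-15"},
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        data=image_bytes,
    )
    resp.raise_for_status()
    return resp.json()["vector"]
```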
@@ -31,21 +23,15 @@ For more details on how this feature works, read [this blog post](https://techco

 ### Deployment

-1. **Enable multimodal capabilities:**
-
-    First, make sure you do *not* have integrated vectorization enabled, since that is currently incompatible: (TODO!)
+1. **Enable multimodal capabilities**

-    ```shell
-    azd env set USE_FEATURE_INT_VECTORIZATION false
-    ```
-
-    Then set the azd environment variable to enable the multimodal feature:
+    Set the azd environment variable to enable the multimodal feature:

     ```shell
     azd env set USE_MULTIMODAL true
     ```

-2. **Provision the multimodal resources:**
+2. **Provision the multimodal resources**

     Either run `azd up` if you haven't run it before, or run `azd provision` to provision the multimodal resources. This will create a new Azure AI Vision account and update the Azure AI Search index to include the new image embedding field.

@@ -73,11 +59,13 @@ For more details on how this feature works, read [this blog post](https://techco
     ```

 4. **Try out the feature:**
-    ![GPT4V configuration screenshot](./images/gpt4v.png)
-    * Access the developer options in the web app and select "Use GPT vision model".
-    * New sample questions will show up in the UI that are based on the sample financial document.
-    * Try out a question and see the answer generated by the GPT vision model.
-    * Check the 'Thought process' and 'Supporting content' tabs.
+
+    ![Screenshot of app with Developer Settings open, showing multimodal settings highlighted](./images/multimodal.png)
+
+    * If you're using the sample data, try one of the sample questions about the financial documents.
+    * Check the "Thought process" tab to see how the multimodal approach was used
+    * Check the "Supporting content" tab to see the text and images that were used to generate the answer.
+    * Open "Developer settings" and try different options for "Included vector fields" and "LLM input sources" to see how they affect the results.

 5. **Customize the multimodal approach:**

@@ -114,3 +102,4 @@ For more details on how this feature works, read [this blog post](https://techco
   The agent *will* perform the multimodal vector embedding search, but it will not return images in the response,
   so we cannot send the images to the chat completion model.
 * This feature *is* compatible with the [reasoning models](./reasoning.md) feature, as long as you use a model that [supports image inputs](https://learn.microsoft.com/azure/ai-services/openai/how-to/reasoning?tabs=python-secure%2Cpy#api--feature-support).
+* This feature is *mostly* compatible with [integrated vectorization](./integrated_vectorization.md). The extraction process will not be exactly the same, so the chunks will not be identical, and the extracted images will not contain citations.
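And as a rough illustration of the multi-vector query described in this document, here is a sketch under assumptions (not this repo's code): the index field names `embedding`, `images/embedding`, and `sourcepage` are guesses.

```python
# Hedged sketch of a multi-vector query over text and image embeddings.
# Field names ("embedding", "images/embedding", "sourcepage") are assumptions,
# not necessarily what this sample's index uses.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

question = "What was the revenue trend shown in the bar chart?"
text_vector: list[float] = []   # compute with your OpenAI text embedding model
image_vector: list[float] = []  # compute with the Azure AI Vision vectorizeText API

search_client = SearchClient(
    "https://<search-service>.search.windows.net",
    "<index-name>",
    AzureKeyCredential("<search-key>"),
)
results = search_client.search(
    search_text=question,  # hybrid: keyword search plus both vector queries
    vector_queries=[
        VectorizedQuery(vector=text_vector, k_nearest_neighbors=50, fields="embedding"),
        VectorizedQuery(vector=image_vector, k_nearest_neighbors=50, fields="images/embedding"),
    ],
    top=5,
)
for doc in results:
    print(doc["sourcepage"], [img["url"] for img in doc.get("images", [])])
```

The retrieved image URLs would then be downloaded from Blob Storage, base64-encoded, and passed to the chat completion model as described earlier.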

tests/test_pdfparser.py

Lines changed: 1 addition & 2 deletions
@@ -52,6 +52,7 @@ def test_crop_image_from_pdf_page():
     assert len(cropped_image_bytes) > 0
     assert bbox_pixels is not None
     assert len(bbox_pixels) == 4
+    assert bbox_pixels == (105.86, 204.27, 398.74, 475.36)  # Coordinates in pixels

     # Verify the output is a valid image
     cropped_image = Image.open(io.BytesIO(cropped_image_bytes))
@@ -62,8 +63,6 @@ def test_crop_image_from_pdf_page():
     expected_image = Image.open(TEST_DATA_DIR / "Financial Market Analysis Report 2023_page2_figure.png")
     assert_image_equal(cropped_image, expected_image)

-    # TODO: assert bbox pixels too
-

 def test_table_to_html():
     table = DocumentTable(
