Commit adf0353

2 parents 6fac970 + a967edf commit adf0353

8 files changed: +99 -19 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -266,6 +266,7 @@ You can find extensive documentation in the [docs](docs/README.md) folder:
 
 ### Resources
 
+- [📖 Docs: Get started using the chat with your data sample](https://learn.microsoft.com/azure/developer/python/get-started-app-chat-template?toc=%2Fazure%2Fdeveloper%2Fai%2Ftoc.json&bc=%2Fazure%2Fdeveloper%2Fai%2Fbreadcrumb%2Ftoc.json&tabs=github-codespaces)
 - [📖 Blog: Revolutionize your Enterprise Data with ChatGPT: Next-gen Apps w/ Azure OpenAI and AI Search](https://techcommunity.microsoft.com/blog/azure-ai-services-blog/revolutionize-your-enterprise-data-with-chatgpt-next-gen-apps-w-azure-openai-and/3762087)
 - [📖 Docs: Azure AI Search](https://learn.microsoft.com/azure/search/search-what-is-azure-search)
 - [📖 Docs: Azure OpenAI Service](https://learn.microsoft.com/azure/cognitive-services/openai/overview)

app/backend/prepdocslib/pdfparser.py

Lines changed: 25 additions & 9 deletions
@@ -14,6 +14,7 @@
 )
 from azure.core.credentials import AzureKeyCredential
 from azure.core.credentials_async import AsyncTokenCredential
+from azure.core.exceptions import HttpResponseError
 from PIL import Image
 from pypdf import PdfReader
 
@@ -68,6 +69,7 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
         async with DocumentIntelligenceClient(
             endpoint=self.endpoint, credential=self.credential
         ) as document_intelligence_client:
+            file_analyzed = False
             if self.use_content_understanding:
                 if self.content_understanding_endpoint is None:
                     raise ValueError("Content Understanding is enabled but no endpoint was provided")
@@ -77,15 +79,29 @@ async def parse(self, content: IO) -> AsyncGenerator[Page, None]:
                     )
                 cu_describer = ContentUnderstandingDescriber(self.content_understanding_endpoint, self.credential)
                 content_bytes = content.read()
-                poller = await document_intelligence_client.begin_analyze_document(
-                    model_id="prebuilt-layout",
-                    analyze_request=AnalyzeDocumentRequest(bytes_source=content_bytes),
-                    output=["figures"],
-                    features=["ocrHighResolution"],
-                    output_content_format="markdown",
-                )
-                doc_for_pymupdf = pymupdf.open(stream=io.BytesIO(content_bytes))
-            else:
+                try:
+                    poller = await document_intelligence_client.begin_analyze_document(
+                        model_id="prebuilt-layout",
+                        analyze_request=AnalyzeDocumentRequest(bytes_source=content_bytes),
+                        output=["figures"],
+                        features=["ocrHighResolution"],
+                        output_content_format="markdown",
+                    )
+                    doc_for_pymupdf = pymupdf.open(stream=io.BytesIO(content_bytes))
+                    file_analyzed = True
+                except HttpResponseError as e:
+                    content.seek(0)
+                    if e.error and e.error.code == "InvalidArgument":
+                        logger.error(
+                            "This document type does not support media description. Proceeding with standard analysis."
+                        )
+                    else:
+                        logger.error(
+                            "Unexpected error analyzing document for media description: %s. Proceeding with standard analysis.",
+                            e,
+                        )
+
+            if file_analyzed is False:
                 poller = await document_intelligence_client.begin_analyze_document(
                     model_id=self.model_id, analyze_request=content, content_type="application/octet-stream"
                 )
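
Note on the hunk above: the change follows a "try the richer call, fall back to the standard one" pattern, with the stream rewound before the fallback because `content.read()` has already consumed it. The sketch below illustrates only that control flow, using the standard library. Every name in it (`UnsupportedFormatError`, `analyze_with_figures`, `analyze_standard`, `analyze`) is hypothetical and stands in for the Document Intelligence calls; it is not the repository's code.

```python
import io
import logging

logger = logging.getLogger("fallback_sketch")


class UnsupportedFormatError(Exception):
    """Hypothetical stand-in for an HttpResponseError with code 'InvalidArgument'."""


def analyze_with_figures(data: bytes) -> str:
    # Hypothetical rich analysis: accepts only content starting with the PDF magic bytes.
    if not data.startswith(b"%PDF"):
        raise UnsupportedFormatError("figures output not supported for this format")
    return "rich"


def analyze_standard(stream: io.BytesIO) -> str:
    # Hypothetical standard analysis: reads the stream from its current position.
    return "standard" if stream.read() else "empty"


def analyze(stream: io.BytesIO) -> str:
    analyzed = False
    result = ""
    data = stream.read()  # consumes the stream, like content.read() in the diff
    try:
        result = analyze_with_figures(data)
        analyzed = True
    except UnsupportedFormatError as e:
        stream.seek(0)  # rewind so the fallback sees the whole document
        logger.error("Rich analysis failed: %s. Proceeding with standard analysis.", e)
    if not analyzed:
        result = analyze_standard(stream)
    return result
```

In this sketch, dropping the `stream.seek(0)` line would make the fallback read an exhausted stream and return `"empty"`, which is presumably the situation the `content.seek(0)` call in the diff guards against.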

app/backend/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -197,7 +197,7 @@ msal==1.30.0
     #   msal-extensions
 msal-extensions==1.2.0
     # via azure-identity
-msgraph-core==1.1.8
+msgraph-core==1.1.7
     # via msgraph-sdk
 msgraph-sdk==1.16.0
     # via -r requirements.in

app/frontend/package-lock.json

Lines changed: 4 additions & 5 deletions
Some generated files are not rendered by default.

app/frontend/package.json

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@
         "dompurify": "^3.2.0",
         "i18next": "^24.2.0",
         "i18next-browser-languagedetector": "^8.0.0",
-        "i18next-http-backend": "^2.5.2",
+        "i18next-http-backend": "^3.0.1",
         "idb": "^8.0.0",
         "ndjson-readablestream": "^1.2.0",
         "react": "^18.3.1",

docs/azd.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ The `azd up` command uses the `azure.yaml` file combined with the infrastructure
 
 Next, it provisions the resources based on `main.bicep` and `main.parameters.json`. At that point, since there is no default value for the OpenAI resource location, it asks you to pick a location from a short list of available regions. Then it will send requests to Azure to provision all the required resources. With everything provisioned, it runs the `postprovision` hook to process the local data and add it to an Azure AI Search index.
 
-Finally, it looks at `azure.yaml` to determine the Azure host (appservice, in this case) and uploads the zip to Azure App Service. The `azd up` command is now complete, but it may take another 5-10 minutes for the App Service app to be fully available and working, especially for the initial deploy.
+Finally, it looks at `azure.yaml` to determine the Azure host and uploads the zip to Azure App Service. The `azd up` command is now complete, but it may take another 5-10 minutes for the App Service app to be fully available and working, especially for the initial deploy.
 
 Related commands are `azd provision` for just provisioning (if infra files change) and `azd deploy` for just deploying updated app code.
 
docs/deploy_features.md

Lines changed: 3 additions & 1 deletion
@@ -163,7 +163,6 @@ By default, if your documents contain image-like figures, the data ingestion pro
 so users will not be able to ask questions about them.
 
 You can optionably enable the description of media content using Azure Content Understanding. When enabled, the data ingestion process will send figures to Azure Content Understanding and replace the figure with the description in the indexed document.
-To learn more about this process and compare it to the gpt-4 vision integration, see [this guide](./data_ingestion.md#media-description).
 
 To enable media description with Azure Content Understanding, run:
 
@@ -175,6 +174,9 @@ If you have already run `azd up`, you will need to run `azd provision` to create
 If you have already indexed your documents and want to re-index them with the media descriptions,
 first [remove the existing documents](./data_ingestion.md#removing-documents) and then [re-ingest the data](./data_ingestion.md#indexing-additional-documents).
 
+⚠️ This feature does not yet support DOCX, PPTX, or XLSX formats. If you have figures in those formats, they will be ignored.
+Convert them first to PDF or image formats to enable media description.
+
 ## Enabling client-side chat history
 
 This feature allows users to view the chat history of their conversation, stored in the browser using [IndexedDB](https://developer.mozilla.org/docs/Web/API/IndexedDB_API). That means the chat history will be available only on the device where the chat was initiated. To enable browser-stored chat history, run:
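
Note on the warning added in this file: it names DOCX, PPTX, and XLSX as formats that media description does not yet support. A caller could, as one possible approach, flag such files before ingestion so they can be converted to PDF or an image format first. The sketch below assumes a simple file-extension check; the helper names `supports_media_description` and `files_needing_conversion` are hypothetical and do not appear in the commit, which instead relies on the service's error response.

```python
from pathlib import PurePath

# Formats that the docs warning lists as not yet supported for media description.
UNSUPPORTED_FOR_MEDIA_DESCRIPTION = {".docx", ".pptx", ".xlsx"}


def supports_media_description(filename: str) -> bool:
    """Hypothetical helper: False for the formats named in the warning."""
    return PurePath(filename).suffix.lower() not in UNSUPPORTED_FOR_MEDIA_DESCRIPTION


def files_needing_conversion(filenames: list[str]) -> list[str]:
    """Return the files whose figures would be ignored unless converted first."""
    return [name for name in filenames if not supports_media_description(name)]
```

The check is case-insensitive on the extension, so `Slides.PPTX` is treated the same as `slides.pptx`.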

tests/test_pdfparser.py

Lines changed: 63 additions & 1 deletion
@@ -1,8 +1,9 @@
 import io
+import json
 import logging
 import math
 import pathlib
-from unittest.mock import AsyncMock, MagicMock
+from unittest.mock import AsyncMock, MagicMock, Mock
 
 import pymupdf
 import pytest
@@ -17,6 +18,7 @@
     DocumentTable,
     DocumentTableCell,
 )
+from azure.core.exceptions import HttpResponseError
 from PIL import Image, ImageChops
 
 from prepdocslib.mediadescriber import ContentUnderstandingDescriber
@@ -308,3 +310,63 @@ async def mock_describe_image(self, image_bytes):
         pages[0].text
         == "# Simple Figure\n\nThis text is before the figure and NOT part of it.\n\n\n<figure><figcaption>Figure 1<br>Pie chart</figcaption></figure>\n\n\nThis is text after the figure that's not part of it."
     )
+
+
+@pytest.mark.asyncio
+async def test_parse_unsupportedformat(monkeypatch, caplog):
+    mock_poller = MagicMock()
+
+    async def mock_begin_analyze_document(self, model_id, analyze_request, **kwargs):
+
+        if kwargs.get("features") == ["ocrHighResolution"]:
+
+            class FakeErrorOne:
+                def __init__(self):
+                    self.error = Mock(message="A fake error", code="FakeErrorOne")
+
+            class FakeHttpResponse(HttpResponseError):
+                def __init__(self, response, error, *args, **kwargs):
+                    self.error = error
+                    super().__init__(self, response=response, *args, **kwargs)
+
+            message = {
+                "error": {
+                    "code": "InvalidArgument",
+                    "message": "A fake error",
+                }
+            }
+            response = Mock(status_code=500, headers={})
+            response.text = lambda encoding=None: json.dumps(message).encode("utf-8")
+            response.headers["content-type"] = "application/json"
+            response.content_type = "application/json"
+            raise FakeHttpResponse(response, FakeErrorOne())
+        else:
+            return mock_poller
+
+    async def mock_poller_result():
+        return AnalyzeResult(
+            content="Page content",
+            pages=[DocumentPage(page_number=1, spans=[DocumentSpan(offset=0, length=12)])],
+            tables=[],
+            figures=[],
+        )
+
+    monkeypatch.setattr(DocumentIntelligenceClient, "begin_analyze_document", mock_begin_analyze_document)
+    monkeypatch.setattr(mock_poller, "result", mock_poller_result)
+
+    parser = DocumentAnalysisParser(
+        endpoint="https://example.com",
+        credential=MockAzureCredential(),
+        use_content_understanding=True,
+        content_understanding_endpoint="https://example.com",
+    )
+    content = io.BytesIO(b"pdf content bytes")
+    content.name = "test.docx"
+    with caplog.at_level(logging.ERROR):
+        pages = [page async for page in parser.parse(content)]
+        assert "This document type does not support media description." in caplog.text
+
+    assert len(pages) == 1
+    assert pages[0].page_num == 0
+    assert pages[0].offset == 0
+    assert pages[0].text == "Page content"
