Commit 0bb3f95

Add media description feature using Azure Content Understanding (#2195)
* First pass
* CU kinda working
* CU integration
* Better splitting
* Add Bicep
* Rm unneeded figures
* Remove en-us from URLs
* Fix URLs
* Remote figures output JSON
* Update matrix comments
* Make mypy happy
* Add same errors to file strategy
* Add pymupdf to skip modules for mypy
* Output the endpoint from Bicep
* 100 percent coverage for mediadescriber.py
* Tests added for PDFParser
* Fix that tuple type
* Add pricing link
* Fix content read issue
1 parent e90920f commit 0bb3f95

36 files changed

+962
-65
lines changed

.azdo/pipelines/azure-dev.yml

Lines changed: 1 addition & 0 deletions
@@ -120,6 +120,7 @@ steps:
       DEPLOYMENT_TARGET: $(DEPLOYMENT_TARGET)
       AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: $(AZURE_CONTAINER_APPS_WORKLOAD_PROFILE)
       USE_CHAT_HISTORY_BROWSER: $(USE_CHAT_HISTORY_BROWSER)
+      USE_MEDIA_DESCRIBER_AZURE_CU: $(USE_MEDIA_DESCRIBER_AZURE_CU)
   - task: AzureCLI@2
     displayName: Deploy Application
     inputs:

.github/workflows/azure-dev.yml

Lines changed: 2 additions & 1 deletion
@@ -13,7 +13,7 @@ on:
 # To configure required secrets for connecting to Azure, simply run `azd pipeline config`
 
 # Set up permissions for deploying with secretless Azure federated credentials
-# https://learn.microsoft.com/en-us/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
+# https://learn.microsoft.com/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
 permissions:
   id-token: write
   contents: read
@@ -103,6 +103,7 @@ jobs:
       DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
       AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}
       USE_CHAT_HISTORY_BROWSER: ${{ vars.USE_CHAT_HISTORY_BROWSER }}
+      USE_MEDIA_DESCRIBER_AZURE_CU: ${{ vars.USE_MEDIA_DESCRIBER_AZURE_CU }}
     steps:
       - name: Checkout
         uses: actions/checkout@v4

CONTRIBUTING.md

Lines changed: 2 additions & 0 deletions
@@ -122,6 +122,8 @@ If you followed the steps above to install the pre-commit hooks, then you can ju
 
 When adding new azd environment variables, please remember to update:
 
+1. [main.parameters.json](./infra/main.parameters.json)
+1. [appEnvVariables in main.bicep](./infra/main.bicep)
 1. App Service's [azure.yaml](./azure.yaml)
 1. [ADO pipeline](.azdo/pipelines/azure-dev.yml).
 1. [Github workflows](.github/workflows/azure-dev.yml)

README.md

Lines changed: 3 additions & 1 deletion
@@ -91,7 +91,9 @@ However, you can try the [Azure pricing calculator](https://azure.com/e/e3490de2
 - Azure AI Document Intelligence: SO (Standard) tier using pre-built layout. Pricing per document page, sample documents have 261 pages total. [Pricing](https://azure.microsoft.com/pricing/details/form-recognizer/)
 - Azure AI Search: Basic tier, 1 replica, free level of semantic search. Pricing per hour. [Pricing](https://azure.microsoft.com/pricing/details/search/)
 - Azure Blob Storage: Standard tier with ZRS (Zone-redundant storage). Pricing per storage and read operations. [Pricing](https://azure.microsoft.com/pricing/details/storage/blobs/)
-- Azure Cosmos DB: Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
+- Azure Cosmos DB: Only provisioned if you enabled [chat history with Cosmos DB](docs/deploy_features.md#enabling-persistent-chat-history-with-azure-cosmos-db). Serverless tier. Pricing per request unit and storage. [Pricing](https://azure.microsoft.com/pricing/details/cosmos-db/)
+- Azure AI Vision: Only provisioned if you enabled [GPT-4 with vision](docs/gpt4v.md). Pricing per 1K transactions. [Pricing](https://azure.microsoft.com/pricing/details/cognitive-services/computer-vision/)
+- Azure AI Content Understanding: Only provisioned if you enabled [media description](docs/deploy_features.md#enabling-media-description-with-azure-content-understanding). Pricing per 1K images. [Pricing](https://azure.microsoft.com/pricing/details/content-understanding/)
 - Azure Monitor: Pay-as-you-go tier. Costs based on data ingested. [Pricing](https://azure.microsoft.com/pricing/details/monitor/)
 
 To reduce costs, you can switch to free SKUs for various services, but those SKUs have limitations.

app/backend/gunicorn.conf.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 bind = "0.0.0.0"
 
 timeout = 230
-# https://learn.microsoft.com/en-us/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
+# https://learn.microsoft.com/troubleshoot/azure/app-service/web-apps-performance-faqs#why-does-my-request-time-out-after-230-seconds
 
 num_cpus = multiprocessing.cpu_count()
 if os.getenv("WEBSITE_SKU") == "LinuxFree":

app/backend/prepdocs.py

Lines changed: 13 additions & 3 deletions
@@ -7,6 +7,7 @@
 from azure.core.credentials import AzureKeyCredential
 from azure.core.credentials_async import AsyncTokenCredential
 from azure.identity.aio import AzureDeveloperCliCredential, get_bearer_token_provider
+from rich.logging import RichHandler
 
 from load_azd_env import load_azd_env
 from prepdocslib.blobmanager import BlobManager
@@ -158,8 +159,10 @@ def setup_file_processors(
     local_pdf_parser: bool = False,
     local_html_parser: bool = False,
     search_images: bool = False,
+    use_content_understanding: bool = False,
+    content_understanding_endpoint: Union[str, None] = None,
 ):
-    sentence_text_splitter = SentenceTextSplitter(has_image_embeddings=search_images)
+    sentence_text_splitter = SentenceTextSplitter()
 
     doc_int_parser: Optional[DocumentAnalysisParser] = None
     # check if Azure Document Intelligence credentials are provided
@@ -170,6 +173,8 @@ def setup_file_processors(
         doc_int_parser = DocumentAnalysisParser(
             endpoint=f"https://{document_intelligence_service}.cognitiveservices.azure.com/",
             credential=documentintelligence_creds,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=content_understanding_endpoint,
         )
 
     pdf_parser: Optional[Parser] = None
@@ -294,10 +299,10 @@ async def main(strategy: Strategy, setup_index: bool = True):
     args = parser.parse_args()
 
     if args.verbose:
-        logging.basicConfig(format="%(message)s")
+        logging.basicConfig(format="%(message)s", datefmt="[%X]", handlers=[RichHandler(rich_tracebacks=True)])
         # We only set the level to INFO for our logger,
         # to avoid seeing the noisy INFO level logs from the Azure SDKs
-        logger.setLevel(logging.INFO)
+        logger.setLevel(logging.DEBUG)
 
     load_azd_env()
 
@@ -309,6 +314,7 @@ async def main(strategy: Strategy, setup_index: bool = True):
     use_gptvision = os.getenv("USE_GPT4V", "").lower() == "true"
     use_acls = os.getenv("AZURE_ADLS_GEN2_STORAGE_ACCOUNT") is not None
     dont_use_vectors = os.getenv("USE_VECTORS", "").lower() == "false"
+    use_content_understanding = os.getenv("USE_MEDIA_DESCRIBER_AZURE_CU", "").lower() == "true"
 
     # Use the current user identity to connect to Azure services. See infra/main.bicep for role assignments.
     if tenant_id := os.getenv("AZURE_TENANT_ID"):
@@ -406,6 +412,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
             local_pdf_parser=os.getenv("USE_LOCAL_PDF_PARSER") == "true",
             local_html_parser=os.getenv("USE_LOCAL_HTML_PARSER") == "true",
             search_images=use_gptvision,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
         )
         image_embeddings_service = setup_image_embeddings_service(
             azure_credential=azd_credential,
@@ -424,6 +432,8 @@ async def main(strategy: Strategy, setup_index: bool = True):
             search_analyzer_name=os.getenv("AZURE_SEARCH_ANALYZER_NAME"),
             use_acls=use_acls,
             category=args.category,
+            use_content_understanding=use_content_understanding,
+            content_understanding_endpoint=os.getenv("AZURE_CONTENTUNDERSTANDING_ENDPOINT"),
         )
 
     loop.run_until_complete(main(ingestion_strategy, setup_index=not args.remove and not args.removeall))
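
Note on the new `USE_MEDIA_DESCRIBER_AZURE_CU` flag: the diff reads it with the same `os.getenv(..., "").lower() == "true"` pattern as `USE_GPT4V`, so only the string "true" (in any casing) turns the feature on. A minimal sketch of that behavior follows; the helper name `env_flag` is hypothetical and does not appear in the commit.

```python
import os


def env_flag(name: str) -> bool:
    """Hypothetical helper mirroring the pattern used in prepdocs.py.

    Returns True only when the variable is set to "true" in any casing.
    Unset variables and values such as "1" or "yes" count as False.
    """
    return os.getenv(name, "").lower() == "true"


os.environ["USE_MEDIA_DESCRIBER_AZURE_CU"] = "True"
print(env_flag("USE_MEDIA_DESCRIBER_AZURE_CU"))  # True

os.environ["USE_MEDIA_DESCRIBER_AZURE_CU"] = "1"
print(env_flag("USE_MEDIA_DESCRIBER_AZURE_CU"))  # False
```

In practice this means a value like `1` set through `azd env set` would silently leave media description disabled.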

app/backend/prepdocslib/blobmanager.py

Lines changed: 1 addition & 1 deletion
@@ -171,7 +171,7 @@ def sourcepage_from_file_page(cls, filename, page=0) -> str:
 
     @classmethod
     def blob_image_name_from_file_page(cls, filename, page=0) -> str:
-        return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".png"
+        return os.path.splitext(os.path.basename(filename))[0] + f"-{page+1}" + ".png"
 
     @classmethod
     def blob_name_from_file_name(cls, filename) -> str:
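
Note on the `page+1` change: page numbers stay 0-indexed internally (see the `page.py` docstrings in this commit), while image blob names become 1-indexed. A standalone sketch of the new naming, written as a plain function rather than the `BlobManager` classmethod, with made-up file names:

```python
import os


def blob_image_name_from_file_page(filename: str, page: int = 0) -> str:
    """Build the PNG blob name for a 0-indexed page, using a 1-indexed suffix."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return f"{stem}-{page + 1}.png"


print(blob_image_name_from_file_page("data/report.pdf"))     # report-1.png
print(blob_image_name_from_file_page("data/report.pdf", 4))  # report-5.png
```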

app/backend/prepdocslib/filestrategy.py

Lines changed: 17 additions & 0 deletions
@@ -1,10 +1,13 @@
 import logging
 from typing import List, Optional
 
+from azure.core.credentials import AzureKeyCredential
+
 from .blobmanager import BlobManager
 from .embeddings import ImageEmbeddings, OpenAIEmbeddings
 from .fileprocessor import FileProcessor
 from .listfilestrategy import File, ListFileStrategy
+from .mediadescriber import ContentUnderstandingDescriber
 from .searchmanager import SearchManager, Section
 from .strategy import DocumentAction, SearchInfo, Strategy
 
@@ -50,6 +53,8 @@ def __init__(
         search_analyzer_name: Optional[str] = None,
         use_acls: bool = False,
         category: Optional[str] = None,
+        use_content_understanding: bool = False,
+        content_understanding_endpoint: Optional[str] = None,
     ):
         self.list_file_strategy = list_file_strategy
         self.blob_manager = blob_manager
@@ -61,6 +66,8 @@ def __init__(
         self.search_info = search_info
         self.use_acls = use_acls
         self.category = category
+        self.use_content_understanding = use_content_understanding
+        self.content_understanding_endpoint = content_understanding_endpoint
 
     async def setup(self):
         search_manager = SearchManager(
@@ -73,6 +80,16 @@ async def setup(self):
         )
         await search_manager.create_index()
 
+        if self.use_content_understanding:
+            if self.content_understanding_endpoint is None:
+                raise ValueError("Content Understanding is enabled but no endpoint was provided")
+            if isinstance(self.search_info.credential, AzureKeyCredential):
+                raise ValueError(
+                    "AzureKeyCredential is not supported for Content Understanding, use keyless auth instead"
+                )
+            cu_manager = ContentUnderstandingDescriber(self.content_understanding_endpoint, self.search_info.credential)
+            await cu_manager.create_analyzer()
+
     async def run(self):
         search_manager = SearchManager(
             self.search_info, self.search_analyzer_name, self.use_acls, False, self.embeddings
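
Note on the new guard in `setup()`: Content Understanding is only set up when the flag is on, an endpoint is present, and the credential is not key-based. The sketch below isolates that decision so it can be exercised without Azure packages. `FakeKeyCredential`, `FakeTokenCredential`, and `check_content_understanding_config` are hypothetical stand-ins, not names from the commit.

```python
from typing import Optional


class FakeKeyCredential:
    """Hypothetical stand-in for azure.core.credentials.AzureKeyCredential."""


class FakeTokenCredential:
    """Hypothetical stand-in for an async token credential (keyless auth)."""


def check_content_understanding_config(
    enabled: bool, endpoint: Optional[str], credential: object
) -> bool:
    """Return True when the analyzer should be created, raising on bad config."""
    if not enabled:
        return False
    if endpoint is None:
        raise ValueError("Content Understanding is enabled but no endpoint was provided")
    if isinstance(credential, FakeKeyCredential):
        raise ValueError("Key credentials are not supported for Content Understanding, use keyless auth instead")
    return True


# The endpoint value here is a placeholder, not a real resource.
print(check_content_understanding_config(True, "https://example.invalid", FakeTokenCredential()))
```

Failing early in `setup()` surfaces a misconfigured environment before any documents are parsed.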

app/backend/prepdocslib/mediadescriber.py

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+import logging
+from abc import ABC
+
+import aiohttp
+from azure.core.credentials_async import AsyncTokenCredential
+from azure.identity.aio import get_bearer_token_provider
+from rich.progress import Progress
+from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed
+
+logger = logging.getLogger("scripts")
+
+
+class MediaDescriber(ABC):
+
+    async def describe_image(self, image_bytes) -> str:
+        raise NotImplementedError  # pragma: no cover
+
+
+class ContentUnderstandingDescriber:
+    CU_API_VERSION = "2024-12-01-preview"
+
+    analyzer_schema = {
+        "analyzerId": "image_analyzer",
+        "name": "Image understanding",
+        "description": "Extract detailed structured information from images extracted from documents.",
+        "baseAnalyzerId": "prebuilt-image",
+        "scenario": "image",
+        "config": {"returnDetails": False},
+        "fieldSchema": {
+            "name": "ImageInformation",
+            "descriptions": "Description of image.",
+            "fields": {
+                "Description": {
+                    "type": "string",
+                    "description": "Description of the image. If the image has a title, start with the title. Include a 2-sentence summary. If the image is a chart, diagram, or table, include the underlying data in an HTML table tag, with accurate numbers. If the image is a chart, describe any axis or legends. The only allowed HTML tags are the table/thead/tr/td/tbody tags.",
+                },
+            },
+        },
+    }
+
+    def __init__(self, endpoint: str, credential: AsyncTokenCredential):
+        self.endpoint = endpoint
+        self.credential = credential
+
+    async def poll_api(self, session, poll_url, headers):
+
+        @retry(stop=stop_after_attempt(60), wait=wait_fixed(2), retry=retry_if_exception_type(ValueError))
+        async def poll():
+            async with session.get(poll_url, headers=headers) as response:
+                response.raise_for_status()
+                response_json = await response.json()
+                if response_json["status"] == "Failed":
+                    raise Exception("Failed")
+                if response_json["status"] == "Running":
+                    raise ValueError("Running")
+                return response_json
+
+        return await poll()
+
+    async def create_analyzer(self):
+        logger.info("Creating analyzer '%s'...", self.analyzer_schema["analyzerId"])
+
+        token_provider = get_bearer_token_provider(self.credential, "https://cognitiveservices.azure.com/.default")
+        token = await token_provider()
+        headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
+        params = {"api-version": self.CU_API_VERSION}
+        analyzer_id = self.analyzer_schema["analyzerId"]
+        cu_endpoint = f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_id}"
+        async with aiohttp.ClientSession() as session:
+            async with session.put(
+                url=cu_endpoint, params=params, headers=headers, json=self.analyzer_schema
+            ) as response:
+                if response.status == 409:
+                    logger.info("Analyzer '%s' already exists.", analyzer_id)
+                    return
+                elif response.status != 201:
+                    data = await response.text()
+                    raise Exception("Error creating analyzer", data)
+                else:
+                    poll_url = response.headers.get("Operation-Location")
+
+            with Progress() as progress:
+                progress.add_task("Creating analyzer...", total=None, start=False)
+                await self.poll_api(session, poll_url, headers)
+
+    async def describe_image(self, image_bytes: bytes) -> str:
+        logger.info("Sending image to Azure Content Understanding service...")
+        async with aiohttp.ClientSession() as session:
+            token = await self.credential.get_token("https://cognitiveservices.azure.com/.default")
+            headers = {"Authorization": "Bearer " + token.token}
+            params = {"api-version": self.CU_API_VERSION}
+            analyzer_name = self.analyzer_schema["analyzerId"]
+            async with session.post(
+                url=f"{self.endpoint}/contentunderstanding/analyzers/{analyzer_name}:analyze",
+                params=params,
+                headers=headers,
+                data=image_bytes,
+            ) as response:
+                response.raise_for_status()
+                poll_url = response.headers["Operation-Location"]
+
+                with Progress() as progress:
+                    progress.add_task("Processing...", total=None, start=False)
+                    results = await self.poll_api(session, poll_url, headers)
+
+                fields = results["result"]["contents"][0]["fields"]
+                return fields["Description"]["valueString"]
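
Note on `poll_api`: the commit implements polling by raising `ValueError` while the operation is "Running" and letting `tenacity` retry up to 60 times at 2-second intervals. The sketch below shows the same control flow as a plain `asyncio` loop instead of `tenacity`, so it runs with the standard library only. The names `poll_until_done` and `make_fake_fetch` are hypothetical, and the "Succeeded" status is an assumption; the diff itself only names "Failed" and "Running".

```python
import asyncio
from typing import Awaitable, Callable, List


async def poll_until_done(
    fetch_status: Callable[[], Awaitable[dict]],
    max_attempts: int = 60,
    delay_seconds: float = 2.0,
) -> dict:
    """Call fetch_status until the operation leaves the "Running" state."""
    for _ in range(max_attempts):
        body = await fetch_status()
        if body["status"] == "Failed":
            raise RuntimeError("operation failed")
        if body["status"] != "Running":
            # Any other status is treated as terminal, as in the diff.
            return body
        await asyncio.sleep(delay_seconds)
    raise TimeoutError(f"still running after {max_attempts} attempts")


def make_fake_fetch(statuses: List[str]) -> Callable[[], Awaitable[dict]]:
    """Hypothetical fake that replays a fixed list of statuses, one per call."""
    remaining = list(statuses)

    async def fetch() -> dict:
        return {"status": remaining.pop(0)}

    return fetch


result = asyncio.run(
    poll_until_done(make_fake_fetch(["Running", "Running", "Succeeded"]), delay_seconds=0)
)
print(result["status"])  # Succeeded
```

The explicit loop trades the decorator's brevity for a clearer timeout error; with `tenacity`, exhausting the attempts surfaces as a `RetryError` instead.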

app/backend/prepdocslib/page.py

Lines changed: 5 additions & 1 deletion
@@ -3,7 +3,7 @@ class Page:
     A single page from a document
 
     Attributes:
-        page_num (int): Page number
+        page_num (int): Page number (0-indexed)
         offset (int): If the text of the entire Document was concatenated into a single string, the index of the first character on the page. For example, if page 1 had the text "hello" and page 2 had the text "world", the offset of page 2 is 5 ("hellow")
         text (str): The text of the page
     """
@@ -17,6 +17,10 @@ def __init__(self, page_num: int, offset: int, text: str):
 class SplitPage:
     """
     A section of a page that has been split into a smaller chunk.
+
+    Attributes:
+        page_num (int): Page number (0-indexed)
+        text (str): The text of the section
     """
 
     def __init__(self, page_num: int, text: str):
