Commit c5ad17f

revise for comments
1 parent 77eaa49 commit c5ad17f

File tree

2 files changed (+29 −27 lines)

notebooks/content_understanding_pro_mode.ipynb

Lines changed: 16 additions & 15 deletions
@@ -13,10 +13,10 @@
 ">\n",
 "> #################################################################################\n",
 "\n",
-"This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning, and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). The pro mode allows input from multiple content files and includes the option to provide reference data at analyzer creation time.\n",
+"This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). Pro mode accepts input from multiple content files and includes the option to provide reference data at analyzer creation time.\n",
 "\n",
 "In this walkthrough, you'll learn how to:\n",
-"1. Create an analyzer with reference data.\n",
+"1. Create an analyzer with a schema and reference data.\n",
 "2. Analyze your files using Pro mode.\n",
 "\n",
 "For more details on Pro mode, see the [Azure AI Content Understanding: Standard and Pro Modes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/standard-pro-modes) documentation."
@@ -48,9 +48,9 @@
 "metadata": {},
 "source": [
 "## Analyzer template and local files setup\n",
-"- **analyzer_template**: In this sample we define an analyzer template for invoice contract verification.\n",
-"- **input_docs**: We can have multiple input document files in one folder or designate a single document file location \n",
-"- **reference_dos**: During analyzer creation, we can provide documents that can aid in providing context that references the service at inference time. We will get ocr results for these files if needed, generate a reference jsonl file, and upload these files to a designated Azure blob storage.\n",
+"- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.\n",
+"- **input_docs**: We can have multiple input document files in one folder or designate a single document file location.\n",
+"- **reference_docs**: During analyzer creation, we can provide documents that supply context the analyzer references at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure Blob Storage container.\n",
 "\n",
 "> For example, if you're looking to analyze invoices to ensure they're consistent with a contractual agreement, you can supply the invoice and other relevant documents (for example, a purchase order) as inputs, and supply the contract files as reference data. The service applies reasoning to validate the input documents according to your schema, which might be to identify discrepancies to flag for further review."
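
To make the invoice-contract scenario concrete, here is a minimal sketch of what such an analyzer template might look like. This is illustrative only: the sample ships its real template as a JSON file, and the key names and schema shape below are assumptions, not that file's contents.

```python
# Illustrative sketch only -- the sample's real template is a JSON file in the
# repo; the shape and field names here are assumptions, not its contents.
analyzer_template = {
    "description": "Verify invoices against their governing contract",  # assumed key
    "mode": "pro",  # Pro mode analyzer
    "fieldSchema": {  # assumed schema shape
        "fields": {
            "PaymentTermsInconsistency": {
                "type": "string",
                "description": "Describe any discrepancy between the invoice and the contract payment terms.",
            }
        }
    },
}
```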
@@ -72,7 +72,7 @@
 "metadata": {},
 "source": [
 "## Create Azure content understanding client\n",
-"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility Class which contain the functions to interact with the Content Understanding server. Before Content Understanding SDK release, we can regard it as a lightweight SDK. Fill the constant **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
+"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains the functions to interact with the Content Understanding service. Before the release of the Content Understanding SDK, please consider it a lightweight SDK. Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
 "\n",
 "> ⚠️ Important:\n",
 "You must update the code below to match your Azure authentication method.\n",
@@ -134,7 +134,7 @@
 "- Generate a reference `.jsonl` file.\n",
 "- Upload these files to the designated Azure blob storage.\n",
 "\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the prerequisite step."
+"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH**, which are set in the Prerequisites step."
 ]
 },
 {
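
Based on the `generate_knowledge_base_on_blob` signature visible in the client diff further down, this preparation step reduces to one async call. The folder path is a placeholder; `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` are the constants described above.

```python
import asyncio

# Signature matches generate_knowledge_base_on_blob in
# python/content_understanding_client.py (see the diff below).
asyncio.run(
    client.generate_knowledge_base_on_blob(
        reference_docs_folder="data/reference_docs",       # placeholder path
        storage_container_sas_url=REFERENCE_DOC_SAS_URL,   # from the .env file
        storage_container_path_prefix=REFERENCE_DOC_PATH,  # from the .env file
        skip_analyze=False,  # True if OCR .result.json files already exist
    )
)
```

Inside a notebook cell you would typically `await` the coroutine directly rather than wrap it in `asyncio.run`.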
@@ -155,9 +155,9 @@
 "metadata": {},
 "source": [
 "## Create analyzer with defined schema for Pro mode\n",
-"Before creating the custom fields analyzer, you should fill the constant ANALYZER_ID with a business-related name. Here we randomly generate a name for demo purpose.\n",
+"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a name relevant to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
 "\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in `.env` file and used in the previous step."
+"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH**, which are set up in the `.env` file and used in the previous step."
 ]
 },
 {
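
One common way to produce such a unique suffix is a short `uuid` fragment; this is a sketch of the idea, not necessarily the exact code in the cell.

```python
import uuid

# A short random suffix lets the creation cell run repeatedly without
# analyzer ID collisions; the base name is illustrative.
ANALYZER_ID = "invoice-contract-verifier-" + uuid.uuid4().hex[:8]
```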
@@ -177,21 +177,22 @@
 ")\n",
 "result = client.poll_result(response)\n",
 "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
-"    logging.info(f\"Here is the analyzer detail for {result['result']['analyzerId']}\")\n",
+"    logging.info(f\"Analyzer details for {result['result']['analyzerId']}\")\n",
 "    logging.info(json.dumps(result, indent=2))\n",
 "else:\n",
-"    logging.info(\n",
-"        \"Check your service please, may be some issues in configuration and deployment\"\n",
+"    logging.warning(\n",
+"        \"An issue was encountered when trying to create the analyzer. \"\n",
+"        \"Please double-check your deployment and configuration for potential problems.\"\n",
 "    )"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Use created analyzer to do data analysis of the input documents\n",
+"## Use the created analyzer to analyze the input documents\n",
 "After the analyzer is successfully created, we can use it to analyze our input files.\n",
-"> NOTE: Pro mode does multi-step reasoning and may take longer time to analyze."
+"> NOTE: Pro mode performs multi-step reasoning and may take longer to analyze."
 ]
 },
 {
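
Putting the pieces together, analysis with the new analyzer uses `begin_analyze` and `poll_result`, both visible in this commit; the input path below is a placeholder.

```python
# begin_analyze accepts a single file or, in Pro mode, a folder of inputs
# (see the client diff below); poll_result waits for the job to finish.
response = client.begin_analyze(
    analyzer_id=ANALYZER_ID,
    file_location="data/input_docs",  # placeholder; a folder means multiple inputs
)
result = client.poll_result(response)  # Pro mode jobs can take a while
```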
@@ -211,7 +212,7 @@
 "metadata": {},
 "source": [
 "## Delete existing analyzer in Content Understanding Service\n",
-"This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. The custom fields analyzer could be stored in your service for reusing by subsequent business in real usage scenarios."
+"This snippet is optional; it simply prevents the test analyzer from lingering in your service. Without deletion, the analyzer remains in your service for subsequent reuse."
 ]
 },
 {
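
A cleanup call would look roughly like this; note that the deletion helper's name is an assumption, since it does not appear in this diff.

```python
# Hypothetical helper name -- a deletion method is implied by this section
# but not shown in the diff; verify against the client class.
client.delete_analyzer(ANALYZER_ID)
```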

python/content_understanding_client.py

Lines changed: 13 additions & 12 deletions
@@ -15,8 +15,8 @@
 class AzureContentUnderstandingClient:
 
     PREBUILT_DOCUMENT_ANALYZER_ID: str = "prebuilt-documentAnalyzer"
-    RESULT_SUFFIX: str = ".result.json"
-    SOURCES_JSONL: str = "sources.jsonl"
+    OCR_RESULT_FILE_SUFFIX: str = ".result.json"
+    KNOWLEDGE_SOURCE_LIST_FILE_NAME: str = "sources.jsonl"
 
     # https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits#document-and-text
     SUPPORTED_FILE_TYPES: List[str] = [
@@ -98,7 +98,7 @@ def _get_pro_mode_reference_docs_config(
             "kind": "reference",
             "containerUrl": storage_container_sas_url,
             "prefix": storage_container_path_prefix,
-            "fileListPath": self.SOURCES_JSONL,
+            "fileListPath": self.KNOWLEDGE_SOURCE_LIST_FILE_NAME,
         }]
 
     def _get_classifier_url(self, endpoint: str, api_version: str, classifier_id: str) -> str:
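
With the rename applied, the knowledge-source configuration this helper returns has the following shape (all values here are placeholders):

```python
# Shape of the list returned by _get_pro_mode_reference_docs_config;
# values are placeholders.
knowledge_sources = [{
    "kind": "reference",
    "containerUrl": "https://<account>.blob.core.windows.net/<container>?<sas-token>",
    "prefix": "pro_mode_reference/",
    "fileListPath": "sources.jsonl",  # i.e. KNOWLEDGE_SOURCE_LIST_FILE_NAME
}]
```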
@@ -133,7 +133,7 @@ def is_supported_type_by_file_path(file_path: Path, is_pro_mode: bool=False) ->
 
         Args:
             file_path (Path): The path to the file to check.
-            is_pro_mode (bool): If True, checks against pro mode supported file types.
+            is_pro_mode (bool): If True, checks against Pro mode supported file types.
 
         Returns:
             bool: True if the file type is supported, False otherwise.
@@ -154,7 +154,7 @@ def is_supported_type_by_file_ext(file_ext: str, is_pro_mode: bool=False) -> boo
 
         Args:
             file_ext (str): The file extension to check.
-            is_pro_mode (bool): If True, checks against pro mode supported file types.
+            is_pro_mode (bool): If True, checks against Pro mode supported file types.
 
         Returns:
             bool: True if the file type is supported, False otherwise.
@@ -311,7 +311,7 @@ def begin_analyze(self, analyzer_id: str, file_location: str) -> Response:
         file_path = Path(file_location)
         if file_path.exists():
             if file_path.is_dir():
-                # Only pro mode supports multiple input files
+                # Only Pro mode supports multiple input files
                 data = {
                     "inputs": [
                         {
@@ -359,7 +359,7 @@ def begin_analyze(self, analyzer_id: str, file_location: str) -> Response:
         )
         return response
 
-    def get_analyze_result(self, file_location: str) -> Dict[str, Any]:
+    def get_prebuilt_document_analyze_result(self, file_location: str) -> Dict[str, Any]:
         response = self.begin_analyze(
             analyzer_id=self.PREBUILT_DOCUMENT_ANALYZER_ID,
             file_location=file_location,
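
The rename makes call sites self-describing; a minimal usage sketch (the file path is assumed):

```python
# Runs the prebuilt document analyzer (OCR) over one file and returns the result dict.
ocr_result = client.get_prebuilt_document_analyze_result("reference_docs/contract.pdf")  # path assumed
```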
@@ -392,25 +392,25 @@ async def upload_jsonl_to_blob(
 
     async def generate_knowledge_base_on_blob(
         self,
-        referemce_docs_folder: str,
+        reference_docs_folder: str,
         storage_container_sas_url: str,
         storage_container_path_prefix: str,
         skip_analyze: bool = False,
     ) -> None:
         container_client = ContainerClient.from_container_url(storage_container_sas_url)
         resources = []
-        for dirpath, _, filenames in os.walk(referemce_docs_folder):
+        for dirpath, _, filenames in os.walk(reference_docs_folder):
             for filename in filenames:
                 filename_no_ext, file_ext = os.path.splitext(filename)
                 if self.is_supported_type_by_file_ext(file_ext, is_pro_mode=True):
                     file_path = os.path.join(dirpath, filename)
-                    result_file_name = filename_no_ext + self.RESULT_SUFFIX
+                    result_file_name = filename_no_ext + self.OCR_RESULT_FILE_SUFFIX
                     result_file_blob_path = storage_container_path_prefix + result_file_name
                     # Get and upload result.json
                     if not skip_analyze:
                         self._logger.info(f"Analyzing result for {filename}")
                         try:
-                            analyze_result = self.get_analyze_result(file_path)
+                            analyze_result = self.get_prebuilt_document_analyze_result(file_path)
                         except Exception as e:
                             self._logger.error(f"Error of getting analyze result of {filename}: {e}")
                             continue
@@ -427,7 +427,8 @@ async def generate_knowledge_base_on_blob(
                     await self._upload_file_to_blob(container_client, file_path, file_blob_path)
                     resources.append({"file": filename, "resultFile": result_file_name})
         # Upload sources.jsonl
-        await self.upload_jsonl_to_blob(container_client, resources, storage_container_path_prefix + self.SOURCES_JSONL)
+        await self.upload_jsonl_to_blob(
+            container_client, resources, storage_container_path_prefix + self.KNOWLEDGE_SOURCE_LIST_FILE_NAME)
         await container_client.close()
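For reference, the `resources` list assembled above serializes into `sources.jsonl` as one JSON object per line; a sketch with placeholder file names:

```python
import json

# Mirrors resources.append({"file": ..., "resultFile": ...}) above;
# file names are placeholders.
resources = [
    {"file": "contract.pdf", "resultFile": "contract.result.json"},
    {"file": "purchase_order.pdf", "resultFile": "purchase_order.result.json"},
]
sources_jsonl = "\n".join(json.dumps(r) for r in resources)  # body of sources.jsonl
```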