Commit c5ad17f

revise for comments
1 parent 77eaa49 commit c5ad17f

File tree

2 files changed (+29 −27 lines)

notebooks/content_understanding_pro_mode.ipynb

Lines changed: 16 additions & 15 deletions
@@ -13,10 +13,10 @@
 ">\n",
 "> #################################################################################\n",
 "\n",
-"This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning, and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). The pro mode allows input from multiple content files and includes the option to provide reference data at analyzer creation time.\n",
+"This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). Pro mode accepts input from multiple content files and includes the option to provide reference data at analyzer creation time.\n",
 "\n",
 "In this walkthrough, you'll learn how to:\n",
-"1. Create an analyzer with reference data.\n",
+"1. Create an analyzer with a schema and reference data.\n",
 "2. Analyze your files using Pro mode.\n",
 "\n",
 "For more details on Pro mode, see the [Azure AI Content Understanding: Standard and Pro Modes](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/standard-pro-modes) documentation."
@@ -48,9 +48,9 @@
 "metadata": {},
 "source": [
 "## Analyzer template and local files setup\n",
-"- **analyzer_template**: In this sample we define an analyzer template for invoice contract verification.\n",
-"- **input_docs**: We can have multiple input document files in one folder or designate a single document file location \n",
-"- **reference_dos**: During analyzer creation, we can provide documents that can aid in providing context that references the service at inference time. We will get ocr results for these files if needed, generate a reference jsonl file, and upload these files to a designated Azure blob storage.\n",
+"- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.\n",
+"- **input_docs**: We can have multiple input document files in one folder or designate a single document file location.\n",
+"- **reference_docs**: During analyzer creation, we can provide documents that supply context the analyzer references at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure Blob Storage container.\n",
 "\n",
 "> For example, if you're looking to analyze invoices to ensure they're consistent with a contractual agreement, you can supply the invoice and other relevant documents (for example, a purchase order) as inputs, and supply the contract files as reference data. The service applies reasoning to validate the input documents according to your schema, which might be to identify discrepancies to flag for further review."
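
To make the invoice-contract scenario concrete, here is a minimal sketch of what such an analyzer template might look like. This is illustrative only: the sample ships its real template as a JSON file, and the key names and schema shape below are assumptions, not that file's contents.

```python
# Illustrative sketch only -- the sample's real template is a JSON file in the
# repo; the shape and field names here are assumptions, not its contents.
analyzer_template = {
    "description": "Verify invoices against their governing contract",  # assumed key
    "mode": "pro",  # Pro mode analyzer
    "fieldSchema": {  # assumed schema shape
        "fields": {
            "PaymentTermsInconsistency": {
                "type": "string",
                "description": "Describe any discrepancy between the invoice and the contract payment terms.",
            }
        }
    },
}
```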
@@ -72,7 +72,7 @@
 "metadata": {},
 "source": [
 "## Create Azure content understanding client\n",
-"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility Class which contain the functions to interact with the Content Understanding server. Before Content Understanding SDK release, we can regard it as a lightweight SDK. Fill the constant **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
+"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains the functions to interact with the Content Understanding service. Before the release of the Content Understanding SDK, please consider it a lightweight SDK. Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
 "\n",
 "> ⚠️ Important:\n",
 "You must update the code below to match your Azure authentication method.\n",
@@ -134,7 +134,7 @@
 "- Generate a reference `.jsonl` file.\n",
 "- Upload these files to the designated Azure blob storage.\n",
 "\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the prerequisite step."
+"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH**, which are set in the Prerequisites step."
 ]
 },
 {
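
Based on the `generate_knowledge_base_on_blob` signature visible in the client diff further down, this preparation step reduces to one async call. The folder path is a placeholder; `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` are the constants described above.

```python
import asyncio

# Signature matches generate_knowledge_base_on_blob in
# python/content_understanding_client.py (see the diff below).
asyncio.run(
    client.generate_knowledge_base_on_blob(
        reference_docs_folder="data/reference_docs",       # placeholder path
        storage_container_sas_url=REFERENCE_DOC_SAS_URL,   # from the .env file
        storage_container_path_prefix=REFERENCE_DOC_PATH,  # from the .env file
        skip_analyze=False,  # True if OCR .result.json files already exist
    )
)
```

Inside a notebook cell you would typically `await` the coroutine directly rather than wrap it in `asyncio.run`.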
@@ -155,9 +155,9 @@
 "metadata": {},
 "source": [
 "## Create analyzer with defined schema for Pro mode\n",
-"Before creating the custom fields analyzer, you should fill the constant ANALYZER_ID with a business-related name. Here we randomly generate a name for demo purpose.\n",
+"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a name relevant to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
 "\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in `.env` file and used in the previous step."
+"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH**, which are set up in the `.env` file and used in the previous step."
 ]
 },
 {
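
One common way to produce such a unique suffix is a short `uuid` fragment; this is a sketch of the idea, not necessarily the exact code in the cell.

```python
import uuid

# A short random suffix lets the creation cell run repeatedly without
# analyzer ID collisions; the base name is illustrative.
ANALYZER_ID = "invoice-contract-verifier-" + uuid.uuid4().hex[:8]
```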
@@ -177,21 +177,22 @@
 ")\n",
 "result = client.poll_result(response)\n",
 "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
-"    logging.info(f\"Here is the analyzer detail for {result['result']['analyzerId']}\")\n",
+"    logging.info(f\"Analyzer details for {result['result']['analyzerId']}\")\n",
 "    logging.info(json.dumps(result, indent=2))\n",
 "else:\n",
-"    logging.info(\n",
-"        \"Check your service please, may be some issues in configuration and deployment\"\n",
+"    logging.warning(\n",
+"        \"An issue was encountered when trying to create the analyzer. \"\n",
+"        \"Please double-check your deployment and configuration for potential problems.\"\n",
 "    )"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Use created analyzer to do data analysis of the input documents\n",
+"## Use the created analyzer to analyze the input documents\n",
 "After the analyzer is successfully created, we can use it to analyze our input files.\n",
-"> NOTE: Pro mode does multi-step reasoning and may take longer time to analyze."
+"> NOTE: Pro mode performs multi-step reasoning and may take longer to analyze."
 ]
 },
 {
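
Putting the pieces together, analysis with the new analyzer uses `begin_analyze` and `poll_result`, both visible in this commit; the input path below is a placeholder.

```python
# begin_analyze accepts a single file or, in Pro mode, a folder of inputs
# (see the client diff below); poll_result waits for the job to finish.
response = client.begin_analyze(
    analyzer_id=ANALYZER_ID,
    file_location="data/input_docs",  # placeholder; a folder means multiple inputs
)
result = client.poll_result(response)  # Pro mode jobs can take a while
```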
@@ -211,7 +212,7 @@
 "metadata": {},
 "source": [
 "## Delete existing analyzer in Content Understanding Service\n",
-"This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. The custom fields analyzer could be stored in your service for reusing by subsequent business in real usage scenarios."
+"This snippet is optional; it simply prevents the test analyzer from lingering in your service. Without deletion, the analyzer remains in your service for subsequent reuse."
 ]
 },
 {
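
A cleanup call would look roughly like this; note that the deletion helper's name is an assumption, since it does not appear in this diff.

```python
# Hypothetical helper name -- a deletion method is implied by this section
# but not shown in the diff; verify against the client class.
client.delete_analyzer(ANALYZER_ID)
```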

python/content_understanding_client.py

Lines changed: 13 additions & 12 deletions
@@ -15,8 +15,8 @@
 class AzureContentUnderstandingClient:
 
     PREBUILT_DOCUMENT_ANALYZER_ID: str = "prebuilt-documentAnalyzer"
-    RESULT_SUFFIX: str = ".result.json"
-    SOURCES_JSONL: str = "sources.jsonl"
+    OCR_RESULT_FILE_SUFFIX: str = ".result.json"
+    KNOWLEDGE_SOURCE_LIST_FILE_NAME: str = "sources.jsonl"
 
     # https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits#document-and-text
     SUPPORTED_FILE_TYPES: List[str] = [
@@ -98,7 +98,7 @@ def _get_pro_mode_reference_docs_config(
             "kind": "reference",
             "containerUrl": storage_container_sas_url,
             "prefix": storage_container_path_prefix,
-            "fileListPath": self.SOURCES_JSONL,
+            "fileListPath": self.KNOWLEDGE_SOURCE_LIST_FILE_NAME,
         }]
 
     def _get_classifier_url(self, endpoint: str, api_version: str, classifier_id: str) -> str:
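
With the rename applied, the knowledge-source configuration this helper returns has the following shape (all values here are placeholders):

```python
# Shape of the list returned by _get_pro_mode_reference_docs_config;
# values are placeholders.
knowledge_sources = [{
    "kind": "reference",
    "containerUrl": "https://<account>.blob.core.windows.net/<container>?<sas-token>",
    "prefix": "pro_mode_reference/",
    "fileListPath": "sources.jsonl",  # i.e. KNOWLEDGE_SOURCE_LIST_FILE_NAME
}]
```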
@@ -133,7 +133,7 @@ def is_supported_type_by_file_path(file_path: Path, is_pro_mode: bool=False) ->
 
         Args:
             file_path (Path): The path to the file to check.
-            is_pro_mode (bool): If True, checks against pro mode supported file types.
+            is_pro_mode (bool): If True, checks against Pro mode supported file types.
 
         Returns:
             bool: True if the file type is supported, False otherwise.
@@ -154,7 +154,7 @@ def is_supported_type_by_file_ext(file_ext: str, is_pro_mode: bool=False) -> boo
 
         Args:
             file_ext (str): The file extension to check.
-            is_pro_mode (bool): If True, checks against pro mode supported file types.
+            is_pro_mode (bool): If True, checks against Pro mode supported file types.
 
         Returns:
             bool: True if the file type is supported, False otherwise.
@@ -311,7 +311,7 @@ def begin_analyze(self, analyzer_id: str, file_location: str) -> Response:
         file_path = Path(file_location)
         if file_path.exists():
             if file_path.is_dir():
-                # Only pro mode supports multiple input files
+                # Only Pro mode supports multiple input files
                 data = {
                     "inputs": [
                         {
@@ -359,7 +359,7 @@ def begin_analyze(self, analyzer_id: str, file_location: str) -> Response:
         )
         return response
 
-    def get_analyze_result(self, file_location: str) -> Dict[str, Any]:
+    def get_prebuilt_document_analyze_result(self, file_location: str) -> Dict[str, Any]:
         response = self.begin_analyze(
             analyzer_id=self.PREBUILT_DOCUMENT_ANALYZER_ID,
             file_location=file_location,
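
The rename makes call sites self-describing; a minimal usage sketch (the file path is assumed):

```python
# Runs the prebuilt document analyzer (OCR) over one file and returns the result dict.
ocr_result = client.get_prebuilt_document_analyze_result("reference_docs/contract.pdf")  # path assumed
```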
@@ -392,25 +392,25 @@ async def upload_jsonl_to_blob(
 
     async def generate_knowledge_base_on_blob(
         self,
-        referemce_docs_folder: str,
+        reference_docs_folder: str,
         storage_container_sas_url: str,
         storage_container_path_prefix: str,
         skip_analyze: bool = False,
     ) -> None:
         container_client = ContainerClient.from_container_url(storage_container_sas_url)
         resources = []
-        for dirpath, _, filenames in os.walk(referemce_docs_folder):
+        for dirpath, _, filenames in os.walk(reference_docs_folder):
             for filename in filenames:
                 filename_no_ext, file_ext = os.path.splitext(filename)
                 if self.is_supported_type_by_file_ext(file_ext, is_pro_mode=True):
                     file_path = os.path.join(dirpath, filename)
-                    result_file_name = filename_no_ext + self.RESULT_SUFFIX
+                    result_file_name = filename_no_ext + self.OCR_RESULT_FILE_SUFFIX
                     result_file_blob_path = storage_container_path_prefix + result_file_name
                     # Get and upload result.json
                     if not skip_analyze:
                         self._logger.info(f"Analyzing result for {filename}")
                         try:
-                            analyze_result = self.get_analyze_result(file_path)
+                            analyze_result = self.get_prebuilt_document_analyze_result(file_path)
                         except Exception as e:
                             self._logger.error(f"Error of getting analyze result of {filename}: {e}")
                             continue
@@ -427,7 +427,8 @@ async def generate_knowledge_base_on_blob(
                     await self._upload_file_to_blob(container_client, file_path, file_blob_path)
                     resources.append({"file": filename, "resultFile": result_file_name})
         # Upload sources.jsonl
-        await self.upload_jsonl_to_blob(container_client, resources, storage_container_path_prefix + self.SOURCES_JSONL)
+        await self.upload_jsonl_to_blob(
+            container_client, resources, storage_container_path_prefix + self.KNOWLEDGE_SOURCE_LIST_FILE_NAME)
         await container_client.close()
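For reference, the `resources` list assembled above serializes into `sources.jsonl` as one JSON object per line; a sketch with placeholder file names:

```python
import json

# Mirrors resources.append({"file": ..., "resultFile": ...}) above;
# file names are placeholders.
resources = [
    {"file": "contract.pdf", "resultFile": "contract.result.json"},
    {"file": "purchase_order.pdf", "resultFile": "purchase_order.result.json"},
]
sources_jsonl = "\n".join(json.dumps(r) for r in resources)  # body of sources.jsonl
```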