Commit 4eafde9

clarify strengths and expectations in the sample
1 parent 7e0cf93 commit 4eafde9

File tree: 1 file changed (+92, -22 lines)

notebooks/field_extraction_pro_mode.ipynb

@@ -29,8 +29,9 @@
     "## Prerequisites\n",
     "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
     "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.\n",
-    "   - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container. \n",
-    "   - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs. \n",
+    "   - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.\n",
+    "   - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs.\n",
+    "   > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.\n",
     "1. Install the required packages to run the sample."
    ]
   },
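
For orientation, a minimal sketch of how these two variables might be loaded inside the notebook. It assumes `python-dotenv` reads the [.env](./.env) file; the library choice and the fallback message are illustrative, not taken from this commit.

```python
# Illustrative sketch (assumes python-dotenv; not part of this commit)
import os

from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file into the environment

# SAS URL for your Azure Blob container
REFERENCE_DOC_SAS_URL = os.getenv("REFERENCE_DOC_SAS_URL")
# Folder path within the container for uploading reference docs
REFERENCE_DOC_PATH = os.getenv("REFERENCE_DOC_PATH")

# Reference documents are optional in Pro mode, so missing values are not fatal
if not (REFERENCE_DOC_SAS_URL and REFERENCE_DOC_PATH):
    print("No reference doc settings found; Pro mode will run on input documents only.")
```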
@@ -50,7 +51,7 @@
     "## Analyzer template and local files setup\n",
     "- **analyzer_template**: In this sample we define an analyzer template for invoice-contract verification.\n",
     "- **input_docs**: We can have multiple input document files in one folder or designate a single document file location. \n",
-    "- **reference_docs**: During analyzer creation, we can provide documents that can aid in providing context that the analyzer references at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure blob storage.\n",
+    "- **reference_docs (Optional)**: During analyzer creation, we can provide documents that supply context for the analyzer to reference at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure blob storage.\n",
     "\n",
     "> For example, if you're looking to analyze invoices to ensure they're consistent with a contractual agreement, you can supply the invoice and other relevant documents (for example, a purchase order) as inputs, and supply the contract files as reference data. The service applies reasoning to validate the input documents according to your schema, which might be to identify discrepancies to flag for further review."
    ]
@@ -64,6 +65,8 @@
     "# Define paths for analyzer template, input documents, and reference documents\n",
     "analyzer_template = \"../analyzer_templates/invoice_contract_verification_pro_mode.json\"\n",
     "input_docs = \"../data/field_extraction_pro_mode/invoice_contract_verification/input_docs\"\n",
+    "\n",
+    "# NOTE: Reference documents are optional in Pro mode. You can comment out the line below if you are not using reference documents.\n",
     "reference_docs = \"../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs\""
    ]
   },
@@ -158,7 +161,8 @@
     "- Generate a reference `.jsonl` file.\n",
     "- Upload these files to the designated Azure blob storage.\n",
     "\n",
-    "We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the Prerequisites step."
+    "We use the **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that were set in the Prerequisites step.\n",
+    "\n"
    ]
   },
   {
@@ -169,11 +173,26 @@
    "source": [
     "# Load reference storage configuration from environment\n",
     "REFERENCE_DOC_SAS_URL = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n",
-    "REFERENCE_DOC_PATH = os.getenv(\"REFERENCE_DOC_PATH\")\n",
-    "\n",
+    "REFERENCE_DOC_PATH = os.getenv(\"REFERENCE_DOC_PATH\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data. Please skip or comment out the section below if you are not preparing reference documents."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "# Set skip_analyze to True if you already have OCR results for the documents in the reference_docs folder\n",
     "# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix \".result.json\"\n",
     "# For example, if the original document is \"invoice.pdf\", the OCR result file should be named \"invoice.pdf.result.json\"\n",
+    "# NOTE: Please comment out the following line if you don't have any reference documents.\n",
     "await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)"
    ]
   },
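
The `.result.json` naming convention above suggests a simple pre-check. Below is a hypothetical way to derive `skip_analyze` automatically; the `has_precomputed_ocr` helper is illustrative and not part of the sample notebook.

```python
# Hypothetical helper built around the "<filename>.result.json" convention
# described above; not part of the sample notebook.
from pathlib import Path


def has_precomputed_ocr(reference_dir: str) -> bool:
    """True if every reference document has a sibling OCR result file."""
    docs = [
        p for p in Path(reference_dir).iterdir()
        if p.is_file() and not p.name.endswith(".result.json")
    ]
    # e.g. "invoice.pdf" must sit next to "invoice.pdf.result.json"
    return bool(docs) and all(
        (p.parent / (p.name + ".result.json")).exists() for p in docs
    )


# Skip the OCR step only when results already exist for every reference document
await client.generate_knowledge_base_on_blob(
    reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH,
    skip_analyze=has_precomputed_ocr(reference_docs),
)
```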
@@ -291,7 +310,14 @@
    "metadata": {},
    "source": [
     "## Bonus sample\n",
-    "We would like to introduce another sample with multiple inputs."
+    "We would like to introduce another sample to highlight how Pro mode supports multi-document input and advanced reasoning. Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents within a single analysis call. With Pro mode, the service not only processes each document independently, but also cross-references the documents to perform reasoning across them, enabling deeper insights and validation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### First, we need to set up variables for the second sample"
    ]
   },
   {
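
To make the cross-document behavior concrete, here is a condensed sketch of the flow the remaining hunks assemble, using the client calls that appear below; the analyzer ID, template arguments, and paths are illustrative.

```python
# Condensed sketch of the second sample's flow; identifiers are illustrative
import logging
import uuid

analyzer_id = "pro-mode-sample-" + str(uuid.uuid4())

# Create the Pro mode analyzer from the second sample's template
response = client.begin_create_analyzer(
    analyzer_id, analyzer_template_path=analyzer_template_2
)
client.poll_result(response)

# All files in input_docs_2 are submitted in a single analysis call,
# so the service can cross-reference them while reasoning
response = client.begin_analyze(analyzer_id, file_location=input_docs_2)
result_json = client.poll_result(response, timeout_seconds=600)  # Pro mode runs longer
logging.info("Cross-document analysis complete.")
```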
@@ -300,8 +326,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# First, we need to set up variables for the second sample\n",
-    "\n",
     "# Define paths for analyzer template, input documents, and reference documents of the second sample\n",
     "analyzer_template_2 = \"../analyzer_templates/insurance_claims_review_pro_mode.json\"\n",
     "input_docs_2 = \"../data/field_extraction_pro_mode/insurance_claims_review/input_docs\"\n",
@@ -310,14 +334,41 @@
     "# Load reference storage configuration from environment\n",
     "REFERENCE_DOC_SAS_URL_2 = os.getenv(\"REFERENCE_DOC_SAS_URL\") # Reuse the same blob container\n",
     "REFERENCE_DOC_PATH_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\" # NOTE: Use a different path for the second sample\n",
-    "CUSTOM_ANALYZER_ID_2 = \"pro-mode-sample-\" + str(uuid.uuid4())\n",
-    "\n",
-    "# Let's try reference docuemnts with existing OCR results for the second sample\n",
+    "CUSTOM_ANALYZER_ID_2 = \"pro-mode-sample-\" + str(uuid.uuid4())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Generate knowledge base for the second sample\n",
+    "Let's try reference documents with existing OCR results for the second sample."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "logging.info(\"Start generating knowledge base for the second sample...\")\n",
-    "await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)\n",
-    "\n",
-    "# We can reuse previous AzureContentUnderstandingClient\n",
-    "logging.info(\"Start creating analyzer for the second sample...\")\n",
+    "await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create analyzer for the second sample\n",
+    "We can reuse the previous AzureContentUnderstandingClient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "response = client.begin_create_analyzer(\n",
     "    CUSTOM_ANALYZER_ID_2,\n",
     "    analyzer_template_path=analyzer_template_2,\n",
@@ -332,9 +383,22 @@
     "    logging.warning(\n",
     "        \"An issue was encountered when trying to create the analyzer. \"\n",
     "        \"Please double-check your deployment and configurations for potential problems.\"\n",
-    "    )\n",
-    "\n",
-    "# Analyze the multiple input documents with the second analyzer\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Analyze multiple input documents with the second analyzer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "logging.info(\"Start analyzing input documents for the second sample...\")\n",
     "response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)\n",
     "result_json = client.poll_result(response, timeout_seconds=600) # set a longer timeout for pro mode\n",
@@ -348,14 +412,14 @@
     "        json.dump(result_json, file, indent=2)\n",
     "\n",
     "logging.info(f\"Full analyzer result saved to: {output_path}\")\n",
-    "display(FileLink(output_path))\n"
+    "display(FileLink(output_path))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "> Let's take a deeper look at `LineItemCorroboration` field in the result"
+    "### Let's take a deeper look at the `LineItemCorroboration` field in the result"
    ]
   },
   {
@@ -375,13 +439,19 @@
     "> In the LineItemCorroboration field, we see that each line item is extracted with its corresponding information, claim status, and evidence. Items that are not covered by the policy, such as the Starbucks drink and hotel stay, are not confirmed, while damage repairs that are supported by the supplied documents in the claim and are permitted by the policy are confirmed."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### [Optional] Delete the analyzer for the second sample after use"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "# [Optional] Delete the analyzer for second sample after use\n",
     "client.delete_analyzer(CUSTOM_ANALYZER_ID_2)"
    ]
   }
