diff --git a/notebooks/field_extraction_pro_mode.ipynb b/notebooks/field_extraction_pro_mode.ipynb index 45a5044..138986d 100644 --- a/notebooks/field_extraction_pro_mode.ipynb +++ b/notebooks/field_extraction_pro_mode.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Conduct complex analysis with Pro mode\n", + "# Conduct Complex Analysis with Pro Mode\n", "\n", "> #################################################################################\n", ">\n", @@ -13,7 +13,7 @@ ">\n", "> #################################################################################\n", "\n", - "This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning, and complex decision-making (for instance, identifying inconsistencies, drawing inferences, and making sophisticated decisions). Pro mode allows input from multiple content files and includes the option to provide reference data at analyzer creation time.\n", + "This notebook demonstrates how to use **Pro mode** in Azure AI Content Understanding to enhance your analyzer with multiple inputs and optional reference data. Pro mode is designed for advanced use cases, particularly those requiring multi-step reasoning and complex decision-making (for example, identifying inconsistencies, drawing inferences, and making sophisticated decisions). Pro mode allows input from multiple content files and includes the option to provide reference data at analyzer creation time.\n", "\n", "In this walkthrough, you'll learn how to:\n", "1. Create an analyzer with a schema and reference data.\n", @@ -27,13 +27,13 @@ "metadata": {}, "source": [ "## Prerequisites\n", - "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n", - "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up reference document related environment variables in the [.env](./.env) file.\n", - " - You can either set `REFERENCE_DOC_SAS_URL` directly with the SAS URL for your Azure Blob container,\n", - " - Or set both `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n", - " - Also set `REFERENCE_DOC_PATH` to specify the folder path within the container where reference documents will be uploaded.\n", - " > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.\n", - "1. Install the required packages to run the sample." + "1. Ensure the Azure AI service is configured by following the [setup steps](../README.md#configure-azure-ai-service-resource).\n", + "2. 
If using reference documents, follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to configure reference document environment variables in the [.env](./.env) file.\n", + " - You can set `REFERENCE_DOC_SAS_URL` directly with the SAS URL for your Azure Blob container.\n", + " - Alternatively, set both `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME` so that the SAS URL can be generated automatically during a later step.\n", + " - Also, set `REFERENCE_DOC_PATH` to specify the folder path within the container where reference documents will be uploaded.\n", + " > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. For example, the service can reason across two or more input files without any reference data.\n", + "3. Install the required packages to run the sample." ] }, { @@ -49,12 +49,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Analyzer template and local files setup\n", - "- **analyzer_template**: In this sample we define an analyzer template for invoice-contract verification.\n", - "- **input_docs**: We can have multiple input document files in one folder or designate a single document file location. \n", - "- **reference_docs(Optional)**: During analyzer creation, we can provide documents that can aid in providing context that the analyzer references at inference time. We will get OCR results for these files if needed, generate a reference JSONL file, and upload these files to a designated Azure blob storage.\n", + "## Analyzer Template and Local Files Setup\n", + "- **analyzer_template**: In this sample, we define an analyzer template for invoice-contract verification.\n", + "- **input_docs**: You can have multiple input document files in one folder or specify a single document file location.\n", + "- **reference_docs (Optional)**: During analyzer creation, you can provide documents that aid in providing contextual information for the analyzer to reference during inference. OCR results will be extracted from these files if needed, a reference JSONL file will be generated, and these files will be uploaded to a designated Azure Blob Storage container.\n", "\n", - "> For example, if you're looking to analyze invoices to ensure they're consistent with a contractual agreement, you can supply the invoice and other relevant documents (for example, a purchase order) as inputs, and supply the contract files as reference data. The service applies reasoning to validate the input documents according to your schema, which might be to identify discrepancies to flag for further review." + "> For example, if you're analyzing invoices to ensure their consistency with a contractual agreement, you can supply the invoice and other relevant documents (e.g., a purchase order) as inputs, and provide the contract files as reference data. The service applies reasoning to validate the input documents against your schema, which might include identifying discrepancies to flag for further review." ] }, { @@ -67,7 +67,7 @@ "analyzer_template = \"../analyzer_templates/invoice_contract_verification_pro_mode.json\"\n", "input_docs = \"../data/field_extraction_pro_mode/invoice_contract_verification/input_docs\"\n", "\n", - "# NOTE: Reference documents are optional in Pro mode. Can comment out below line if not using reference documents.\n", + "# NOTE: Reference documents are optional in Pro mode. 
Comment out the line below if not using reference documents.\n", "reference_docs = \"../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs\"" ] }, @@ -75,7 +75,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Let's take a look at the analyzer template of Pro mode" + "> Let's examine the Pro mode analyzer template." ] }, { @@ -93,22 +93,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> In the analyzer, `\"mode\"` needs to be in `\"pro\"`. The defined field - \"PaymentTermsInconsistencies\" is a `\"generate\"` field and is asked to reason about inconsistency, and will be able to use referenced documents to be uploaded in [reference docs](../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs)" + "> In the analyzer, the field `\"mode\"` must be set to `\"pro\"`. The defined field `\"PaymentTermsInconsistencies\"` is a `\"generate\"` field designed to reason about inconsistencies. It can utilize the referenced documents uploaded in [reference docs](../data/field_extraction_pro_mode/invoice_contract_verification/reference_docs)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Create Azure content understanding client\n", - "> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility class that contains the functions, Before the release of the Content Understanding SDK, please consider it a lightweight SDK., Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n", + "## Create Azure Content Understanding Client\n", + "> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class containing client functions. Before the official release of the Content Understanding SDK, consider it a lightweight SDK. Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with your Azure AI Service details.\n", "\n", "> ⚠️ Important:\n", "You must update the code below to match your Azure authentication method.\n", "Look for the `# IMPORTANT` comments and modify those sections accordingly.\n", "If you skip this step, the sample may not run correctly.\n", "\n", - "> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments." + "> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is recommended and safer for production environments." 
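+ "\n",
+ "> For orientation, a minimal sketch of the key-based variant of the construction in the next cell (the `endpoint` parameter name is assumed from the full cell; `api_version` and `subscription_key` appear below):\n",
+ "\n",
+ "```python\n",
+ "# Hedged sketch: key-based auth; omit token_provider and pass the subscription key\n",
+ "client = AzureContentUnderstandingClient(\n",
+ "    endpoint=AZURE_AI_ENDPOINT,\n",
+ "    api_version=AZURE_AI_API_VERSION,\n",
+ "    subscription_key=AZURE_AI_API_KEY,\n",
+ ")\n",
+ "```"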
] }, { @@ -124,7 +124,7 @@ "from dotenv import find_dotenv, load_dotenv\n", "from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n", "\n", - "# import utility package from python samples root directory\n", + "# Import utility package from python samples root directory\n", "parent_dir = Path(Path.cwd()).parent\n", "sys.path.append(str(parent_dir))\n", "from python.content_understanding_client import AzureContentUnderstandingClient\n", @@ -132,9 +132,9 @@ "load_dotenv(find_dotenv())\n", "logging.basicConfig(level=logging.INFO)\n", "\n", - "# For authentication, you can use either token-based auth or subscription key, and only one of them is required\n", + "# For authentication, you can use either token-based auth or subscription key; only one is required\n", "AZURE_AI_ENDPOINT = os.getenv(\"AZURE_AI_ENDPOINT\")\n", - "# IMPORTANT: Replace with your actual subscription key or set up in \".env\" file if not using token auth\n", + "# IMPORTANT: Replace with your actual subscription key or set it in the \".env\" file if not using token auth\n", "AZURE_AI_API_KEY = os.getenv(\"AZURE_AI_API_KEY\")\n", "AZURE_AI_API_VERSION = os.getenv(\"AZURE_AI_API_VERSION\", \"2025-05-01-preview\")\n", "\n", @@ -146,9 +146,9 @@ " api_version=AZURE_AI_API_VERSION,\n", " # IMPORTANT: Comment out token_provider if using subscription key\n", " token_provider=token_provider,\n", - " # IMPORTANT: Uncomment this if using subscription key\n", + " # IMPORTANT: Uncomment this line if using subscription key\n", " # subscription_key=AZURE_AI_API_KEY,\n", - " x_ms_useragent=\"azure-ai-content-understanding-python/pro_mode\", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.\n", + " x_ms_useragent=\"azure-ai-content-understanding-python/pro_mode\", # This header is used for sample usage telemetry; comment out if you want to opt out.\n", ")" ] }, @@ -156,14 +156,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Prepare reference data\n", - "In this step, we will \n", - "- Use `REFERENCE_DOC_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n", - "- Try to get the SAS URL from the environment variable `REFERENCE_DOC_SAS_URL`.\n", - "If this is not set, we attempt to generate the SAS URL automatically using the environment variables `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`.\n", - "- Use Azure AI service to Extract OCR results from reference documents (if needed).\n", + "## Prepare Reference Data\n", + "In this step, we will:\n", + "- Use `REFERENCE_DOC_PATH` and SAS URL related environment variables that were set in the Prerequisites.\n", + "- Attempt to obtain the SAS URL from the `REFERENCE_DOC_SAS_URL` environment variable.\n", + " If this is not set, the SAS URL will be generated automatically using the `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME` environment variables.\n", + "- Use Azure AI service to extract OCR results from reference documents if needed.\n", "- Generate a reference `.jsonl` file.\n", - "- Upload these files to the designated Azure blob storage.\n" + "- Upload these files to the designated Azure Blob Storage.\n" ] }, { @@ -181,7 +181,7 @@ " REFERENCE_DOC_CONTAINER_NAME = os.getenv(\"REFERENCE_DOC_CONTAINER_NAME\")\n", " if REFERENCE_DOC_STORAGE_ACCOUNT_NAME and REFERENCE_DOC_CONTAINER_NAME:\n", " from azure.storage.blob import ContainerSasPermissions\n", - " # We will need \"Write\" for uploading, modifying, or 
appending blobs\n", + " # We require \"Write\" permission to upload, modify, or append blobs\n", " reference_doc_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n", " account_name=REFERENCE_DOC_STORAGE_ACCOUNT_NAME,\n", " container_name=REFERENCE_DOC_CONTAINER_NAME,\n", @@ -194,7 +194,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data. Please skip or comment out below section to skip the preparation of reference documents." + "> ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using only input documents. For example, the service can reason across two or more input files without any reference data. To skip reference document preparation, skip or comment out the code in the following section." ] }, { @@ -204,9 +204,9 @@ "outputs": [], "source": [ "# Set skip_analyze to True if you already have OCR results for the documents in the reference_docs folder\n", - "# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix \".result.json\"\n", + "# Please name the OCR result files with the same name as the original documents including extension, and add the suffix \".result.json\"\n", "# For example, if the original document is \"invoice.pdf\", the OCR result file should be named \"invoice.pdf.result.json\"\n", - "# NOTE: Please comment out the follwing line if you don't have any reference documents.\n", + "# NOTE: Comment out the following line if you do not have any reference documents.\n", "await client.generate_knowledge_base_on_blob(reference_docs, reference_doc_sas_url, reference_doc_path, skip_analyze=False)" ] }, @@ -214,10 +214,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Create analyzer with defined schema for Pro mode\n", - "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n", + "## Create Analyzer with Defined Schema for Pro Mode\n", + "Before creating the analyzer, assign a relevant name to the constant **CUSTOM_ANALYZER_ID** for your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n", "\n", - "We use **reference_doc_sas_url** and **reference_doc_path** that's set up in the [.env](./.env) file and used in the previous step." + "We use **reference_doc_sas_url** and **reference_doc_path** configured in the [.env](./.env) file and utilized in the previous step." ] }, { @@ -250,9 +250,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Use created analyzer to analyze the input documents\n", - "After the analyzer is successfully created, we can use it to analyze our input files.\n", - "> NOTE: Pro mode does multi-step reasoning and may take a longer time to analyze." + "## Use Created Analyzer to Analyze Input Documents\n", + "After the analyzer is successfully created, it can be used to analyze input files.\n", + "> NOTE: Pro mode performs multi-step reasoning and may require longer analysis times." 
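+ "\n",
+ "> For long-running inputs, a hedged sketch that widens the polling window; it assumes `poll_result` raises `TimeoutError` on expiry (verify against [content_understanding_client.py](../python/content_understanding_client.py)):\n",
+ "\n",
+ "```python\n",
+ "# Widen the polling window for slow multi-step Pro mode runs\n",
+ "try:\n",
+ "    result_json = client.poll_result(response, timeout_seconds=1200)\n",
+ "except TimeoutError:\n",
+ "    logging.warning(\"Analysis still running after 20 minutes; retry with a larger timeout.\")\n",
+ "```"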
] }, { @@ -264,9 +264,9 @@ "from IPython.display import FileLink, display\n", "\n", "response = client.begin_analyze(CUSTOM_ANALYZER_ID, file_location=input_docs)\n", - "result_json = client.poll_result(response, timeout_seconds=600) # set a longer timeout for pro mode\n", + "result_json = client.poll_result(response, timeout_seconds=600) # Use a longer timeout for Pro mode\n", "\n", - "# Create the output directory if it doesn't exist\n", + "# Create output directory if it doesn't exist\n", "output_dir = \"output\"\n", "os.makedirs(output_dir, exist_ok=True)\n", "\n", @@ -282,7 +282,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Let's take a look at the extracted fields with Pro mode " + "> Let's review the extracted fields with Pro mode." ] }, { @@ -299,15 +299,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> As seen in the field `PaymentTermsInconsistencies`, for example, the purchase contract has detailed payment terms that were agreed to prior to the service. However, the implied payment terms on the invoice conflict with this. Pro mode was able to identify the corresponding contract for this invoice from the reference documents and then analyze the contract together with the invoice to discover this inconsistency." + "> As seen in the `PaymentTermsInconsistencies` field, for example, the purchase contract details the payment terms agreed upon prior to the service. However, the implied payment terms on the invoice conflict with these. Pro mode identified the corresponding contract from the reference documents and analyzed it alongside the invoice to discover this inconsistency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Delete exist analyzer in Content Understanding Service\n", - "This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. Without deletion, the analyzer will remain in your service for subsequent reuse." + "## Delete Existing Analyzer in Content Understanding Service\n", + "This step is optional. It removes the test analyzer from your service to prevent it from persisting. Without deletion, the analyzer will remain for potential reuse." ] }, { @@ -323,15 +323,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Bonus sample\n", - "We would like to introduce another sample to highlight how Pro mode supports multi-document input and advanced reasoning. Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents within a single analysis call. With Pro mode, the service not only processes each document independently, but also cross-references the documents to perform reasoning across them, enabling deeper insights and validation." + "## Bonus Sample\n", + "This sample demonstrates how Pro mode supports multi-document input and advanced reasoning. Unlike Document Standard Mode, which processes one document at a time, Pro mode can analyze multiple documents within a single analysis call. The service not only processes each document independently but also cross-references documents to perform reasoning across them, enabling deeper insights and validation." 
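+ "\n",
+ "> The N:1 call pattern is the same one used above: point `file_location` at a folder and every file in it is analyzed together in a single request. A sketch using the second sample's variables, which are defined in the cells below:\n",
+ "\n",
+ "```python\n",
+ "# All files in the folder are sent in one call and produce one combined result\n",
+ "response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)\n",
+ "result_json = client.poll_result(response, timeout_seconds=600)\n",
+ "```"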
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### First, we need to set up variables for the second sample" + "### First, Set Up Variables for the Second Sample" ] }, { @@ -354,8 +354,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Generate knowledge base for the second sample\n", - "Let's upload [refernce documents](../data/field_extraction_pro_mode/insurance_claims_review/reference_docs/) with existing OCR results for the second sample. These documents contain driver coverage policy that are useful in reviewing insurance claims." + "### Generate Knowledge Base for the Second Sample\n", + "Upload [reference documents](../data/field_extraction_pro_mode/insurance_claims_review/reference_docs/) with existing OCR results for the second sample. These documents contain driver coverage policies useful for reviewing insurance claims." ] }, { @@ -373,8 +373,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Create analyzer for the second sample\n", - "We can reuse previous AzureContentUnderstandingClient" + "### Create Analyzer for the Second Sample\n", + "Reuse the existing AzureContentUnderstandingClient." ] }, { @@ -404,13 +404,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Analyze the multiple input documents with the second analyzer\n", - "Please note that the [input_docs_2](../data/field_extraction_pro_mode/insurance_claims_review/input_docs/) directory contains two PDF files as input: one is a car accident report, and the other is a repair estimate.\n", + "### Analyze Multiple Input Documents with the Second Analyzer\n", + "The [input_docs_2](../data/field_extraction_pro_mode/insurance_claims_review/input_docs/) folder contains two PDFs as input: a car accident report and a repair estimate.\n", "\n", - "The first document includes details such as the car’s license plate number, vehicle model, and other incident-related information.\n", + "The first document includes details such as the car’s license plate number, vehicle model, and incident information.\n", "The second document provides a breakdown of the estimated repair costs.\n", "\n", - "Due to the complexity of this multi-document scenario and the processing involved, it may take a few minutes to generate the results." + "Because of the complexity of this multi-document scenario, analysis may take several minutes." 
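+ "\n",
+ "> As a quick sanity check before the long-running call below, you can list exactly which files will be submitted together; a minimal sketch:\n",
+ "\n",
+ "```python\n",
+ "from pathlib import Path\n",
+ "\n",
+ "# Both PDFs are submitted in a single analysis call\n",
+ "for doc in sorted(Path(input_docs_2).glob(\"*.pdf\")):\n",
+ "    print(doc.name)\n",
+ "```"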
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logging.info(\"Start analyzing input documents for the second sample...\")\n", "response = client.begin_analyze(CUSTOM_ANALYZER_ID_2, file_location=input_docs_2)\n", - "result_json = client.poll_result(response, timeout_seconds=600) # set a longer timeout for pro mode\n", + "result_json = client.poll_result(response, timeout_seconds=600) # Use a longer timeout for Pro mode\n", "\n", "# Save the result to a JSON file\n", - "# Create the output directory if it doesn't exist\n", + "# Create output directory if it doesn't exist\n", "output_dir = \"output\"\n", "os.makedirs(output_dir, exist_ok=True)\n", "output_path = os.path.join(output_dir, f\"{CUSTOM_ANALYZER_ID_2}_result.json\")\n", @@ -439,7 +439,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Let's take a look at the analyze result" + "### Review the Analysis Result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -455,16 +455,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Let's take a deeper look at `LineItemCorroboration` field in the result" + "### Examine the `LineItemCorroboration` Field in Detail" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "> We can see that the field `ReportingOfficer` is only available in the car accident report, while fields like `VIN` come solely from the repair estimate document. This shows that information is extracted from both documents to generate a single result. It also illustrates the N:1 relationship between the inputs and the analysis result. \n", - "\n", - "> Multiple input documents are combined to produce one unified output. There is always one analysis result, and this is not a batch model where N input documents would yield N outputs." + "> The `ReportingOfficer` field is only present in the car accident report, while fields such as `VIN` come exclusively from the repair estimate document. This demonstrates how information is extracted from both documents to generate a single unified result.\n", + "> \n", + "> Multiple input documents are combined to produce one consolidated output. There is always a single analysis result, unlike a batch model where N input documents yield N outputs." ] }, { @@ -481,14 +481,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> In the `LineItemCorroboration` field, we see that each line item, generated from *repair estimate document*, is extracted with its corresponding information, claim status, and evidence. Items that are not covered by the policy, such as the Starbucks drink and hotel stay, are marked as suspicious, while damage repairs that are supported by the supplied documents in the claim and are permitted by the policy are confirmed." + "> Within the `LineItemCorroboration` field, each line item from the *repair estimate document* is extracted along with its claim status and evidence. Items not covered by the policy, such as a Starbucks drink and hotel stay, are marked as suspicious. Valid damage repairs supported by the claim documents and permitted under the policy are confirmed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### [Optional] Delete the analyzer for second sample after use" + "### [Optional] Delete the Analyzer for the Second Sample After Use" ] }, { @@ -522,4 +522,4 @@ }, "nbformat": 4, "nbformat_minor": 2 -} +} \ No newline at end of file