diff --git a/notebooks/analyzer_training.ipynb b/notebooks/analyzer_training.ipynb
index 773277f..58a1dd0 100644
--- a/notebooks/analyzer_training.ipynb
+++ b/notebooks/analyzer_training.ipynb
@@ -4,30 +4,29 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Enhance your analyzer with labeled data\n",
+    "# Enhance Your Analyzer with Labeled Data\n",
     "\n",
     "\n",
     "> #################################################################################\n",
     ">\n",
-    "> Note: Currently this feature is only available for analyzer scenario is `document`\n",
+    "> Note: Currently, this feature is only available when the analyzer scenario is set to `document`.\n",
     ">\n",
     "> #################################################################################\n",
     "\n",
-    "Labeled data is a group of samples that have been tagged with one or more labels to add context or meaning, which is used to improve analyzer's performance.\n",
+    "Labeled data consists of samples that have been tagged with one or more labels to add context or meaning. This additional information is used to improve the analyzer's performance.\n",
     "\n",
-    "In your own project, you will use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to use the labeling tool to annotate your data.\n",
-    "\n",
-    "In this notebook we will demonstrate after you have the labeled data, how to create analyzer with them and analyze your files.\n",
+    "In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.\n",
     "\n",
+    "This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.\n",
     "\n",
     "\n",
     "## Prerequisites\n",
-    "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
-    "2. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related environment variables into the [.env](./.env) file.\n",
-    "   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
-    "   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n",
-    "   - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where training data will be uploaded.\n",
-    "3. Install packages needed to run the sample\n"
+    "1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).\n",
+    "2. Set environment variables related to training data by following the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and adding them to the [.env](./.env) file.\n",
+    "   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
+    "   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically during later steps.\n",
+    "   - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.\n",
+    "3. Install the packages required to run the sample:\n"
    ]
   },
   {
@@ -43,13 +42,13 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Analyzer template and local training folder set up\n",
-    "In this sample we define a template for receipts.\n",
+    "## Analyzer Template and Local Training Folder Setup\n",
+    "In this sample, we define a template for receipts.\n",
     "\n",
     "The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:\n",
     "- The original file (e.g., PDF or image).\n",
-    "- A corresponding labels.json file with labeled fields.\n",
-    "- A corresponding result.json file with OCR results."
+    "- A corresponding `labels.json` file with labeled fields.\n",
+    "- A corresponding `result.json` file with OCR results."
    ]
   },
   {
@@ -66,15 +65,17 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Create Azure content understanding client\n",
-    "> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility class that contains the functions, Before the release of the Content Understanding SDK, please consider it a lightweight SDK., Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
+    "## Create Azure Content Understanding Client\n",
+    "> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains helper functions. Before the official release of the Content Understanding SDK, please consider it a lightweight SDK.\n",
+    ">\n",
+    "> Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
     "\n",
     "> ⚠️ Important:\n",
     "You must update the code below to match your Azure authentication method.\n",
     "Look for the `# IMPORTANT` comments and modify those sections accordingly.\n",
     "If you skip this step, the sample may not run correctly.\n",
     "\n",
-    "> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments."
+    "> ⚠️ Note: While using a subscription key works, using a token provider with Azure Active Directory (AAD) is safer and highly recommended for production environments."
    ]
   },
   {
@@ -91,7 +92,7 @@
     "from dotenv import find_dotenv, load_dotenv\n",
     "from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n",
     "\n",
-    "# import utility package from python samples root directory\n",
+    "# Import utility package from the Python samples root directory\n",
     "parent_dir = Path(Path.cwd()).parent\n",
     "sys.path.append(str(parent_dir))\n",
     "from python.content_understanding_client import AzureContentUnderstandingClient\n",
@@ -109,7 +110,7 @@
     "    token_provider=token_provider,\n",
     "    # IMPORTANT: Uncomment this if using subscription key\n",
     "    # subscription_key=os.getenv(\"AZURE_AI_API_KEY\"),\n",
-    "    x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.\n",
+    "    x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry; please comment out this line if you want to opt out.\n",
     ")"
    ]
   },
   {
@@ -117,12 +118,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Prepare labeled data\n",
-    "In this step, we will\n",
-    "- Use `TRAINING_DATA_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n",
-    "- Try to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
-    "If this is not set, we attempt to generate the SAS URL automatically using the environment variables `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`.\n",
-    "- Verify that document files in the local folder have corresponding `.labels.json` and `.result.json` files\n",
+    "## Prepare Labeled Data\n",
+    "In this step, we will:\n",
+    "- Use `TRAINING_DATA_PATH` and the SAS URL-related environment variables set in the Prerequisites step.\n",
+    "- Attempt to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
+    "- If `TRAINING_DATA_SAS_URL` is not set, try generating it automatically using the `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.\n",
+    "- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.\n",
     "- Upload these files to the Azure Blob storage container specified by the environment variables."
    ]
   },
   {
@@ -138,10 +139,10 @@
     "    TRAINING_DATA_CONTAINER_NAME = os.getenv(\"TRAINING_DATA_CONTAINER_NAME\")\n",
     "    if not TRAINING_DATA_STORAGE_ACCOUNT_NAME and not training_data_sas_url:\n",
     "        raise ValueError(\n",
-    "            \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
+    "            \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
     "        )\n",
     "    from azure.storage.blob import ContainerSasPermissions\n",
-    "    # We will need \"Write\" for uploading, modifying, or appending blobs\n",
+    "    # We require \"Read\", \"Write\", and \"List\" permissions for uploading, modifying, or listing blobs\n",
     "    training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
     "        account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,\n",
     "        container_name=TRAINING_DATA_CONTAINER_NAME,\n",
@@ -158,10 +159,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Create analyzer with defined schema\n",
-    "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
+    "## Create Analyzer with Defined Schema\n",
+    "Before creating the analyzer, fill in the constant `ANALYZER_ID` with a relevant name for your task. In this example, we generate a unique suffix so that this cell can be run multiple times to create different analyzers.\n",
     "\n",
-    "We use **training_data_sas_url** and **training_data_path** that's set up in the [.env](./.env) file and used in the previous step."
+    "We use **training_data_sas_url** and **training_data_path** as set in the [.env](./.env) file and used in the previous step."
    ]
   },
   {
@@ -194,8 +195,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Use created analyzer to extract document content\n",
-    "After the analyzer is successfully created, we can use it to analyze our input files."
+    "## Use Created Analyzer to Extract Document Content\n",
+    "After the analyzer is successfully created, you can use it to analyze your input files."
    ]
   },
   {
@@ -214,8 +215,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Delete exist analyzer in Content Understanding Service\n",
-    "This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. Without deletion, the analyzer will remain in your service for subsequent reuse."
+    "## Delete Existing Analyzer in Content Understanding Service\n",
+    "This snippet is optional and is included to prevent test analyzers from remaining in your service. Without deletion, the analyzer will stay in your service and may be reused in subsequent operations."
    ]
   },
   {
@@ -249,4 +250,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 2
-}
+}
\ No newline at end of file