Review main-notebooks/analyzer_training.ipynb #73

Open: wants to merge 1 commit into base: main
75 changes: 38 additions & 37 deletions notebooks/analyzer_training.ipynb
@@ -4,30 +4,29 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Enhance your analyzer with labeled data\n",
"# Enhance Your Analyzer with Labeled Data\n",
"\n",
Collaborator Author

  • categories: [Consistency, Formatting]
    • change: Capitalized each major word in the notebook's top-level heading, from “Enhance your analyzer with labeled data” to “Enhance Your Analyzer with Labeled Data.”
    • rationale: This aligns the heading with the title casing conventionally used in documentation headings, making it consistent with typical formatting standards.
    • impact: Improves readability and presents a more professional, polished appearance in the documentation.

"\n",
"> #################################################################################\n",
">\n",
"> Note: Currently this feature is only available for analyzer scenario is `document`\n",
"> Note: Currently, this feature is only available when the analyzer scenario is set to `document`.\n",
">\n",
Collaborator Author

  • categories: [Grammar, Clarity]
    • change: Revised the sentence from "Currently this feature is only available for analyzer scenario is document" to "Currently, this feature is only available when the analyzer scenario is set to document."
    • rationale: Corrected grammatical errors and restructured the sentence for clearer, more natural English.
    • impact: Enhances readability and ensures the note communicates the intended information more effectively to users.

"> #################################################################################\n",
"\n",
"Labeled data is a group of samples that have been tagged with one or more labels to add context or meaning, which is used to improve analyzer's performance.\n",
"Labeled data consists of samples that have been tagged with one or more labels to add context or meaning. This additional information is used to improve the analyzer's performance.\n",
"\n",
"In your own project, you will use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to use the labeling tool to annotate your data.\n",
"\n",
"In this notebook we will demonstrate after you have the labeled data, how to create analyzer with them and analyze your files.\n",
"In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.\n",
"\n",
"This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.\n",
"\n",
Collaborator Author

  • categories: [Grammar, Clarity]

    • change: Updated the description of labeled data for improved sentence structure and clarity.
    • rationale: The original sentence was somewhat awkward and contained a grammatical issue ("improve analyzer's performance" missing article), so it was rephrased to be clearer and grammatically correct.
    • impact: Provides a clearer and more professional explanation of what labeled data is, improving reader comprehension.
  • categories: [Grammar, Clarity, Consistency]

    • change: Changed "In your own project, you will use..." to "In your own projects, you can use..."; simplified the sentence about using Azure AI Foundry.
    • rationale: Pluralizing "project" makes it more general and the modal verb "can" indicates possibility rather than obligation, which is more instructive. The sentence was also streamlined for better readability.
    • impact: Enhances the tone to be more helpful and less prescriptive, improving user guidance and readability.
  • categories: [Grammar, Clarity]

    • change: Rephrased the original sentence "In this notebook we will demonstrate after you have the labeled data, how to create analyzer with them and analyze your files." to a clearer and grammatically correct sentence: "This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward."
    • rationale: The original sentence had awkward phrasing and was grammatically incorrect ("create analyzer with them"). The revised sentence is more concise and easier to understand.
    • impact: Improves clarity and professionalism of the documentation, making the instructions easier to follow.

"\n",
"## Prerequisites\n",
"1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
"2. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related environment variables into the [.env](./.env) file.\n",
" - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
" - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n",
" - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where training data will be uploaded.\n",
"3. Install packages needed to run the sample\n"
"1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).\n",
"2. Set environment variables related to training data by following the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and adding them to the [.env](./.env) file.\n",
" - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
" - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically during later steps.\n",
" - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.\n",
"3. Install the packages required to run the sample:\n"
]
Collaborator Author

  • categories: [Grammar, Clarity, Consistency]

    • change: Rephrased step 1 from "Ensure Azure AI service is configured following [steps]" to "Ensure your Azure AI service is configured by following the [configuration steps]".
    • rationale: Improved sentence structure for better readability and added possessive "your" for personalization and clarity. Also made the link text more descriptive.
    • impact: Enhanced readability and user engagement by making instructions clearer and more direct.
  • categories: [Grammar, Clarity, Consistency]

    • change: Rewrote step 2 to explicitly mention setting environment variables by "following the steps ... and adding them to the [.env] file," restructuring list items with consistent indentation and clearer wording.
    • rationale: Improved clarity and consistency by making the instruction more explicit and simplifying list item structures to enhance understanding.
    • impact: Users can more easily understand and follow the instructions for configuring environment variables.
  • categories: [Grammar, Clarity]

    • change: Reworded list items under step 2 to use parallel structure and remove redundant phrases (e.g., "to generate the SAS URL automatically during later steps" vs. "so the SAS URL can be generated automatically during one of the later steps").
    • rationale: Improved parallelism and conciseness for better readability.
    • impact: Instructions become easier to follow and less verbose, reducing potential confusion.
  • categories: [Grammar, Clarity]

    • change: Modified step 3 from "Install packages needed to run the sample" to "Install the packages required to run the sample:" including the colon punctuation.
    • rationale: Added definite article for grammatical correctness and introduced colon to signal an upcoming list or important information.
    • impact: Provides clearer instruction and better formatting for ease of reading.

},
{
@@ -43,13 +42,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzer template and local training folder set up\n",
"In this sample we define a template for receipts.\n",
"## Analyzer Template and Local Training Folder Setup\n",
"In this sample, we define a template for receipts.\n",
"\n",
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Capitalized "Analyzer Template and Local Training Folder Setup" and added a comma after "In this sample"
    • rationale: The title was changed to title case for consistency with headings, and a comma was added to improve sentence flow
    • impact: Enhances readability and maintains a consistent style throughout the documentation

"The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:\n",
"- The original file (e.g., PDF or image).\n",
"- A corresponding labels.json file with labeled fields.\n",
"- A corresponding result.json file with OCR results."
"- A corresponding `labels.json` file with labeled fields.\n",
"- A corresponding `result.json` file with OCR results."
]
Collaborator Author

  • categories: [Consistency, Formatting]
    • change: Added backticks around labels.json and result.json filenames.
    • rationale: Using backticks visually distinguishes filenames as code or file references, aligning with common documentation practices.
    • impact: Enhances readability and clarity by clearly indicating these are file names, improving the overall quality of the documentation.

},
{
@@ -66,15 +65,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create Azure content understanding client\n",
"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility class that contains the functions, Before the release of the Content Understanding SDK, please consider it a lightweight SDK., Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
"## Create Azure Content Understanding Client\n",
"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains helper functions. Before the official release of the Content Understanding SDK, please consider it a lightweight SDK.\n",
">\n",
"> Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n",
"\n",
Collaborator Author

  • categories: [Typo Fix, Grammar, Clarity]

    • change: Corrected "utility class that contains the functions" to "a utility class that contains helper functions" and improved sentence structure for better readability.
    • rationale: The original sentence was awkward and contained grammatical errors; revising it clarifies the description of the client class and improves readability.
    • impact: Enhances user understanding of the AzureContentUnderstandingClient by providing a clearer and grammatically correct explanation.
  • categories: [Clarity, Formatting]

    • change: Split a long, run-on sentence about the SDK's status and filling in constants into two separate sentences and added a blank line for separation.
    • rationale: Separating ideas into distinct sentences avoids confusion and improves the document’s visual structure for easier scanning.
    • impact: Improves clarity by clearly differentiating the explanation about the SDK status from instructions related to configuration constants.

"> ⚠️ Important:\n",
"You must update the code below to match your Azure authentication method.\n",
"Look for the `# IMPORTANT` comments and modify those sections accordingly.\n",
"If you skip this step, the sample may not run correctly.\n",
"\n",
"> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments."
"> ⚠️ Note: While using a subscription key works, using a token provider with Azure Active Directory (AAD) is safer and highly recommended for production environments."
]
Collaborator Author

  • categories: [Clarity, Grammar]
    • change: Reworded the note from "Using a subscription key works" to "While using a subscription key works," and replaced "is much safer" with "is safer" regarding the use of Azure Active Directory token providers.
    • rationale: The insertion of "While" creates a clearer contrast between the two options. Removing "much" simplifies the statement, making it more direct and professional in tone.
    • impact: This change improves the readability and clarity of the note, making the recommendation stronger and more concise for users.
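
To make the contrast in this note concrete, here is a minimal sketch of how a client could choose between the two authentication options. `build_auth_headers` is a hypothetical helper and is not taken from the sample's `AzureContentUnderstandingClient`; the header names are the ones commonly used by Azure AI services, and how the sample's client wires them internally is an assumption.

```python
from typing import Callable, Dict, Optional


def build_auth_headers(
    token_provider: Optional[Callable[[], str]] = None,
    subscription_key: Optional[str] = None,
) -> Dict[str, str]:
    """Hypothetical helper: prefer a token provider, fall back to a key."""
    if token_provider is not None:
        # Call the provider on every request so short-lived tokens stay fresh.
        return {"Authorization": f"Bearer {token_provider()}"}
    if subscription_key:
        # Key-based auth works, but the key is a long-lived secret.
        return {"Ocp-Apim-Subscription-Key": subscription_key}
    raise ValueError("Provide either token_provider or subscription_key.")
```

Preferring the token provider when both are supplied mirrors the note's recommendation; a real client may resolve the conflict differently.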

},
{
@@ -91,7 +92,7 @@
"from dotenv import find_dotenv, load_dotenv\n",
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n",
"\n",
"# import utility package from python samples root directory\n",
"# Import utility package from the Python samples root directory\n",
"parent_dir = Path(Path.cwd()).parent\n",
Collaborator Author

  • categories: [Grammar, Consistency]
    • change: Capitalized the first word of the comment ("import" to "Import") and changed "python" to "Python".
    • rationale: Proper nouns like "Python" should be capitalized, and comments should start with a capital letter for grammatical correctness and consistency.
    • impact: Improves readability and professionalism of the comment by adhering to standard writing conventions.

"sys.path.append(str(parent_dir))\n",
"from python.content_understanding_client import AzureContentUnderstandingClient\n",
@@ -109,20 +110,20 @@
" token_provider=token_provider,\n",
" # IMPORTANT: Uncomment this if using subscription key\n",
" # subscription_key=os.getenv(\"AZURE_AI_API_KEY\"),\n",
" x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.\n",
" x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry; please comment out this line if you want to opt out.\n",
")"
Collaborator Author

  • categories: [Grammar]
    • change: Replaced the comma with a semicolon in the comment between two independent clauses.
    • rationale: The semicolon correctly separates two related independent clauses, improving grammatical accuracy.
    • impact: Enhances the readability and professionalism of the comment by using proper punctuation.

]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare labeled data\n",
"In this step, we will\n",
"- Use `TRAINING_DATA_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n",
"- Try to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
"If this is not set, we attempt to generate the SAS URL automatically using the environment variables `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`.\n",
"- Verify that document files in the local folder have corresponding `.labels.json` and `.result.json` files\n",
"## Prepare Labeled Data\n",
"In this step, we will:\n",
"- Use the environment variables `TRAINING_DATA_PATH` and SAS URL related variables set in the Prerequisites step.\n",
"- Attempt to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
"- If `TRAINING_DATA_SAS_URL` is not set, try generating it automatically using `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.\n",
"- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.\n",
"- Upload these files to the Azure Blob storage container specified by the environment variables."
Collaborator Author

  • categories: [Clarity, Consistency, Grammar, Formatting]

    • change: Changed the header from "Prepare labeled data" to "Prepare Labeled Data" with capitalization of key words and added a colon after "In this step, we will".
    • rationale: To maintain consistency in header formatting and improve grammatical structure by indicating a list follows.
    • impact: Enhances readability and aligns with common Markdown and documentation style conventions.
  • categories: [Clarity, Grammar]

    • change: Rephrased the bullet points for clarity, such as explicitly mentioning the environment variables being "set in the Prerequisites step", using "Attempt" instead of "Try to," and reorganizing sentences for smoother flow.
    • rationale: To make instructions clearer and more direct, reducing ambiguity and improving professional tone.
    • impact: Users better understand the sequence of actions and the environment variables involved, minimizing confusion.
  • categories: [Clarity]

    • change: Modified the phrasing about verifying document files to specify "each document file" has corresponding .labels.json and .result.json files.
    • rationale: To clearly indicate that verification applies on a per-file basis.
    • impact: Prevents misinterpretation, ensuring users correctly perform verification for all relevant files.
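
For reference, the per-file verification described in that bullet could look roughly like the sketch below. It assumes the companion files are named by appending the suffix to the full document name (for example `receipt.pdf.labels.json`); the diff only says "corresponding `.labels.json` and `.result.json` files", so that naming and the helper `find_incomplete_documents` are assumptions, not the notebook's actual code.

```python
from pathlib import Path
from typing import Dict, List

LABEL_SUFFIX = ".labels.json"
RESULT_SUFFIX = ".result.json"


def find_incomplete_documents(folder: Path) -> Dict[str, List[str]]:
    """Map each document name to the companion suffixes it is missing."""
    missing: Dict[str, List[str]] = {}
    for path in sorted(folder.iterdir()):
        # Skip sub-directories and the companion files themselves.
        if not path.is_file() or path.name.endswith((LABEL_SUFFIX, RESULT_SUFFIX)):
            continue
        absent = [
            suffix
            for suffix in (LABEL_SUFFIX, RESULT_SUFFIX)
            if not (folder / (path.name + suffix)).exists()
        ]
        if absent:
            missing[path.name] = absent
    return missing
```

An empty result means every document in the flat folder has both companions and the folder is ready to upload.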

]
},
@@ -138,10 +139,10 @@
" TRAINING_DATA_CONTAINER_NAME = os.getenv(\"TRAINING_DATA_CONTAINER_NAME\")\n",
" if not TRAINING_DATA_STORAGE_ACCOUNT_NAME and not training_data_sas_url:\n",
" raise ValueError(\n",
" \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
" \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
" )\n",
" from azure.storage.blob import ContainerSasPermissions\n",
" # We will need \"Write\" for uploading, modifying, or appending blobs\n",
" # We require \"Read\", \"Write\", and \"List\" permissions for uploading, modifying, or listing blobs\n",
" training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
Collaborator Author

  • categories: [Typo Fix]

    • change: Corrected the misspelled environment variable name from TRAINING_DATA_STORAGE_ACCCOUNT_NAME to TRAINING_DATA_STORAGE_ACCOUNT_NAME.
    • rationale: The original variable name contained an extra 'C', which would cause confusion or errors when users set environment variables.
    • impact: Improves accuracy and prevents potential user errors in environment variable configuration.
  • categories: [Clarity]

    • change: Expanded the comment to specify that "Read", "Write", and "List" permissions are required for blob operations instead of only mentioning "Write".
    • rationale: The original comment was incomplete regarding permissions, possibly leading to misunderstandings about the necessary access rights.
    • impact: Provides clearer guidance on required permissions, helping developers correctly configure access for uploading, modifying, or listing blobs.
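
One further observation on this hunk: the condition shown checks only `TRAINING_DATA_STORAGE_ACCOUNT_NAME`, while the error message asks for both the account name and the container name, so a missing container name would not be caught here. A minimal sketch of the fallback order with a stricter check follows. `resolve_sas_url` is a hypothetical helper, and `generate` stands in for the sample's `generate_temp_container_sas_url`, whose full signature is not visible in the diff.

```python
from typing import Callable, Mapping


def resolve_sas_url(
    env: Mapping[str, str],
    generate: Callable[[str, str], str],
) -> str:
    """Return the configured SAS URL, or generate one from account and container."""
    sas_url = env.get("TRAINING_DATA_SAS_URL")
    if sas_url:
        return sas_url
    account = env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    container = env.get("TRAINING_DATA_CONTAINER_NAME")
    # Require both values, matching what the error message asks for.
    if not (account and container):
        raise ValueError(
            "Please set either TRAINING_DATA_SAS_URL or both "
            "TRAINING_DATA_STORAGE_ACCOUNT_NAME and "
            "TRAINING_DATA_CONTAINER_NAME environment variables."
        )
    return generate(account, container)
```

Passing the generator in as a callable keeps the sketch independent of the Azure SDK; in the notebook the same role would be played by the client's own SAS helper.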

" account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,\n",
" container_name=TRAINING_DATA_CONTAINER_NAME,\n",
@@ -158,10 +159,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create analyzer with defined schema\n",
"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
"## Create Analyzer with Defined Schema\n",
"Before creating the analyzer, fill in the constant `ANALYZER_ID` with a relevant name for your task. In this example, we generate a unique suffix so that this cell can be run multiple times to create different analyzers.\n",
"\n",
"We use **training_data_sas_url** and **training_data_path** that's set up in the [.env](./.env) file and used in the previous step."
"We use **training_data_sas_url** and **training_data_path** as set in the [.env](./.env) file and used in the previous step."
]
Collaborator Author

  • categories: [Consistency, Formatting]

    • change: Changed the heading from "Create analyzer with defined schema" to "Create Analyzer with Defined Schema" (capitalizing major words).
    • rationale: To maintain consistency with title capitalization conventions, improving the visual structure of the document.
    • impact: Enhances readability and professionalism of the documentation by adhering to common heading formatting standards.
  • categories: [Grammar, Clarity]

    • change: Revised sentence from "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task." to "Before creating the analyzer, fill in the constant ANALYZER_ID with a relevant name for your task."
    • rationale: Removed unnecessary modal verb "should", added code formatting for ANALYZER_ID, and replaced "to your task" with clearer "for your task".
    • impact: Improves clarity and directness of instructions, while also enhancing readability with code formatting.
  • categories: [Clarity, Grammar]

    • change: Modified "Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers." to "In this example, we generate a unique suffix so that this cell can be run multiple times to create different analyzers."
    • rationale: Replaced "Here" with "In this example" for clearer context; added "that" for grammatical correctness.
    • impact: Makes the explanation easier to understand and grammatically smoother.
  • categories: [Grammar, Clarity]

    • change: Changed "We use training_data_sas_url and training_data_path that's set up in the .env file and used in the previous step." to "We use training_data_sas_url and training_data_path as set in the .env file and used in the previous step."
    • rationale: Corrected subject-verb agreement by replacing "that's" (singular) with "as set" to match the plural subject and improve sentence flow.
    • impact: Avoids grammatical errors and improves readability.
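
The diff does not show how the unique suffix is produced. One common way to get the "run this cell multiple times" behavior the text describes is a random UUID suffix, sketched below; `make_analyzer_id` and the `train-sample` prefix are illustrative assumptions rather than the notebook's actual code.

```python
import uuid


def make_analyzer_id(prefix: str = "train-sample") -> str:
    """Append a random UUID so repeated runs create distinct analyzers."""
    return f"{prefix}-{uuid.uuid4()}"


ANALYZER_ID = make_analyzer_id()
```

Because each call yields a new ID, every run leaves a separate analyzer behind, which is why the optional deletion step at the end of the notebook matters.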

},
{
@@ -194,8 +195,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use created analyzer to extract document content\n",
"After the analyzer is successfully created, we can use it to analyze our input files."
"## Use Created Analyzer to Extract Document Content\n",
"After the analyzer is successfully created, you can use it to analyze your input files."
]
Collaborator Author

  • categories: [Consistency, Grammar]
    • change: Capitalized each major word in the heading and changed "we can use" to "you can use" in the sentence.
    • rationale: Capitalizing the heading improves consistency with common documentation styles, and switching to "you" creates a more direct and reader-focused tone.
    • impact: Enhances readability and aligns the documentation with standard style conventions, making instructions clearer and more engaging for the user.

},
{
@@ -214,8 +215,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Delete exist analyzer in Content Understanding Service\n",
"This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. Without deletion, the analyzer will remain in your service for subsequent reuse."
"## Delete Existing Analyzer in Content Understanding Service\n",
"This snippet is optional and is included to prevent test analyzers from remaining in your service. Without deletion, the analyzer will stay in your service and may be reused in subsequent operations."
]
Collaborator Author

  • categories: [Typo Fix, Grammar, Clarity]
    • change: Corrected "Delete exist analyzer" to "Delete Existing Analyzer" and rephrased the description for better readability and grammar.
    • rationale: The original phrase contained a typo ("exist" instead of "existing") and the description was awkwardly worded. The revision improves grammatical correctness and clarifies the purpose and optional nature of the snippet.
    • impact: Enhances professionalism and legibility of the documentation, making it easier for users to understand the snippet's intent and optional usage.

},
{
@@ -249,4 +250,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}
Collaborator Author

  • categories: [Formatting]
    • change: Added a newline character after a closing brace (}) that previously had none.
    • rationale: Ensures consistent file formatting by ending the file with a newline, adhering to common coding standards.
    • impact: Prevents potential issues with tools or editors that expect a newline at the end of the file and improves overall codebase consistency.
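
If it helps to check this kind of change across other notebooks, a small sketch of an end-of-file newline check is below. `ends_with_newline` is a hypothetical helper written for illustration, not part of this repository.

```python
from pathlib import Path


def ends_with_newline(path: Path) -> bool:
    """Return True if the file is non-empty and its last byte is a newline."""
    data = path.read_bytes()
    return data.endswith(b"\n")
```

An empty file returns `False` here, since there is no final byte to inspect.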