Review main-notebooks/analyzer_training.ipynb
#73
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,30 +4,29 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Enhance your analyzer with labeled data\n", | ||
"# Enhance Your Analyzer with Labeled Data\n", | ||
"\n", | ||
"\n", | ||
"> #################################################################################\n", | ||
">\n", | ||
"> Note: Currently this feature is only available for analyzer scenario is `document`\n", | ||
"> Note: Currently, this feature is only available when the analyzer scenario is set to `document`.\n", | ||
">\n", | ||
"> #################################################################################\n", | ||
"\n", | ||
"Labeled data is a group of samples that have been tagged with one or more labels to add context or meaning, which is used to improve analyzer's performance.\n", | ||
"Labeled data consists of samples that have been tagged with one or more labels to add context or meaning. This additional information is used to improve the analyzer's performance.\n", | ||
"\n", | ||
"In your own project, you will use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to use the labeling tool to annotate your data.\n", | ||
"\n", | ||
"In this notebook we will demonstrate after you have the labeled data, how to create analyzer with them and analyze your files.\n", | ||
"In your own projects, you can use [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/quickstart/use-ai-foundry) to annotate your data with the labeling tool.\n", | ||
"\n", | ||
"This notebook demonstrates how to create an analyzer using your labeled data and how to analyze your files afterward.\n", | ||
"\n", | ||
"\n", | ||
"## Prerequisites\n", | ||
"1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n", | ||
"2. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related environment variables into the [.env](./.env) file.\n", | ||
" - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n", | ||
" - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n", | ||
" - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where training data will be uploaded.\n", | ||
"3. Install packages needed to run the sample\n" | ||
"1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).\n", | ||
"2. Set environment variables related to training data by following the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) and adding them to the [.env](./.env) file.\n", | ||
" - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n", | ||
" - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` to generate the SAS URL automatically during later steps.\n", | ||
" - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where the training data will be uploaded.\n", | ||
"3. Install the packages required to run the sample:\n" | ||
] | ||
}, | ||
{ | ||
|
@@ -43,13 +42,13 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Analyzer template and local training folder set up\n", | ||
"In this sample we define a template for receipts.\n", | ||
"## Analyzer Template and Local Training Folder Setup\n", | ||
"In this sample, we define a template for receipts.\n", | ||
"\n", | ||
"The training folder should contain a flat (one-level) directory of labeled receipt documents. Each document includes:\n", | ||
"- The original file (e.g., PDF or image).\n", | ||
"- A corresponding labels.json file with labeled fields.\n", | ||
"- A corresponding result.json file with OCR results." | ||
"- A corresponding `labels.json` file with labeled fields.\n", | ||
"- A corresponding `result.json` file with OCR results." | ||
] | ||
}, | ||
{ | ||
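The folder layout described in this cell can be checked locally before anything is uploaded. The following is a minimal sketch, not the notebook's own code. It assumes the sidecar files are named by appending `.labels.json` and `.result.json` to the full document file name (for example `receipt.pdf.labels.json`), and the helper name `find_incomplete_documents` is hypothetical.

```python
from pathlib import Path

# Assumed naming convention: "<document name>.labels.json" and
# "<document name>.result.json" sit next to each document in a flat folder.
SIDECAR_SUFFIXES = (".labels.json", ".result.json")


def find_incomplete_documents(folder: Path) -> dict[str, list[str]]:
    """Map each document name to the sidecar files it is missing."""
    names = {p.name for p in folder.iterdir() if p.is_file()}
    missing: dict[str, list[str]] = {}
    for name in sorted(names):
        if name.endswith(SIDECAR_SUFFIXES):
            continue  # sidecar files are not documents themselves
        absent = [name + s for s in SIDECAR_SUFFIXES if name + s not in names]
        if absent:
            missing[name] = absent
    return missing
```

An empty result means every document in the folder has both companion files. Subdirectories are ignored, which matches the flat (one-level) layout the cell asks for.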
|
@@ -66,15 +65,17 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create Azure content understanding client\n", | ||
"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is utility class that contains the functions, Before the release of the Content Understanding SDK, please consider it a lightweight SDK., Fill in values for the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n", | ||
"## Create Azure Content Understanding Client\n", | ||
"> The [AzureContentUnderstandingClient](../python/content_understanding_client.py) is a utility class that contains helper functions. Before the official release of the Content Understanding SDK, please consider it a lightweight SDK.\n", | ||
">\n", | ||
"> Fill in the constants **AZURE_AI_ENDPOINT**, **AZURE_AI_API_VERSION**, and **AZURE_AI_API_KEY** with the information from your Azure AI Service.\n", | ||
"\n", | ||
"> ⚠️ Important:\n", | ||
"You must update the code below to match your Azure authentication method.\n", | ||
"Look for the `# IMPORTANT` comments and modify those sections accordingly.\n", | ||
"If you skip this step, the sample may not run correctly.\n", | ||
"\n", | ||
"> ⚠️ Note: Using a subscription key works, but using a token provider with Azure Active Directory (AAD) is much safer and is highly recommended for production environments." | ||
"> ⚠️ Note: While using a subscription key works, using a token provider with Azure Active Directory (AAD) is safer and highly recommended for production environments." | ||
] | ||
}, | ||
{ | ||
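To make the difference between the two authentication options concrete, here is a rough sketch of how a lightweight client could turn `token_provider` or `subscription_key` into request headers. This is an illustration only and may not match what `AzureContentUnderstandingClient` does internally. The function name `build_auth_headers` is hypothetical, and the header names `Ocp-Apim-Subscription-Key` and `x-ms-useragent` are assumptions based on common Azure AI service conventions.

```python
from typing import Callable, Optional


def build_auth_headers(
    token_provider: Optional[Callable[[], str]] = None,
    subscription_key: Optional[str] = None,
    x_ms_useragent: Optional[str] = None,
) -> dict[str, str]:
    """Build request headers, preferring a token provider over a key."""
    if token_provider is not None:
        # The provider is called per request so an expired token can be refreshed.
        headers = {"Authorization": f"Bearer {token_provider()}"}
    elif subscription_key:
        # Assumed header name for key-based access.
        headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    else:
        raise ValueError("Provide either token_provider or subscription_key.")
    if x_ms_useragent:
        # Optional telemetry header; leave it out to opt out.
        headers["x-ms-useragent"] = x_ms_useragent
    return headers
```

With a token provider, no long-lived secret has to be stored in the `.env` file, which is the reason the note in this cell recommends it for production.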
|
@@ -91,7 +92,7 @@ | |
"from dotenv import find_dotenv, load_dotenv\n", | ||
"from azure.identity import DefaultAzureCredential, get_bearer_token_provider\n", | ||
"\n", | ||
"# import utility package from python samples root directory\n", | ||
"# Import utility package from the Python samples root directory\n", | ||
"parent_dir = Path(Path.cwd()).parent\n", | ||
"sys.path.append(str(parent_dir))\n", | ||
"from python.content_understanding_client import AzureContentUnderstandingClient\n", | ||
|
@@ -109,20 +110,20 @@ | |
" token_provider=token_provider,\n", | ||
" # IMPORTANT: Uncomment this if using subscription key\n", | ||
" # subscription_key=os.getenv(\"AZURE_AI_API_KEY\"),\n", | ||
" x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.\n", | ||
" x_ms_useragent=\"azure-ai-content-understanding-python/analyzer_training\", # This header is used for sample usage telemetry; please comment out this line if you want to opt out.\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Prepare labeled data\n", | ||
"In this step, we will\n", | ||
"- Use `TRAINING_DATA_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n", | ||
"- Try to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n", | ||
"If this is not set, we attempt to generate the SAS URL automatically using the environment variables `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`.\n", | ||
"- Verify that document files in the local folder have corresponding `.labels.json` and `.result.json` files\n", | ||
"## Prepare Labeled Data\n", | ||
"In this step, we will:\n", | ||
"- Use the environment variables `TRAINING_DATA_PATH` and SAS URL related variables set in the Prerequisites step.\n", | ||
"- Attempt to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n", | ||
"- If `TRAINING_DATA_SAS_URL` is not set, try generating it automatically using `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME` environment variables.\n", | ||
"- Verify that each document file in the local folder has corresponding `.labels.json` and `.result.json` files.\n", | ||
"- Upload these files to the Azure Blob storage container specified by the environment variables." | ||
] | ||
}, | ||
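The fallback described in this cell (use `TRAINING_DATA_SAS_URL` if set, otherwise generate one from the account and container names) can be sketched as a small pure function. Note that the guard visible in the next hunk tests only `TRAINING_DATA_STORAGE_ACCOUNT_NAME`, while its error message asks for both variables. The sketch below checks both, which is a suggestion rather than the notebook's current behavior. The function name `resolve_sas_source` is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_sas_source(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured SAS URL, or None if one must be generated.

    Raises ValueError when neither the SAS URL nor BOTH the storage account
    name and the container name are available.
    """
    sas_url = env.get("TRAINING_DATA_SAS_URL")
    if sas_url:
        return sas_url
    account = env.get("TRAINING_DATA_STORAGE_ACCOUNT_NAME")
    container = env.get("TRAINING_DATA_CONTAINER_NAME")
    if not (account and container):
        raise ValueError(
            "Please set either TRAINING_DATA_SAS_URL or both "
            "TRAINING_DATA_STORAGE_ACCOUNT_NAME and "
            "TRAINING_DATA_CONTAINER_NAME environment variables."
        )
    return None  # caller generates a temporary SAS URL from account + container
```

Passing the environment as a mapping keeps the function easy to test with a plain dictionary instead of the real process environment.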
|
@@ -138,10 +139,10 @@ | |
" TRAINING_DATA_CONTAINER_NAME = os.getenv(\"TRAINING_DATA_CONTAINER_NAME\")\n", | ||
" if not TRAINING_DATA_STORAGE_ACCOUNT_NAME and not training_data_sas_url:\n", | ||
" raise ValueError(\n", | ||
" \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n", | ||
" \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n", | ||
" )\n", | ||
" from azure.storage.blob import ContainerSasPermissions\n", | ||
" # We will need \"Write\" for uploading, modifying, or appending blobs\n", | ||
" # We require \"Read\", \"Write\", and \"List\" permissions for uploading, modifying, or listing blobs\n", | ||
" training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n", | ||
" account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,\n", | ||
" container_name=TRAINING_DATA_CONTAINER_NAME,\n", | ||
|
@@ -158,10 +159,10 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create analyzer with defined schema\n", | ||
"Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n", | ||
"## Create Analyzer with Defined Schema\n", | ||
"Before creating the analyzer, fill in the constant `ANALYZER_ID` with a relevant name for your task. In this example, we generate a unique suffix so that this cell can be run multiple times to create different analyzers.\n", | ||
"\n", | ||
"We use **training_data_sas_url** and **training_data_path** that's set up in the [.env](./.env) file and used in the previous step." | ||
"We use **training_data_sas_url** and **training_data_path** as set in the [.env](./.env) file and used in the previous step." | ||
] | ||
}, | ||
{ | ||
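The "unique suffix" mentioned in this cell can be produced with the standard library. A minimal sketch follows, assuming a UUID-based suffix is acceptable as an analyzer ID. The prefix `train-sample` and the helper name `make_analyzer_id` are placeholders, not values taken from the notebook.

```python
import uuid


def make_analyzer_id(prefix: str = "train-sample") -> str:
    """Return an ID that differs on every call, so the cell can be re-run."""
    return f"{prefix}-{uuid.uuid4()}"


ANALYZER_ID = make_analyzer_id()
```

Because each run yields a new ID, repeated runs create separate analyzers rather than colliding with an existing one. This is also why the later cleanup step is worth keeping.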
|
@@ -194,8 +195,8 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Use created analyzer to extract document content\n", | ||
"After the analyzer is successfully created, we can use it to analyze our input files." | ||
"## Use Created Analyzer to Extract Document Content\n", | ||
"After the analyzer is successfully created, you can use it to analyze your input files." | ||
] | ||
}, | ||
{ | ||
|
@@ -214,8 +215,8 @@ | |
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Delete exist analyzer in Content Understanding Service\n", | ||
"This snippet is not required, but it's only used to prevent the testing analyzer from residing in your service. Without deletion, the analyzer will remain in your service for subsequent reuse." | ||
"## Delete Existing Analyzer in Content Understanding Service\n", | ||
"This snippet is optional and is included to prevent test analyzers from remaining in your service. Without deletion, the analyzer will stay in your service and may be reused in subsequent operations." | ||
] | ||
}, | ||
{ | ||
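If the goal is to make sure test analyzers never linger, even when analysis fails halfway, the optional deletion can be tied to the analyzer's lifetime. The sketch below uses a context manager with caller-supplied `create` and `delete` callables. Both are hypothetical stand-ins for whatever the client exposes, since the client's method names are not shown in this diff.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def temporary_analyzer(
    analyzer_id: str,
    create: Callable[[str], None],
    delete: Callable[[str], None],
) -> Iterator[str]:
    """Create an analyzer, yield its ID, and always delete it afterwards."""
    create(analyzer_id)
    try:
        yield analyzer_id
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        delete(analyzer_id)
```

Skip this pattern if you want the analyzer to remain in the service for later reuse, as the cell text notes.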
|
@@ -249,4 +250,4 @@ | |
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} | ||
} | ||