|
28 | 28 | "source": [
|
29 | 29 | "## Prerequisites\n",
|
30 | 30 | "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
|
31 | | - "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.\n", |
32 | | - " - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.\n", |
33 | | - " - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs.\n", |
| 31 | + "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up reference document related environment variables in the [.env](./.env) file.\n", |
| 32 | + " - You can either set `REFERENCE_DOC_SAS_URL` directly with the SAS URL for your Azure Blob container,\n", |
| 33 | + " - Or set both `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`, so the SAS URL can be generated automatically during one of the later steps.\n", |
| 34 | + " - Also set `REFERENCE_DOC_PATH` to specify the folder path within the container where reference documents will be uploaded.\n", |
34 | 35 | " > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.\n",
|
35 | 36 | "1. Install the required packages to run the sample."
|
36 | 37 | ]
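Before the later cells run, it can help to sanity-check that the prerequisite variables above are actually visible to the notebook. The sketch below is only illustrative: it assumes the sample's usual `python-dotenv` setup and uses exactly the variable names listed in the bullet above.

```python
# Minimal sketch (assumes python-dotenv and a .env file next to the notebook).
import os

from dotenv import load_dotenv

load_dotenv()

sas_url = os.getenv("REFERENCE_DOC_SAS_URL")
account_name = os.getenv("REFERENCE_DOC_STORAGE_ACCOUNT_NAME")
container_name = os.getenv("REFERENCE_DOC_CONTAINER_NAME")
reference_path = os.getenv("REFERENCE_DOC_PATH")

if not sas_url and not (account_name and container_name):
    # Reference documents are optional in Pro mode, so this is just a reminder.
    print(
        "No reference-document storage configured: set REFERENCE_DOC_SAS_URL, or both "
        "REFERENCE_DOC_STORAGE_ACCOUNT_NAME and REFERENCE_DOC_CONTAINER_NAME, in .env"
    )
```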
|
|
157 | 158 | "source": [
|
158 | 159 | "## Prepare reference data\n",
|
159 | 160 | "In this step, we will \n",
|
| 161 | + "- Use `REFERENCE_DOC_PATH` and SAS URL related environment variables that were set in the Prerequisites step.\n", |
| 162 | + "- Try to get the SAS URL from the environment variable `REFERENCE_DOC_SAS_URL`.\n", |
| 163 | + "If this is not set, we attempt to generate the SAS URL automatically using the environment variables `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`.\n", |
160 | 164 | "- Use Azure AI service to extract OCR results from reference documents (if needed).\n",
|
161 | 165 | "- Generate a reference `.jsonl` file.\n",
|
162 | | - "- Upload these files to the designated Azure blob storage.\n", |
163 | | - "\n", |
164 | | - "We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the Prerequisites step.\n", |
165 | | - "\n" |
| 166 | + "- Upload these files to the designated Azure blob storage.\n" |
166 | 167 | ]
|
167 | 168 | },
|
168 | 169 | {
|
|
172 | 173 | "outputs": [],
|
173 | 174 | "source": [
|
174 | 175 | "# Load reference storage configuration from environment\n",
|
175 | | - "REFERENCE_DOC_SAS_URL = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n", |
176 | | - "REFERENCE_DOC_PATH = os.getenv(\"REFERENCE_DOC_PATH\")" |
| 176 | + "reference_doc_path = os.getenv(\"REFERENCE_DOC_PATH\")\n", |
| 177 | + "\n", |
| 178 | + "reference_doc_sas_url = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n", |
| 179 | + "if not reference_doc_sas_url:\n", |
| 180 | + " REFERENCE_DOC_STORAGE_ACCOUNT_NAME = os.getenv(\"REFERENCE_DOC_STORAGE_ACCOUNT_NAME\")\n", |
| 181 | + " REFERENCE_DOC_CONTAINER_NAME = os.getenv(\"REFERENCE_DOC_CONTAINER_NAME\")\n", |
| 182 | + " if REFERENCE_DOC_STORAGE_ACCOUNT_NAME and REFERENCE_DOC_CONTAINER_NAME:\n", |
| 183 | + " from azure.storage.blob import ContainerSasPermissions\n", |
| 184 | + " # We will need \"Write\" for uploading, modifying, or appending blobs\n", |
| 185 | + " reference_doc_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n", |
| 186 | + " account_name=REFERENCE_DOC_STORAGE_ACCOUNT_NAME,\n", |
| 187 | + " container_name=REFERENCE_DOC_CONTAINER_NAME,\n", |
| 188 | + " permissions=ContainerSasPermissions(read=True, write=True, list=True),\n", |
| 189 | + " expiry_hours=1,\n", |
| 190 | + " )" |
177 | 191 | ]
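If you are curious what `AzureContentUnderstandingClient.generate_temp_container_sas_url` has to produce, a container-level SAS URL can also be built directly with `azure-storage-blob`. The sketch below is an assumption about equivalent behaviour, not the helper's actual implementation; the account and container names are placeholders, and `DefaultAzureCredential` is assumed to have the data-plane role needed to request a user delegation key.

```python
# Sketch: build a temporary container SAS URL without the sample helper.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)

account_name = "<storage-account-name>"  # placeholder
container_name = "<container-name>"      # placeholder
account_url = f"https://{account_name}.blob.core.windows.net"

service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
start = datetime.now(timezone.utc)
expiry = start + timedelta(hours=1)

# A user delegation key avoids handling the storage account key directly.
delegation_key = service.get_user_delegation_key(start, expiry)
sas_token = generate_container_sas(
    account_name=account_name,
    container_name=container_name,
    user_delegation_key=delegation_key,
    permission=ContainerSasPermissions(read=True, write=True, list=True),
    expiry=expiry,
    start=start,
)
container_sas_url = f"{account_url}/{container_name}?{sas_token}"
```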
|
178 | 192 | },
|
179 | 193 | {
|
|
193 | 207 | "# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix \".result.json\"\n",
|
194 | 208 | "# For example, if the original document is \"invoice.pdf\", the OCR result file should be named \"invoice.pdf.result.json\"\n",
|
195 | 209 | "# NOTE: Please comment out the following line if you don't have any reference documents.\n",
|
196 | | - "await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)" |
| 210 | + "await client.generate_knowledge_base_on_blob(reference_docs, reference_doc_sas_url, reference_doc_path, skip_analyze=False)" |
197 | 211 | ]
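The `.result.json` convention from the comments above is easy to verify locally before uploading. This sketch only reuses the `reference_docs` folder variable passed to `generate_knowledge_base_on_blob` above; the reading of `skip_analyze=False` (missing OCR results are produced by the service) follows the markdown description earlier in this section.

```python
# Sketch: list which reference documents already have a matching OCR result file.
# Convention: "invoice.pdf" pairs with "invoice.pdf.result.json".
from pathlib import Path

for doc in sorted(Path(reference_docs).iterdir()):
    if doc.name.endswith(".result.json"):
        continue  # skip the OCR result files themselves
    ocr_result = doc.with_name(doc.name + ".result.json")
    status = "found" if ocr_result.exists() else "missing (analyzed when skip_analyze=False)"
    print(f"{doc.name} -> {ocr_result.name}: {status}")
```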
|
198 | 212 | },
|
199 | 213 | {
|
|
203 | 217 | "## Create analyzer with defined schema for Pro mode\n",
|
204 | 218 | "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a name relevant to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
|
205 | 219 | "\n",
|
206 | | - "We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in the [.env](./.env) file and used in the previous step." |
| 220 | + "We use **reference_doc_sas_url** and **reference_doc_path**, which were prepared from the [.env](./.env) configuration in the previous step." |
207 | 221 | ]
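The "unique suffix" mentioned above follows the same pattern the second sample uses for `CUSTOM_ANALYZER_ID_2` later in the notebook. A minimal sketch; the `pro-mode-sample-` prefix is taken from that later cell and is otherwise arbitrary.

```python
# Sketch: generate a unique analyzer ID so this cell can be rerun safely.
import uuid

CUSTOM_ANALYZER_ID = "pro-mode-sample-" + str(uuid.uuid4())
print(CUSTOM_ANALYZER_ID)  # e.g. pro-mode-sample-1f6f2c3a-...
```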
|
208 | 222 | },
|
209 | 223 | {
|
|
218 | 232 | "response = client.begin_create_analyzer(\n",
|
219 | 233 | " CUSTOM_ANALYZER_ID,\n",
|
220 | 234 | " analyzer_template_path=analyzer_template,\n",
|
221 | | - " pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,\n", |
222 | | - " pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,\n", |
| 235 | + " pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n", |
| 236 | + " pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path,\n", |
223 | 237 | ")\n",
|
224 | 238 | "result = client.poll_result(response)\n",
|
225 | 239 | "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
|
|
332 | 346 | "reference_docs_2 = \"../data/field_extraction_pro_mode/insurance_claims_review/reference_docs\"\n",
|
333 | 347 | "\n",
|
334 | 348 | "# Load reference storage configuration from environment\n",
|
335 | | - "REFERENCE_DOC_SAS_URL_2 = os.getenv(\"REFERENCE_DOC_SAS_URL\") # Reuse the same blob container\n", |
336 | | - "REFERENCE_DOC_PATH_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\" # NOTE: Use a different path for the second sample\n", |
| 349 | + "reference_doc_path_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\" # NOTE: Use a different path for the second sample\n", |
337 | 350 | "CUSTOM_ANALYZER_ID_2 = \"pro-mode-sample-\" + str(uuid.uuid4())"
|
338 | 351 | ]
|
339 | 352 | },
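One note on the derived path above: `os.getenv("REFERENCE_DOC_PATH").rstrip("/")` raises `AttributeError` if the variable is unset. A hedged sketch of a more forgiving variant; defaulting to an empty string is a choice, not something the sample prescribes.

```python
# Sketch: derive the second sample's path with a guard for an unset variable.
base_path = os.getenv("REFERENCE_DOC_PATH", "")
reference_doc_path_2 = base_path.rstrip("/") + "_2/"
# e.g. REFERENCE_DOC_PATH="reference_docs/"  ->  reference_doc_path_2 == "reference_docs_2/"
```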
|
|
352 | 365 | "outputs": [],
|
353 | 366 | "source": [
|
354 | 367 | "logging.info(\"Start generating knowledge base for the second sample...\")\n",
|
355 | | - "await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)" |
| 368 | + "# Reuse the same blob container\n", |
| 369 | + "await client.generate_knowledge_base_on_blob(reference_docs_2, reference_doc_sas_url, reference_doc_path_2, skip_analyze=True)" |
356 | 370 | ]
|
357 | 371 | },
|
358 | 372 | {
|
|
372 | 386 | "response = client.begin_create_analyzer(\n",
|
373 | 387 | " CUSTOM_ANALYZER_ID_2,\n",
|
374 | 388 | " analyzer_template_path=analyzer_template_2,\n",
|
375 | | - " pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,\n", |
376 | | - " pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,\n", |
| 389 | + " pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n", |
| 390 | + " pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path_2,\n", |
377 | 391 | ")\n",
|
378 | 392 | "result = client.poll_result(response)\n",
|
379 | 393 | "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
|
|