
Commit 450d2b5

Add SAS URL Auto Generation (#60)
1 parent edfaa21 commit 450d2b5

4 files changed (+155, −49 lines)

docs/set_env_for_training_data_and_reference_doc.md

Lines changed: 37 additions & 13 deletions
@@ -6,23 +6,47 @@ Folders [document_training](../data/document_training/) and [field_extraction_pr
 2. *Install Azure Storage Explorer:* Azure Storage Explorer is a tool which makes it easy to work with Azure Storage data. Install it and login with your credential, follow the [guide](https://aka.ms/download-and-install-Azure-Storage-Explorer).
 3. *Create or Choose a Blob Container:* Create a blob container from Azure Storage Explorer or use an existing one.
    <img src="./create-blob-container.png" width="600" />
-4. *Generate a Shared Access Signature (SAS) URL:*
-   - Right-click on blob container and select the `Get Shared Access Signature...` in the menu.
-   - Check the required permissions: `Read`, `Write` and `List`
-   - Click the `Create` button.
-   <img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
-5. *Copy the SAS URL:* After creating the SAS, click `Copy` to get the URL with token. This will be used as the value for **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
-   <img src="./copy-access-signature.png" width="600" />
-6. *Set Environment Variables in ".env" File:* Depending on the sample that you will run, you will need to set required environment variables in [.env](../notebooks/.env).
-   > NOTE: **REFERENCE_DOC_SAS_URL** can be the same as the **TRAINING_DATA_SAS_URL** to re-use the same blob container
-   - [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as value of **TRAINIGN_DATA_SAS_URL**, and a prefix for **TRAINING_DATA_PATH**. You can choose any folder name you like for **TRAINING_DATA_PATH**. For example, you could use "training_files".
+4. *Set SAS URL Related Environment Variables in ".env" File:* Depending on the sample you run, set the required environment variables in [.env](../notebooks/.env). There are two options for providing the required Shared Access Signature (SAS) URL.
+   - Option A - Generate a SAS URL manually in Azure Storage Explorer
+     - Right-click on the blob container and select `Get Shared Access Signature...` from the menu.
+     - Check the required permissions: `Read`, `Write`, and `List`.
+       - `Write` is needed for uploading, modifying, or appending blobs.
+     - Click the `Create` button.
+       <img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
+     - *Copy the SAS URL:* After creating the SAS, click `Copy` to get the URL with token. This will be used as the value for **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
+       <img src="./copy-access-signature.png" width="600" />
+
+     - Set the following in [.env](../notebooks/.env).
+       > NOTE: **REFERENCE_DOC_SAS_URL** can be the same as **TRAINING_DATA_SAS_URL** to re-use the same blob container
+       - For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as the value of **TRAINING_DATA_SAS_URL**.
+         ```env
+         TRAINING_DATA_SAS_URL=<Blob container SAS URL>
+         ```
+       - For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as the value of **REFERENCE_DOC_SAS_URL**.
+         ```env
+         REFERENCE_DOC_SAS_URL=<Blob container SAS URL>
+         ```
+   - Option B - Auto-generate the SAS URL via code in the sample notebooks
+     - Instead of manually creating a SAS URL, you can set the storage account and container information and let the code generate a temporary SAS URL at runtime.
+       > NOTE: **TRAINING_DATA_STORAGE_ACCOUNT_NAME** and **TRAINING_DATA_CONTAINER_NAME** can be the same as **REFERENCE_DOC_STORAGE_ACCOUNT_NAME** and **REFERENCE_DOC_CONTAINER_NAME** to re-use the same blob container
+     - For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the storage account name as `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `TRAINING_DATA_CONTAINER_NAME`.
+       ```env
+       TRAINING_DATA_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
+       TRAINING_DATA_CONTAINER_NAME=<your-container-name>
+       ```
+     - For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the storage account name as `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `REFERENCE_DOC_CONTAINER_NAME`.
+       ```env
+       REFERENCE_DOC_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
+       REFERENCE_DOC_CONTAINER_NAME=<your-container-name>
+       ```
+
+5. *Set Folder Prefix in ".env" File:* Depending on the sample you run, set the required environment variables in [.env](../notebooks/.env).
+   - For [analyzer_training](../notebooks/analyzer_training.ipynb): Add a prefix for **TRAINING_DATA_PATH**. You can choose any folder name you like for **TRAINING_DATA_PATH**. For example, you could use "training_files".
     ```env
-    TRAINING_DATA_SAS_URL=<Blob container SAS URL>
     TRAINING_DATA_PATH=<Designated folder path under the blob container>
     ```
-   - [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as value of **REFERENCE_DOC_SAS_URL**, and a prefix for **REFERENCE_DOC_PATH**. You can choose any folder name you like for **REFERENCE_DOC_PATH**. For example, you could use "reference_docs".
+   - For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add a prefix for **REFERENCE_DOC_PATH**. You can choose any folder name you like for **REFERENCE_DOC_PATH**. For example, you could use "reference_docs".
     ```env
-    REFERENCE_DOC_SAS_URL=<Blob container SAS URL>
     REFERENCE_DOC_PATH=<Designated folder path under the blob container>
     ```
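The implementation of `generate_temp_container_sas_url` is not shown in this diff. As a rough sketch of what Option B's runtime generation involves, the following assumes the `azure-identity` and `azure-storage-blob` packages are installed and the signed-in identity may request user delegation keys (for example, via the Storage Blob Data Contributor role); the repo's actual helper may differ, for instance by signing with an account key instead.

```python
# Rough sketch only; the repo's actual generate_temp_container_sas_url may differ.
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import (
    BlobServiceClient,
    ContainerSasPermissions,
    generate_container_sas,
)


def generate_temp_container_sas_url(
    account_name: str,
    container_name: str,
    permissions: ContainerSasPermissions,
    expiry_hours: int = 1,
) -> str:
    """Build a short-lived container SAS URL signed with a user delegation key."""
    account_url = f"https://{account_name}.blob.core.windows.net"
    service = BlobServiceClient(account_url, credential=DefaultAzureCredential())
    now = datetime.now(timezone.utc)
    expiry = now + timedelta(hours=expiry_hours)
    # A user delegation key avoids handling the storage account key directly.
    delegation_key = service.get_user_delegation_key(now, expiry)
    sas_token = generate_container_sas(
        account_name=account_name,
        container_name=container_name,
        user_delegation_key=delegation_key,
        permission=permissions,
        expiry=expiry,
    )
    return f"{account_url}/{container_name}?{sas_token}"
```

Called as in the notebooks, `generate_temp_container_sas_url(account, container, ContainerSasPermissions(read=True, write=True, list=True), expiry_hours=1)` yields a URL usable anywhere a manually created SAS URL is expected.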

notebooks/analyzer_training.ipynb

Lines changed: 34 additions & 18 deletions
@@ -23,12 +23,11 @@
 "\n",
 "## Prerequisites\n",
 "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
-"1. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related env variables `TRAINING_DATA_SAS_URL` and `TRAINING_DATA_PATH` into the [.env](./.env) file.\n",
-"   - `TRAINING_DATA_SAS_URL`: SAS URL for your Azure Blob container. \n",
-"   - `TRAINING_DATA_PATH`: Folder path within the container to upload training data. \n",
-"1. Install packages needed to run the sample\n",
-"\n",
-"\n"
+"2. Follow the steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) to add the training data related environment variables to the [.env](./.env) file.\n",
+"   - You can either set `TRAINING_DATA_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
+"   - Or set both `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`, so the SAS URL can be generated automatically in a later step.\n",
+"   - Also set `TRAINING_DATA_PATH` to specify the folder path within the container where training data will be uploaded.\n",
+"3. Install packages needed to run the sample\n"
 ]
},
{
@@ -119,11 +118,12 @@
 "metadata": {},
 "source": [
 "## Prepare labeled data\n",
-"In this step, we will \n",
-"- Check whether document files in local folder have corresponding `.labels.json` and `.result.json` files\n",
-"- Upload these files to the designated Azure blob storage.\n",
-"\n",
-"We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** that's set in the Prerequisites step."
+"In this step, we will\n",
+"- Use `TRAINING_DATA_PATH` and the SAS URL related environment variables that were set in the Prerequisites step.\n",
+"- Try to get the SAS URL from the environment variable `TRAINING_DATA_SAS_URL`.\n",
+"  If it is not set, we attempt to generate the SAS URL automatically using the environment variables `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and `TRAINING_DATA_CONTAINER_NAME`.\n",
+"- Verify that document files in the local folder have corresponding `.labels.json` and `.result.json` files.\n",
+"- Upload these files to the Azure Blob storage container specified by the environment variables."
 ]
},
{
@@ -132,10 +132,26 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"TRAINING_DATA_SAS_URL = os.getenv(\"TRAINING_DATA_SAS_URL\")\n",
-"TRAINING_DATA_PATH = os.getenv(\"TRAINING_DATA_PATH\")\n",
-"\n",
-"await client.generate_training_data_on_blob(training_docs_folder, TRAINING_DATA_SAS_URL, TRAINING_DATA_PATH)"
+"training_data_sas_url = os.getenv(\"TRAINING_DATA_SAS_URL\")\n",
+"if not training_data_sas_url:\n",
+"    TRAINING_DATA_STORAGE_ACCOUNT_NAME = os.getenv(\"TRAINING_DATA_STORAGE_ACCOUNT_NAME\")\n",
+"    TRAINING_DATA_CONTAINER_NAME = os.getenv(\"TRAINING_DATA_CONTAINER_NAME\")\n",
+"    if not (TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME):\n",
+"        raise ValueError(\n",
+"            \"Please set either TRAINING_DATA_SAS_URL or both TRAINING_DATA_STORAGE_ACCOUNT_NAME and TRAINING_DATA_CONTAINER_NAME environment variables.\"\n",
+"        )\n",
+"    from azure.storage.blob import ContainerSasPermissions\n",
+"    # We will need \"Write\" for uploading, modifying, or appending blobs\n",
+"    training_data_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
+"        account_name=TRAINING_DATA_STORAGE_ACCOUNT_NAME,\n",
+"        container_name=TRAINING_DATA_CONTAINER_NAME,\n",
+"        permissions=ContainerSasPermissions(read=True, write=True, list=True),\n",
+"        expiry_hours=1,\n",
+"    )\n",
+"\n",
+"training_data_path = os.getenv(\"TRAINING_DATA_PATH\")\n",
+"\n",
+"await client.generate_training_data_on_blob(training_docs_folder, training_data_sas_url, training_data_path)"
 ]
},
{
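An aside, not part of this commit: a pasted or generated SAS URL can be smoke-tested before the upload step. This sketch assumes `azure-storage-blob` is installed and exercises only the Read and List permissions requested above.

```python
# Illustrative check, not in the notebook: list blobs through the SAS URL to
# confirm it grants List access before the upload step runs.
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_container_url(training_data_sas_url)
for blob in container_client.list_blobs(name_starts_with=training_data_path):
    print(blob.name)
```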
@@ -145,7 +161,7 @@
 "## Create analyzer with defined schema\n",
 "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name for your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
 "\n",
-"We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** that's set up in the [.env](./.env) file and used in the previous step."
+"We use **training_data_sas_url** and **training_data_path**, which were set up from the [.env](./.env) file and used in the previous step."
 ]
},
{
@@ -160,8 +176,8 @@
 "response = client.begin_create_analyzer(\n",
 "    CUSTOM_ANALYZER_ID,\n",
 "    analyzer_template_path=analyzer_template,\n",
-"    training_storage_container_sas_url=TRAINING_DATA_SAS_URL,\n",
-"    training_storage_container_path_prefix=TRAINING_DATA_PATH,\n",
+"    training_storage_container_sas_url=training_data_sas_url,\n",
+"    training_storage_container_path_prefix=training_data_path,\n",
 ")\n",
 "result = client.poll_result(response)\n",
 "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",

notebooks/field_extraction_pro_mode.ipynb

Lines changed: 32 additions & 18 deletions
@@ -28,9 +28,10 @@
 "source": [
 "## Prerequisites\n",
 "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
-"1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.\n",
-"   - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container.\n",
-"   - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs.\n",
+"1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up the reference document related environment variables in the [.env](./.env) file.\n",
+"   - You can either set `REFERENCE_DOC_SAS_URL` directly with the SAS URL for your Azure Blob container,\n",
+"   - Or set both `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`, so the SAS URL can be generated automatically in a later step.\n",
+"   - Also set `REFERENCE_DOC_PATH` to specify the folder path within the container where reference documents will be uploaded.\n",
 "   > ⚠️ Note: Reference documents are optional in Pro mode. You can run Pro mode using just input documents. For example, the service can reason across two or more input files even without any reference data.\n",
 "1. Install the required packages to run the sample."
 ]
@@ -157,12 +158,12 @@
 "source": [
 "## Prepare reference data\n",
 "In this step, we will \n",
+"- Use `REFERENCE_DOC_PATH` and the SAS URL related environment variables that were set in the Prerequisites step.\n",
+"- Try to get the SAS URL from the environment variable `REFERENCE_DOC_SAS_URL`.\n",
+"  If it is not set, we attempt to generate the SAS URL automatically using the environment variables `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and `REFERENCE_DOC_CONTAINER_NAME`.\n",
 "- Use Azure AI service to extract OCR results from reference documents (if needed).\n",
 "- Generate a reference `.jsonl` file.\n",
-"- Upload these files to the designated Azure blob storage.\n",
-"\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set in the Prerequisites step.\n",
-"\n"
+"- Upload these files to the designated Azure blob storage.\n"
 ]
},
{
@@ -172,8 +173,21 @@
 "outputs": [],
 "source": [
 "# Load reference storage configuration from environment\n",
-"REFERENCE_DOC_SAS_URL = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n",
-"REFERENCE_DOC_PATH = os.getenv(\"REFERENCE_DOC_PATH\")"
+"reference_doc_path = os.getenv(\"REFERENCE_DOC_PATH\")\n",
+"\n",
+"reference_doc_sas_url = os.getenv(\"REFERENCE_DOC_SAS_URL\")\n",
+"if not reference_doc_sas_url:\n",
+"    REFERENCE_DOC_STORAGE_ACCOUNT_NAME = os.getenv(\"REFERENCE_DOC_STORAGE_ACCOUNT_NAME\")\n",
+"    REFERENCE_DOC_CONTAINER_NAME = os.getenv(\"REFERENCE_DOC_CONTAINER_NAME\")\n",
+"    if REFERENCE_DOC_STORAGE_ACCOUNT_NAME and REFERENCE_DOC_CONTAINER_NAME:\n",
+"        from azure.storage.blob import ContainerSasPermissions\n",
+"        # We will need \"Write\" for uploading, modifying, or appending blobs\n",
+"        reference_doc_sas_url = AzureContentUnderstandingClient.generate_temp_container_sas_url(\n",
+"            account_name=REFERENCE_DOC_STORAGE_ACCOUNT_NAME,\n",
+"            container_name=REFERENCE_DOC_CONTAINER_NAME,\n",
+"            permissions=ContainerSasPermissions(read=True, write=True, list=True),\n",
+"            expiry_hours=1,\n",
+"        )"
 ]
},
{
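Unlike the training notebook, this cell raises no error when neither the SAS URL nor the account/container pair is set; reference documents are optional in Pro mode, so `reference_doc_sas_url` may simply remain `None`. A hypothetical guard, not part of this commit, could make the later upload step explicit about that:

```python
# Hypothetical guard, not in the notebook: skip the knowledge-base upload when
# no reference-doc storage is configured (Pro mode can run without it).
if reference_doc_sas_url:
    await client.generate_knowledge_base_on_blob(
        reference_docs, reference_doc_sas_url, reference_doc_path, skip_analyze=False
    )
else:
    logging.info("No reference document storage configured; continuing without reference docs.")
```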
@@ -193,7 +207,7 @@
 "# Please name the OCR result files with the same name as the original document files including its extension, and add the suffix \".result.json\"\n",
 "# For example, if the original document is \"invoice.pdf\", the OCR result file should be named \"invoice.pdf.result.json\"\n",
 "# NOTE: Please comment out the following line if you don't have any reference documents.\n",
-"await client.generate_knowledge_base_on_blob(reference_docs, REFERENCE_DOC_SAS_URL, REFERENCE_DOC_PATH, skip_analyze=False)"
+"await client.generate_knowledge_base_on_blob(reference_docs, reference_doc_sas_url, reference_doc_path, skip_analyze=False)"
 ]
},
{
@@ -203,7 +217,7 @@
 "## Create analyzer with defined schema for Pro mode\n",
 "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name for your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
 "\n",
-"We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in the [.env](./.env) file and used in the previous step."
+"We use **reference_doc_sas_url** and **reference_doc_path**, which were set up from the [.env](./.env) file and used in the previous step."
 ]
},
{
@@ -218,8 +232,8 @@
 "response = client.begin_create_analyzer(\n",
 "    CUSTOM_ANALYZER_ID,\n",
 "    analyzer_template_path=analyzer_template,\n",
-"    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL,\n",
-"    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH,\n",
+"    pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n",
+"    pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path,\n",
 ")\n",
 "result = client.poll_result(response)\n",
 "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
@@ -332,8 +346,7 @@
 "reference_docs_2 = \"../data/field_extraction_pro_mode/insurance_claims_review/reference_docs\"\n",
 "\n",
 "# Load reference storage configuration from environment\n",
-"REFERENCE_DOC_SAS_URL_2 = os.getenv(\"REFERENCE_DOC_SAS_URL\")  # Reuse the same blob container\n",
-"REFERENCE_DOC_PATH_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\"  # NOTE: Use a different path for the second sample\n",
+"reference_doc_path_2 = os.getenv(\"REFERENCE_DOC_PATH\").rstrip(\"/\") + \"_2/\"  # NOTE: Use a different path for the second sample\n",
 "CUSTOM_ANALYZER_ID_2 = \"pro-mode-sample-\" + str(uuid.uuid4())"
 ]
},
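For illustration, with a hypothetical `REFERENCE_DOC_PATH` of "reference_docs/", the derivation above produces a sibling prefix rather than a nested one:

```python
# "reference_docs/".rstrip("/") == "reference_docs"; appending "_2/" gives:
assert "reference_docs/".rstrip("/") + "_2/" == "reference_docs_2/"
```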
@@ -352,7 +365,8 @@
 "outputs": [],
 "source": [
 "logging.info(\"Start generating knowledge base for the second sample...\")\n",
-"await client.generate_knowledge_base_on_blob(reference_docs_2, REFERENCE_DOC_SAS_URL_2, REFERENCE_DOC_PATH_2, skip_analyze=True)"
+"# Reuse the same blob container\n",
+"await client.generate_knowledge_base_on_blob(reference_docs_2, reference_doc_sas_url, reference_doc_path_2, skip_analyze=True)"
 ]
},
{
@@ -372,8 +386,8 @@
 "response = client.begin_create_analyzer(\n",
 "    CUSTOM_ANALYZER_ID_2,\n",
 "    analyzer_template_path=analyzer_template_2,\n",
-"    pro_mode_reference_docs_storage_container_sas_url=REFERENCE_DOC_SAS_URL_2,\n",
-"    pro_mode_reference_docs_storage_container_path_prefix=REFERENCE_DOC_PATH_2,\n",
+"    pro_mode_reference_docs_storage_container_sas_url=reference_doc_sas_url,\n",
+"    pro_mode_reference_docs_storage_container_path_prefix=reference_doc_path_2,\n",
 ")\n",
 "result = client.poll_result(response)\n",
 "if result is not None and \"status\" in result and result[\"status\"] == \"Succeeded\":\n",
