Review main-docs/set_env_for_training_data_and_reference_doc.md #72

Open
wants to merge 1 commit into
base: main
72 changes: 41 additions & 31 deletions docs/set_env_for_training_data_and_reference_doc.md
Original file line number Diff line number Diff line change
@@ -1,55 +1,65 @@
# Set env variables for training data and reference doc for Pro mode
Folders [document_training](../data/document_training/) and [field_extraction_pro_mode](../data/field_extraction_pro_mode) contain the manually labeled data for training and reference doc for Pro mode as a quick sample. Before using these knowledge source files, you need an Azure Storage blob container to store them. Let's follow below steps to prepare the data environment:

1. *Create an Azure Storage Account:* If you don’t already have one, follow the guide to [create an Azure Storage Account](https://aka.ms/create-a-storage-account).
> If you already have an account, you can skip this step.
2. *Install Azure Storage Explorer:* Azure Storage Explorer is a tool which makes it easy to work with Azure Storage data. Install it and login with your credential, follow the [guide](https://aka.ms/download-and-install-Azure-Storage-Explorer).
3. *Create or Choose a Blob Container:* Create a blob container from Azure Storage Explorer or use an existing one.
<img src="./create-blob-container.png" width="600" />
4. *Set SAS URL Related Environment Variables in ".env" File:* Depending on the sample that you will run, you will need to set required environment variables in [.env](../notebooks/.env). There are two options to set up environment variables to utilize required Shared Access Signature (SAS) URL.
- Option A - Generate a SAS URL manually on Azure Storage Explorer
- Right-click on blob container and select the `Get Shared Access Signature...` in the menu.
- Check the required permissions: `Read`, `Write` and `List`
- We will need `Write` for uploading, modifying, or appending blobs
- Click the `Create` button.
<img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
- *Copy the SAS URL:* After creating the SAS, click `Copy` to get the URL with token. This will be used as the value for **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
# Set Environment Variables for Training Data and Reference Documents in Pro Mode

The folders [document_training](../data/document_training/) and [field_extraction_pro_mode](../data/field_extraction_pro_mode) contain manually labeled data used for training and reference documents in Pro mode as quick samples. Before using these knowledge source files, you need an Azure Storage blob container to store them. Follow the steps below to prepare your data environment:

1. **Create an Azure Storage Account:**
If you don’t already have one, follow the guide to [create an Azure Storage Account](https://aka.ms/create-a-storage-account).
> If you already have an account, you can skip this step.

2. **Install Azure Storage Explorer:**
Azure Storage Explorer is a tool that simplifies working with Azure Storage data. Install it and log in with your credentials by following the [installation guide](https://aka.ms/download-and-install-Azure-Storage-Explorer).

3. **Create or Choose a Blob Container:**
Using Azure Storage Explorer, create a new blob container or select an existing one.
<img src="./create-blob-container.png" width="600" />

4. **Set SAS URL-related Environment Variables in the `.env` File:**
Depending on the sample you plan to run, configure the required environment variables in the [.env](../notebooks/.env) file. There are two options to set up environment variables that utilize the required Shared Access Signature (SAS) URL.

- **Option A - Generate a SAS URL Manually via Azure Storage Explorer**
- Right-click on the blob container and select **Get Shared Access Signature...** from the menu.
- Select the permissions: **Read**, **Write**, and **List**.
- Note: **Write** permission is required for uploading, modifying, or appending blobs.
- Click the **Create** button.
<img src="./get-access-signature.png" height="600" /> <img src="./choose-signature-options.png" height="600" />
- **Copy the SAS URL:** After creating the SAS, click **Copy** to get the URL with the token. This URL will be used as the value for either **TRAINING_DATA_SAS_URL** or **REFERENCE_DOC_SAS_URL** when running the sample code.
<img src="./copy-access-signature.png" width="600" />

- Set the following in [.env](../notebooks/.env).
> NOTE: **REFERENCE_DOC_SAS_URL** can be the same as the **TRAINING_DATA_SAS_URL** to re-use the same blob container
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as value of **TRAINIGN_DATA_SAS_URL**.
- Set the following variables in the [.env](../notebooks/.env) file:
> **Note:** The value for **REFERENCE_DOC_SAS_URL** can be the same as **TRAINING_DATA_SAS_URL** to reuse the same blob container.
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the SAS URL as the value of **TRAINING_DATA_SAS_URL**.
```env
Collaborator Author

  • categories: [Grammar, Clarity, Consistency, Formatting]

    • change: Revised the heading for better clarity and capitalized consistently ("Set Environment Variables for Training Data and Reference Documents in Pro Mode").
    • rationale: Improved readability and maintained consistent capitalization in section titles.
    • impact: Makes the heading clearer and more professional, helping readers quickly understand the section focus.
  • categories: [Grammar, Clarity]

    • change: Rewrote the introductory paragraph for smoother flow and more formal tone.
    • rationale: The original text was somewhat informal and had minor grammatical issues; the revised version improves readability and professionalism.
    • impact: Enhances user comprehension and presents instructions in a more polished manner.
  • categories: [Formatting, Consistency]

    • change: Added line breaks, bold formatting for step titles, and standardized punctuation and casing within numbered steps and bullet points.
    • rationale: Consistent formatting helps distinguish between steps and sub-steps, making instructions easier to follow.
    • impact: Improves document navigation and aids users in parsing multi-step instructions.
  • categories: [Clarity]

    • change: Clarified the description of using Azure Storage Explorer, including specifying login with credentials and linking to the installation guide more accurately.
    • rationale: More precise instructions reduce confusion for users unfamiliar with the tool.
    • impact: Provides clearer guidance to set up necessary tools, reducing potential setup errors.
  • categories: [Clarity, Formatting]

    • change: Adjusted bullet points describing the SAS generation steps to use bold text for menu items and permissions, included a note on the necessity of Write permission, and connected related images more cleanly.
    • rationale: Highlights key UI elements and permissions, making the process easier to follow visually.
    • impact: Users can more easily identify required actions and settings within Azure Storage Explorer.
  • categories: [Typo Fix, Consistency]

    • change: Corrected the environment variable name from TRAINIGN_DATA_SAS_URL to TRAINING_DATA_SAS_URL, and rephrased related sentences for consistency and clarity.
    • rationale: Fixing typos prevents user errors when configuring environment variables; consistent naming avoids confusion.
    • impact: Reduces risk of configuration mistakes, improving user experience and accuracy.
  • categories: [Formatting]

    • change: Reformatted code snippets and inline environment variable references for better visual distinction and readability.
    • rationale: Proper formatting helps separate code from explanatory text, aiding comprehension.
    • impact: Makes it easier for users to copy and understand configuration examples correctly.

TRAINING_DATA_SAS_URL=<Blob container SAS URL>
```
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as value of **REFERENCE_DOC_SAS_URL**.
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the SAS URL as the value of **REFERENCE_DOC_SAS_URL**.
```env
Collaborator Author

  • categories: [Grammar, Clarity]
    • change: Added the definite article "the" before "value" in the phrase describing the SAS URL assignment.
    • rationale: Including "the" makes the instruction grammatically correct and clearer to the reader.
    • impact: This change improves the readability and professionalism of the documentation by providing a complete and precise description.

REFERENCE_DOC_SAS_URL=<Blob container SAS URL>
```
- Option B - Auto-generate the SAS URL via code in sample notebooks
- Instead of manually creating a SAS URL, you can set storage account and container information, and let the code generate a temporary SAS URL at runtime.
> NOTE: **TRAINING_DATA_STORAGE_ACCOUNT_NAME** and **TRAINING_DATA_CONTAINER_NAME** can be the same as the **REFERENCE_DOC_STORAGE_ACCOUNT_NAME** and **REFERENCE_DOC_CONTAINER_NAME** to re-use the same blob container
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the storage account name as `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `TRAINING_DATA_CONTAINER_NAME`.

- **Option B - Auto-generate the SAS URL via Code in Sample Notebooks**
- Instead of manually creating a SAS URL, you can specify the storage account and container information and let the code generate a temporary SAS URL at runtime.
> **Note:** **TRAINING_DATA_STORAGE_ACCOUNT_NAME** and **TRAINING_DATA_CONTAINER_NAME** can be the same as **REFERENCE_DOC_STORAGE_ACCOUNT_NAME** and **REFERENCE_DOC_CONTAINER_NAME** to reuse the same blob container.
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add the storage account name as `TRAINING_DATA_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `TRAINING_DATA_CONTAINER_NAME`.
```env
Collaborator Author

  • categories: [Clarity, Formatting, Consistency]

    • change: Capitalized and bolded the option title, and added trailing double spaces to enable line breaks in markdown.
    • rationale: Enhances visual hierarchy by making the option heading more prominent and improves markdown rendering with explicit line breaks for better readability.
    • impact: Makes the section easier to scan and visually distinct, improving user comprehension.
  • categories: [Grammar, Clarity]

    • change: Reworded the sentence from "you can set storage account and container information, and let the code generate..." to "you can specify the storage account and container information and let the code generate..."
    • rationale: Slight rephrasing for smoother and clearer sentence flow.
    • impact: Improves the clarity and professionalism of the documentation.
  • categories: [Formatting, Consistency]

    • change: Reformatted the note from a block quote with "NOTE:" to one with bolded "Note:" and removed extra asterisks around variables; also removed repetitive "the" before variable names.
    • rationale: Aligns note styling with common markdown conventions and improves readability by reducing clutter.
    • impact: Provides a cleaner and more consistent look to notes, making important information easier to notice.
  • categories: [Grammar, Clarity]

    • change: Changed "re-use" to "reuse" (single word) within the note.
    • rationale: "Reuse" is the correct standard spelling.
    • impact: Correct spelling enhances professionalism and reduces distractions.
  • categories: [Formatting]

    • change: Added trailing double spaces at the end of some lines to enforce line breaks in rendered markdown.
    • rationale: Prevents markdown from merging lines into a single paragraph.
    • impact: Retains intended formatting, improving the visual structure of the documentation.

TRAINING_DATA_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
TRAINING_DATA_CONTAINER_NAME=<your-container-name>
```
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the storage account name as `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `REFERENCE_DOC_CONTAINER_NAME`.
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add the storage account name as `REFERENCE_DOC_STORAGE_ACCOUNT_NAME` and the container name under that storage account as `REFERENCE_DOC_CONTAINER_NAME`.
```env
Collaborator Author

  • categories: [Formatting]
    • change: Added two trailing spaces at the end of a line to enforce a line break before a code block.
    • rationale: In Markdown, trailing spaces are used to create a line break, ensuring that the preceding paragraph and the following code block are properly separated.
    • impact: This change improves the readability and visual structure of the documentation by correctly rendering the code block on a new line.

REFERENCE_DOC_STORAGE_ACCOUNT_NAME=<your-storage-account-name>
REFERENCE_DOC_CONTAINER_NAME=<your-container-name>
```

5. *Set Folder Prefix in ".env" File:* Depending on the sample that you will run, you will need to set required environment variables in [.env](../notebooks/.env).
- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add a prefix for **TRAINING_DATA_PATH**. You can choose any folder name you like for **TRAINING_DATA_PATH**. For example, you could use "training_files".
5. **Set Folder Prefixes in the `.env` File:**
Depending on the sample you will run, set the required environment variables in the [.env](../notebooks/.env) file.

- For [analyzer_training](../notebooks/analyzer_training.ipynb): Add a prefix for **TRAINING_DATA_PATH**. You can choose any folder name within the blob container. For example, use `training_files`.
```env
Collaborator Author

  • categories: [Grammar, Clarity, Formatting]

    • change: Rewrote the heading for step 5 to use bold formatting and corrected the phrasing from singular "Prefix" to plural "Prefixes". Added a line break for better readability after the heading.
    • rationale: Improved the grammatical accuracy by matching plural form with the content and enhanced readability by separating the heading from the body text.
    • impact: Makes the instructions clearer and visually easier to follow, improving user comprehension.
  • categories: [Clarity]

    • change: Simplified the sentence "Depending on the sample that you will run, you will need to set required environment variables..." to "Depending on the sample you will run, set the required environment variables..."
    • rationale: Removed unnecessary words to make the sentence more direct and concise.
    • impact: Enhances clarity and reduces reading complexity.
  • categories: [Clarity, Formatting]

    • change: Changed the example folder name from quoted string ("training_files") to inline code formatting (training_files). Added a clarification that the chosen folder name should be within the blob container. Added two spaces at the end of the example line to ensure a line break before the code block.
    • rationale: Using code formatting helps distinguish file or folder names from regular text. Specifying that the folder should be within the blob container clarifies the context for users. The line break improves layout and readability.
    • impact: Enhances user understanding and prevents confusion when setting folder paths; improves visual presentation of the documentation.

TRAINING_DATA_PATH=<Designated folder path under the blob container>
```
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add a prefix for **REFERENCE_DOC_PATH**. You can choose any folder name you like for **REFERENCE_DOC_PATH**. For example, you could use "reference_docs".
- For [field_extraction_pro_mode](../notebooks/field_extraction_pro_mode.ipynb): Add a prefix for **REFERENCE_DOC_PATH**. You can choose any folder name within the blob container. For example, use `reference_docs`.
```env
Collaborator Author

  • categories: [Clarity, Consistency, Formatting]
    • change: Replaced the vague phrase "any folder name you like" with the more precise "any folder name within the blob container" and changed the example folder name formatting from bold to inline code.
    • rationale: To clarify the scope of the folder naming (specifying it must be within the blob container) and to standardize the formatting of folder names using code style, which is common for paths.
    • impact: Enhances the reader's understanding by providing clearer guidance on folder location and improves readability and consistency of the documentation.

REFERENCE_DOC_PATH=<Designated folder path under the blob container>
```
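
As a rough illustration of step 5, the sketch below shows how a folder prefix such as `training_files` might be combined with a file name to form a blob name inside the container. The helper `blob_name` is hypothetical and is not taken from the sample notebooks; it only assumes that blob names use `/` as the virtual-folder separator.

```python
def blob_name(prefix: str, filename: str) -> str:
    """Join a folder prefix and a file name into a blob name (hypothetical helper)."""
    # Strip slashes on both ends so "training_files", "training_files/" and
    # "/training_files/" all produce the same blob name.
    cleaned = prefix.strip().strip("/")
    name = filename.strip().lstrip("/")
    if not name:
        raise ValueError("filename must not be empty")
    # An empty prefix means the file sits at the container root.
    return f"{cleaned}/{name}" if cleaned else name


if __name__ == "__main__":
    print(blob_name("training_files", "invoice.pdf"))
```

Normalizing the prefix this way means the value of **TRAINING_DATA_PATH** or **REFERENCE_DOC_PATH** can be written with or without a trailing slash.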

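Before running a notebook, it can help to confirm that the `.env` values from step 4 are usable. The following is a minimal sketch using only the Python standard library; it is not the notebooks' own loading logic. It assumes a plain `KEY=VALUE` file, assumes that a SAS URL (Option A) takes precedence over account and container names (Option B), and relies on the `sp` query parameter of a SAS token carrying the permission letters (`r`, `w`, `l` for Read, Write, List). The function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

REQUIRED_PERMISSIONS = {"r", "w", "l"}  # Read, Write, List (step 4, Option A)


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, since SAS URLs contain '=' themselves.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_sas_permissions(sas_url: str) -> set:
    """Return the required permission letters absent from the URL's `sp` field."""
    query = parse_qs(urlsplit(sas_url).query)
    granted = set("".join(query.get("sp", [])))
    return REQUIRED_PERMISSIONS - granted


def check_source(env: dict, prefix: str) -> str:
    """Describe how one sample is configured; prefix is TRAINING_DATA or REFERENCE_DOC."""
    sas_url = env.get(f"{prefix}_SAS_URL", "")
    if sas_url:
        missing = missing_sas_permissions(sas_url)
        if missing:
            return f"SAS URL lacks permissions: {sorted(missing)}"
        return "SAS URL set (Option A)"
    account = env.get(f"{prefix}_STORAGE_ACCOUNT_NAME", "")
    container = env.get(f"{prefix}_CONTAINER_NAME", "")
    if account and container:
        return "account and container set (Option B)"
    return "not configured"
```

For example, `check_source(parse_env(open(".env").read()), "TRAINING_DATA")` would report which option is in effect for the training sample. The check only inspects the URL text; it does not verify the signature or expiry against Azure.
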
Now, we have completed the preparation of the data environment. Next, we could create an analyzer through code.


Once these steps are completed, your data environment is ready. You can proceed to create an analyzer through code.
Collaborator Author

  • categories: [Grammar, Clarity]
    • change: Reworded the sentence from "Now, we have completed the preparation of the data environment. Next, we could create an analyzer through code." to "Once these steps are completed, your data environment is ready. You can proceed to create an analyzer through code."
    • rationale: The original sentence used less natural phrasing ("Next, we could") and a passive tone. The revision uses clearer, more direct language and better flow.
    • impact: Enhances readability and clarity, making the instructions more understandable and easier to follow.