Review chienyuanchang/change_convertor_readme_format-python/di_to_cu_migration_tool/README.md #71

Open
wants to merge 1 commit into
base: chienyuanchang/change_convertor_readme_format
from

chienyuanchang
Collaborator

Automated review and documentation improvements for python/di_to_cu_migration_tool/README.md on branch chienyuanchang/change_convertor_readme_format

LLM usage details:

  • Total tokens: 5567
  • Prompt tokens: 3016
  • Completion tokens: 2551
  • Used deployment: gpt-4.1-mini-yslin-dev-exp
  • API version: 2024-12-01-preview

Collaborator Author

@chienyuanchang chienyuanchang left a comment

Automated LLM code review (section-based).

LLM usage details:

  • Total tokens used: 9925.
  • Used deployment: gpt-4.1-mini-yslin-dev-exp
  • API version: 2024-12-01-preview

For migration from these DI versions to Content Understanding Preview.2, this tool first needs to convert the DI dataset to a CU compatible format. Once converted, you have the option to create a Content Understanding Analyzer, which will be trained on the converted CU dataset. Additionally, you can further test this model to ensure its quality.
To identify the version of your Document Intelligence dataset, consult the sample documents in this folder to match your format. You can also verify the version by reviewing your DI project's user experience: for example, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is available only on Azure AI Foundry preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction).

For migrating from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset into a CU-compatible format. After conversion, you can create a Content Understanding Analyzer trained on the converted CU dataset and test it to validate its quality.

Collaborator Author

  • categories: [Clarity, Grammar, Consistency]

    • change: Rewrote the introductory sentence to a more concise and active voice ("We've created this tool to help..." → "This tool helps...").
    • rationale: The new phrasing is clearer and more direct, improving readability.
    • impact: Enhances reader comprehension and engagement by using a straightforward description.
  • categories: [Formatting, Clarity]

    • change: Reformatted the list of supported DI versions by removing redundant spacing and aligning descriptions (removing "seen in" phrasing, adjusting spacing around slashes).
    • rationale: The new formatting standardizes the presentation and removes clutter, making the versions easier to scan.
    • impact: Improves document visual consistency and facilitates quick identification of supported versions.
  • categories: [Clarity, Grammar]

    • change: Revised instructions for version identification with simpler sentence structures and active voice ("To help you identify..." → "To identify..."). Also clarified the explanation about UX references.
    • rationale: Simplifies complex sentences and improves comprehension by removing unnecessary words.
    • impact: Readers can more easily follow instructions to identify their dataset version.
  • categories: [Grammar, Clarity]

    • change: Changed passive and conditional phrasing to active and definitive ("this tool first needs to convert" → "this tool first converts") and combined the explanation about creating and testing a CU analyzer into a more fluid sentence.
    • rationale: Active voice and concise phrasing make the instructions clearer and more assertive.
    • impact: Users get a clearer understanding of the migration workflow without ambiguity.
  • categories: [Formatting]

    • change: Added a blank line between paragraphs, improving separation of ideas.
    • rationale: Enhances readability by visually distinguishing separate points.
    • impact: Easier for readers to parse and understand distinct informational sections.

* After converting the dataset to CU format, this CLI tool creates a CU analyzer referring to the converted dataset.

* **call_analyze.py**
* This CLI tool tests that the migration completed successfully and assesses the quality of the created analyzer.

Collaborator Author

  • categories: [Clarity, Formatting]

    • change: Rephrased introductory sentence to a more concise and direct statement.
    • rationale: The original opening was verbose and less structured; the revision simplifies the introduction to improve readability.
    • impact: Provides a clearer and more approachable introduction to the section, helping readers quickly understand the purpose.
  • categories: [Formatting, Consistency]

    • change: Converted the bullet points describing each CLI tool from asterisks with indented subitems to a cleaner nested bullet list format with hyphens and arrows.
    • rationale: The new format increases readability by visually distinguishing main points from subpoints and making file mappings clearer.
    • impact: Enhances scannability, allowing users to better grasp file mappings and tool functionality at a glance.
  • categories: [Grammar, Clarity]

    • change: Improved phrasing around the use of conversion scripts depending on DI version from a somewhat awkward sentence to a more straightforward conditional statement.
    • rationale: The revision makes the dependency on DI version and choice of conversion script clearer and easier to understand.
    • impact: Reduces ambiguity about which scripts are used and under what conditions, aiding user comprehension.
  • categories: [Grammar, Clarity, Formatting]

    • change: Reworked the explanation about OCR conversion and sample analyzers to be more precise and divided into clearer sentences.
    • rationale: The original contained run-on constructions and less structured explanation; the update breaks down complex information into digestible parts.
    • impact: Makes technical details about OCR processing easier to follow and references to additional files more explicit.
  • categories: [Clarity, Formatting]

    • change: Simplified the descriptions of the second and third CLI tools by shortening sentences and making purpose statements more direct.
    • rationale: The revision removes unnecessary wording and clarifies the tools' functions.
    • impact: Provides users with a succinct, clear understanding of what each tool does without extraneous detail.

- **SUBSCRIPTION_KEY:** Update to your Azure AI Service API Key or Subscription ID to authenticate the API requests.
- Locate your API Key here: ![Azure AI Service Endpoints With Keys](assets/endpoint-with-keys.png)
- If using Azure Active Directory (AAD), refer to your Subscription ID: ![Azure AI Service Subscription ID](assets/subscription-id.png)
- **API_VERSION:** This is preset to the CU Preview.2 version; no changes are needed.
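
As a rough illustration of the `.env` step discussed here, the following sketch parses `KEY=VALUE` lines and reports which required settings are still empty. It is not the tool's own loader: the `KEY=VALUE` file layout, the helper names `parse_env_text` and `missing_settings`, and the choice of required keys (taken only from the two settings visible in this hunk) are assumptions.

```python
# Hypothetical helper, not part of the migration tool.
# Assumes the .env file uses plain KEY=VALUE lines.
REQUIRED_KEYS = ("SUBSCRIPTION_KEY", "API_VERSION")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_settings(parse_env_text(open(".env").read()))` before running the CLI tools would surface an unset key early, rather than as an authentication error later.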

Collaborator Author

  • categories: [Clarity, Formatting]

    • change: Rephrased the introductory sentence from "To set up this tool, you will need to do the following steps:" to "Follow these steps to set up the tool:" and reformatted the installation command into an indented code block.
    • rationale: The new phrasing is more direct and reader-friendly; formatting the command as a code block improves readability and distinguishes it clearly as a command to run.
    • impact: Enhances user comprehension and makes the instruction visually clearer and easier to follow.
  • categories: [Clarity, Consistency, Formatting]

    • change: Left "Rename the file .sample_env to .env" unchanged; the main changes are in the subsequent steps.
    • rationale: This line needed no edit, but restructuring the steps that follow it improves the overall flow.
  • categories: [Clarity, Formatting, Consistency]

    • change: Reformatted and clarified the instructions for modifying the .env file: changed "Replace the following values in the .env file:" to "Edit the .env file to update the following values:"; updated bullet points to have consistent indentation, clearer language, and inline code-style quotations for examples; replaced verbose explanations with more concise phrases; improved markdown image syntax by adding alt text and removing redundant titles.
    • rationale: These changes make the instructions easier to read and understand, maintain consistent style and tone, and improve the visual presentation of example values and images.
    • impact: Enhances the user’s ability to quickly grasp which values to update and where to find them, while ensuring the documentation looks cleaner and more professional.

**Notes:**
- SAS URLs do not specify a specific folder. To ensure the correct paths for source and target datasets, specify the dataset folder using `--source-blob-folder` and `--target-blob-folder`.
- To generate a SAS URL for a specific file, navigate directly to that file and repeat the process, for example:
![Generate SAS for Individual File](assets/individual-file-generate-sas.png)
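
To make the first note concrete: a container-level SAS URL ends at the container name, so the folder has to be supplied separately. The sketch below shows one way a folder and file name could be spliced into such a URL while keeping the SAS query string intact. The function name `blob_url_in_folder` and the sample account, container, and token values are hypothetical, and this is not necessarily how the tool combines `--source-blob-folder` or `--target-blob-folder` with the SAS URL internally.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_url_in_folder(container_sas_url, folder, blob_name):
    """Insert folder/blob_name into the path of a container SAS URL.

    The SAS token lives in the query string, so only the path changes.
    """
    parts = urlsplit(container_sas_url)
    pieces = [parts.path.rstrip("/")]
    # Skip empty segments so an empty folder does not produce '//'.
    pieces += [p for p in (folder.strip("/"), blob_name) if p]
    return urlunsplit(
        (parts.scheme, parts.netloc, "/".join(pieces), parts.query, parts.fragment)
    )


# Placeholder values for illustration only.
example = blob_url_in_folder(
    "https://myaccount.blob.core.windows.net/mycontainer?sv=2024&sig=abc",
    "myfolder",
    "fields.json",
)
```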

Collaborator Author

  • categories: [Grammar, Clarity, Formatting]

    • change: Rewrote step-by-step instructions for migrating the Document Field Extraction dataset with numbered steps and clearer language. Removed redundant phrasing ("please follow the steps below") and replaced bullet points with numbered lists. Replaced quotes around UI elements with bold text and corrected spacing and punctuation.
    • rationale: Numbered steps and consistent bolding of UI elements improve readability and make the instructions easier to follow. Eliminating informal language ("please") and redundant phrases tightens the text.
    • impact: Enhances clarity and professional tone, making the documentation more user-friendly and visually structured.
  • categories: [Formatting, Consistency, Clarity]

    • change: Updated image markdown to include descriptive alt text matching the image title and removed redundant "Alt text" labels. Also corrected image captions to be concise and match UI element names exactly (e.g., "Management Center" instead of "Alt text").
    • rationale: Proper descriptive alt text improves accessibility and ensures images are clearly identified, and consistency in captioning aligns with common documentation standards.
    • impact: Improves accessibility and maintains uniform style throughout the document.
  • categories: [Clarity, Consistency]

    • change: Streamlined explanation of how to locate Azure Blob Storage resource URLs by clarifying terminology (e.g., "resource's target URL contains your dataset’s storage account") without mentioning colors ("in yellow"/"in blue"). Adjusted wording for easier comprehension and active voice.
    • rationale: Removing references to colors not visible in all versions and making sentences active voice avoids confusion and enhances understanding.
    • impact: Readers can more easily and accurately follow instructions without dependence on highlighting colors.
  • categories: [Formatting, Clarity]

    • change: Reformatted examples and folder navigation instructions into clearer, concise sentences with proper bolding for folder names rather than quotes.
    • rationale: Clear formatting and proper emphasis on folder names reduce ambiguity.
    • impact: Users better understand where to find project contents.
  • categories: [Grammar, Clarity, Formatting]

    • change: Rewrote the section on generating SAS URLs with numbered steps, consistent bold UI labels, bullet lists for permission settings, and clearer separation between concepts (source vs target datasets). Images reordered appropriately with contextual captions.
    • rationale: Clear numbering and bullet lists distinctly break down instructions, reducing confusion about permissions and process steps.
    • impact: Improves comprehension of the SAS URL generation process and permission requirements.
  • categories: [Clarity, Formatting]

    • change: Converted notes section into bullet points with bolded "Notes," improved wording to clarify that SAS URLs do not specify folders and that specific folder parameters must be set. Also improved phrasing on how to generate SAS for individual files with better image caption consistency.
    • rationale: Structured notes improve scanning and highlight critical caveats. Clear wording helps prevent user errors.
    • impact: Ensures users are aware of important limitations and correct usage, reducing chances of mistakes during migration.

Additionally, specifying --output-json isn't necessary. The default location for the output is "./sample_documents/analyzer_result.json."
For `--analyzer-id`, use the analyzer ID created in the prior step.

Specifying `--output-json` is optional; if omitted, the default output location is `./sample_documents/analyzer_result.json`.
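
For readers unfamiliar with how an optional flag with a default behaves, here is a minimal `argparse` sketch. It only mirrors the two flags and the default path named above; it is an illustrative stand-in, not the actual argument handling in `call_analyze.py`, and `build_parser` is a hypothetical name.

```python
import argparse

# Default path taken from the README text under review.
DEFAULT_OUTPUT = "./sample_documents/analyzer_result.json"


def build_parser():
    """Build a parser with a required analyzer ID and an optional output path."""
    parser = argparse.ArgumentParser(
        description="Illustrative stand-in for an analyze CLI"
    )
    parser.add_argument(
        "--analyzer-id",
        required=True,
        help="ID of the analyzer created in the prior step",
    )
    parser.add_argument(
        "--output-json",
        default=DEFAULT_OUTPUT,
        help="Where to write the result (optional)",
    )
    return parser


# Omitting --output-json falls back to DEFAULT_OUTPUT.
args = build_parser().parse_args(["--analyzer-id", "my-analyzer"])
```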

Collaborator Author

  • categories: [Clarity, Formatting]

    • change: Reworded introductory sentences to combine related instructions and clarify that line breaks must be removed before execution.
    • rationale: The original version was split awkwardly and the message on quoting URLs was separated; combining these improves flow and user understanding.
    • impact: Users get a clearer, more compact instruction on how to run the commands and the importance of quoting URLs.
  • categories: [Clarity, Consistency]

    • change: Changed section titles from gerunds ("Converting," "Creating," "Running") to imperative ("Convert," "Create," "Run") for consistency and action orientation.
    • rationale: Imperative verbs in instructional documents improve clarity by emphasizing the action users need to perform.
    • impact: Creates a consistent, direct tone across all steps, making the instructions easier to follow.
  • categories: [Formatting, Clarity]

    • change: Converted multiline command examples from split lines without backslashes to properly escaped multiline commands inside fenced code blocks with line continuation backslashes (\).
    • rationale: Proper line continuations are necessary for shell usage; showing them explicitly prevents user errors.
    • impact: Enhances readability while preserving executable command formatting, reducing execution mistakes.
  • categories: [Clarity]

    • change: Improved explanations about the importance and effect of the --analyzer-prefix flag for different migration scenarios, rephrasing for clearer distinction between required and optional usage and how analyzer IDs are formed.
    • rationale: The original phrasing was verbose and somewhat ambiguous regarding when the prefix is mandatory and what ID will be generated. The revision presents these details more straightforwardly.
    • impact: Users better understand how to use the prefix option and the implications for analyzer IDs.
  • categories: [Clarity]

    • change: Reworded notes about limitations (e.g., “only one analyzer per analyzer ID”) to use more direct language and consistent formatting with bold and italic.
    • rationale: A clearer, emphasized note improves user attention to critical restrictions.
    • impact: Reduces the chance of violating constraints due to misunderstanding.
  • categories: [Clarity, Formatting]

    • change: Reorganized instructions for the analyzer creation step to show commands in fenced code blocks with line continuations and rephrased accompanying text to be more concise and active voice.
    • rationale: Enhances readability and usability of the instructions by making the commands copy-paste friendly and the explanations direct.
    • impact: Streamlines the setup process and reduces user effort.
  • categories: [Clarity]

    • change: Clarified instructions on obtaining the SAS URL for analyzer.json and using the analyzer ID output in the next steps, including improving the associated caption of the image.
    • rationale: The original text had scattered instructions and an incomplete caption. The revision consolidates these pragmatically.
    • impact: Users better understand how to retrieve necessary values and what to do next, improving workflow.
  • categories: [Formatting, Clarity]

    • change: Reformatted analyze command example into proper fenced code blocks with line continuation, rephrased instructions to use direct language and clearer parameter explanations.
    • rationale: Consistency in command formatting improves user experience; clearer parameter notes reduce confusion.
    • impact: Simplifies comprehension and execution of analyzing commands.
  • categories: [Clarity]

    • change: Explicitly stated that the --output-json parameter is optional and what the default output path is if omitted.
    • rationale: Previously, this was indirect; being explicit avoids uncertainty.
    • impact: Reduces user guesswork about the output file location when running the analyze command.

2. Signature field types (e.g., in previous DI versions) are not yet supported in Content Understanding. These will be ignored during migration when creating the analyzer.
3. The content of training documents is retained in Content Understanding model metadata, under storage specifically. More details at:
https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
4. All conversions are for Content Understanding preview.2 version only.

Collaborator Author

  • categories: [Clarity, Formatting]

    • change: Rewrote the introductory sentence about common issues encountered when creating an analyzer or running analysis.
    • rationale: The original sentence was informal and less precise; the new version is clearer and more direct.
    • impact: Provides a more professional and concise introduction, improving readability.
  • categories: [Clarity, Formatting]

    • change: Reformatted and rephrased the guidance under "400 error" to a bulleted list with clear subpoints about endpoint validity and naming constraints.
    • rationale: The original text was less structured and harder to scan; the new format uses indentation and code blocks for better presentation.
    • impact: Enhances the user's ability to quickly identify and understand prerequisites and naming rules related to 400 errors.
  • categories: [Clarity, Consistency]

    • change: Expanded HTTP status codes with descriptions (e.g., "400 Bad Request", "401 Unauthorized", "409 Conflict") and provided consistent explanations for each error code.
    • rationale: Original error explanations were brief and inconsistent; the new format standardizes terminology and explanation style.
    • impact: Improves uniformity, making error handling guidance easier to follow and more professional in tone.
  • categories: [Formatting, Clarity]

    • change: Rewrote the "Calling Analyze" error section using bulleted error codes with detailed explanations and example URLs, replacing the prior inline explanatory sentences.
    • rationale: The previous format was less scannable and lacked consistent emphasis on error codes.
    • impact: Users can more quickly find relevant troubleshooting information, improving usability.
  • categories: [Formatting, Consistency]

    • change: Transformed the "Points to Note" section from a mix of prose and inconsistent numbering to a clear, numbered list with properly formatted items, including a corrected numbering sequence.
    • rationale: The prior version had inconsistent item numbering and mixed formatting, which reduced clarity.
    • impact: Enhances readability and helps users easily digest important notes.
  • categories: [Grammar, Clarity]

    • change: Corrected some grammatical structures and rephrased sentences for smoother reading (e.g., changing "Make sure to use Python version 3.9 or above" to "Use Python version 3.9 or higher").
    • rationale: The new phrasing is more concise and formal.
    • impact: Improves professionalism and readability of the documentation.
  • categories: [Clarity]

    • change: Added explicit references to "Authentication failure" and "Verify your API key and/or subscription ID" in 401 error explanations for both creating an analyzer and calling analyze.
    • rationale: Original explanations implied authentication failure but were less specific.
    • impact: Provides clearer troubleshooting steps, improving user understanding.
  • categories: [Clarity, Formatting]

    • change: Added backticks around URLs in examples to visually distinguish them as code or literal strings.
    • rationale: Helps users identify URLs clearly and recognize them as input to be copied or referenced.
    • impact: Enhances user experience by improving visual clarity of critical information.


For migration from these DI versions to Content Understanding Preview.2, this tool first needs to convert the DI dataset to a CU compatible format. Once converted, you have the option to create a Content Understanding Analyzer, which will be trained on the converted CU dataset. Additionally, you can further test this model to ensure its quality.
To identify the version of your Document Intelligence dataset, consult the sample documents in this folder to match your format. You can also verify the version by reviewing your DI project's user experience: for example, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is available only on Azure AI Foundry preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction).

Contributor

"You can also verify the version by reviewing your DI project's user experience. For instance, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry's preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction)."

For migration from these DI versions to Content Understanding Preview.2, this tool first needs to convert the DI dataset to a CU compatible format. Once converted, you have the option to create a Content Understanding Analyzer, which will be trained on the converted CU dataset. Additionally, you can further test this model to ensure its quality.
To identify the version of your Document Intelligence dataset, consult the sample documents in this folder to match your format. You can also verify the version by reviewing your DI project's user experience: for example, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is available only on Azure AI Foundry preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction).

For migrating from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset into a CU-compatible format. After conversion, you can create a Content Understanding Analyzer trained on the converted CU dataset and test it to validate its quality.

Contributor

After conversion, you can create a Content Understanding Analyzer that is trained on your converted CU dataset. Additionally, you have the option to test its quality against any sample documents.


_**NOTE:** You are only allowed to create one analyzer per analyzer ID._
For this migration, specifying an analyzer prefix is optional. However, to create multiple analyzers from the same analyzer.json, add an analyzer prefix. If provided, the analyzer ID becomes `analyzer-prefix_doc-type`; otherwise, it remains as the `doc_type` in fields.json.

Contributor

However, to create multiple analyzers from the same analyzer.json, you will need to add an analyzer prefix.
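
The ID rule described in this thread can be sketched as a small function: with a prefix the ID is `analyzer-prefix_doc-type`, without one it is the `doc_type` from fields.json. The function name `build_analyzer_id` and the sample values are hypothetical; the tool's own implementation may differ in detail.

```python
def build_analyzer_id(doc_type, analyzer_prefix=None):
    """Return the analyzer ID for a doc type, with an optional prefix."""
    if analyzer_prefix:
        # Prefix and doc type are joined with an underscore.
        return f"{analyzer_prefix}_{doc_type}"
    return doc_type


# Two prefixes give two distinct IDs for the same analyzer.json,
# which is what the one-analyzer-per-ID rule requires.
first_id = build_analyzer_id("invoice", "teamA")
second_id = build_analyzer_id("invoice", "teamB")
```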

Validate the following:
- The endpoint URL is valid. Example:
`https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview`
- Your converted CU dataset respects the naming constraints below. If needed, manually correct `analyzer.json` fields:

Contributor

If needed, please manually correct the analyzer.json fields.
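
For the endpoint check in the same hunk, a quick way to see the expected shape is to assemble the URL from its parts and compare it with what is being sent. This sketch follows the example URL quoted above; the function name `analyzer_endpoint` is hypothetical and the arguments are placeholders.

```python
def analyzer_endpoint(endpoint, analyzer_id, api_version):
    """Assemble an analyzer URL in the shape shown in the README example."""
    base = endpoint.rstrip("/")  # tolerate a trailing slash on the endpoint
    return (
        f"{base}/contentunderstanding/analyzers/{analyzer_id}"
        f"?api-version={api_version}"
    )


url = analyzer_endpoint("https://yourEndpoint/", "yourAnalyzerID", "2025-05-01-preview")
```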

`https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview`
- Your converted CU dataset respects the naming constraints below. If needed, manually correct `analyzer.json` fields:
- Field names start with a letter or underscore
- Field name lengths are between 1 and 64 characters

Contributor

Field name length is between 1 and 64 characters

- Field names start with a letter or underscore
- Field name lengths are between 1 and 64 characters
- Only letters, numbers, and underscores are allowed
- Analyzer ID meets naming requirements:

Contributor

Your Analyzer ID meets these naming requirements

- Field name lengths are between 1 and 64 characters
- Only letters, numbers, and underscores are allowed
- Analyzer ID meets naming requirements:
- Length between 1 and 64 characters

Contributor

ID length is between 1 and 64 characters
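
The field-name constraints quoted in these threads (starts with a letter or underscore, 1 to 64 characters, only letters, numbers, and underscores) can be checked before creating an analyzer. The sketch below is one possible pre-check, assuming "letters" means ASCII letters; the helper names are hypothetical, and the analyzer ID rules are not covered because only the length requirement is visible here.

```python
import re

# First character: letter or underscore; up to 63 more letters, digits,
# or underscores, for a total length of 1 to 64.
# ASCII-only letters are an assumption, not confirmed by the README.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_valid_field_name(name):
    """Return True if the name satisfies the quoted field-name constraints."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


def invalid_field_names(names):
    """Return the names that would need manual correction in analyzer.json."""
    return [name for name in names if not is_valid_field_name(name)]


flagged = invalid_field_names(["invoice_total", "1st_item", "total-amount", "_id"])
```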

5. All the data conversion will be for Content Understanding preview.2 version only.

- **400 Bad Request**:
Possibly incorrect endpoint or SAS URL. Ensure your endpoint is valid:

Contributor

This implies that your endpoint or SAS URL might be incorrect. Please ensure that your endpoint is valid and that you are using the correct SAS URL for the document.

Confirm you are using the correct SAS URL for the document.

- **401 Unauthorized**:
Authentication failure. Verify your API key and/or subscription ID.

Contributor

This implies an authentication failure. Please verify your API Key and/or your Subscription ID.

Authentication failure. Verify your API key and/or subscription ID.

- **404 Not Found**:
Analyzer with the specified ID does not exist. Use the correct analyzer ID or create an analyzer with that ID.

Contributor

This implies that the analyzer with the specified ID does not exist. Please use the correct analyzer ID or create an analyzer with the specified ID.
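
The three status codes discussed in these threads could also be surfaced programmatically, for example by a wrapper script that prints a hint when a request fails. The mapping below only restates the guidance quoted above; the names `TROUBLESHOOTING_HINTS` and `hint_for_status` are hypothetical and nothing here is part of the tool itself.

```python
# Hints paraphrased from the README text under review.
TROUBLESHOOTING_HINTS = {
    400: "Possibly incorrect endpoint or SAS URL. Check both.",
    401: "Authentication failure. Verify your API key and/or subscription ID.",
    404: "Analyzer with the specified ID does not exist. "
         "Use the correct analyzer ID or create an analyzer with that ID.",
}


def hint_for_status(status_code):
    """Return the troubleshooting hint for a status code, if one is listed."""
    return TROUBLESHOOTING_HINTS.get(
        status_code, "No specific guidance is listed for this status code."
    )
```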


1. Use Python version 3.9 or higher.
2. Signature field types (e.g., in previous DI versions) are not yet supported in Content Understanding. These will be ignored during migration when creating the analyzer.
3. The content of training documents is retained in Content Understanding model metadata, under storage specifically. More details at:

Contributor

The content of your training documents is retained in the CU model's metadata, under storage specifically. You can find more details at:

Contributor

I noticed that you have removed all instances of "please." If this is still going out to customers, it would be nice to include this sort of friendly wording :)

2 participants