
Commit 3788347

aainav269 (Aaina Vannan) authored:

Vannanaaina/oss di to cu migration tool (#29)

* added di to cu migration tool code
* Update README.md - need to finish things of note
* Update README.md
* Update README.md
* added sample documents
* removed new data
* changed analyze to use key and not id
* resolving PR comments
* changed outputFile to output_file
* removed content understanding from host, as customers might not know to add this in
* resolved Tim's comments
* resolving Thomas' comments
* made changes to implementation per Paul's comments
* Update README.md need to finish readme still
* Update README.md need to add stuff about how to get source and target info
* analyzer_result.json is a CLI variable
* Update README.md
* Update README.md
* Update README.md
* Update README.md based on Yu-Yun's comments
* Update README.md
* Update README.md did Yu-Yun's suggestions, not Paul's yet
* added assets
* Update README.md
* updated endpoint
* Update README.md setup done
* updated endpoint
* Update README.md
* added in other images
* Update README.md
* Update README.md
* Update README.md
* updated to show no subscription id
* Update README.md
* using analyzer for individual file now
* Update README.md
* Update README.md
* Update README.md
* adjusting storage browser
* Update README.md updated until setup per Aditi's comments
* Update README.md
* updated endpoint
* Update README.md
* updated generate sas
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* Update README.md

Co-authored-by: Aaina Vannan <[email protected]>

1 parent c989350, commit 3788347

30 files changed: 13,420 additions and 0 deletions
.sample_env: 6 additions & 0 deletions

@@ -0,0 +1,6 @@
# Rename to .env
HOST="<fill in your target endpoint here>"

API_VERSION="2025-05-01-preview"

SUBSCRIPTION_KEY="<fill in your API Key here>" # This is your API Key if you have one, or your Subscription ID
README.md: 154 additions & 0 deletions

@@ -0,0 +1,154 @@
# Document Intelligence to Content Understanding Migration Tool (Python)

Welcome! We've created this tool to help convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as seen in AI Foundry. The following DI versions are supported:

- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (seen in Document Intelligence Studio) --> DI-version = neural
- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (seen in AI Foundry/AI Services/Vision + Document/Document Field Extraction) --> DI-version = generative

To identify which version of Document Intelligence your dataset uses, consult the sample documents provided under this folder and determine which format matches yours. You can also identify the version through your DI project's UX. For instance, Custom Extraction DI 3.1/4.0 GA is part of Document Intelligence Studio (i.e., https://documentintelligence.ai.azure.com/studio), while Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry as a preview service (i.e., https://ai.azure.com/explore/aiservices/vision/document/extraction).

To migrate from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset to a CU-compatible format. Once converted, you have the option to create a Content Understanding analyzer trained on the converted CU dataset, and you can then test that analyzer to verify its quality.
## Details About the Tools

Here is a breakdown of the three CLI tools and their capabilities:

* **di_to_cu_converter.py**:
  * This CLI tool performs the first step of migration: it reads your labeled Document Intelligence dataset and converts it into a CU-compatible dataset. The tool maps the following files accordingly: fields.json to analyzer.json, DI labels.json to CU labels.json, and ocr.json to result.json.
  * Depending on the DI version you are migrating from, the tool uses [cu_converter_neural.py](cu_converter_neural.py) or [cu_converter_generative.py](cu_converter_generative.py) to convert your fields.json and labels.json files.
  * For OCR conversion, the tool creates a sample CU analyzer to gather raw OCR results via an Analyze request for each original file in the DI dataset. Since the sample analyzer contains no fields, the resulting result.json files contain no fields either. For more details, please refer to [get_ocr.py](get_ocr.py).
* **create_analyzer.py**:
  * Once the dataset is converted to CU format, this CLI tool creates a CU analyzer from the converted dataset.
* **call_analyze.py**:
  * This CLI tool can be used to verify that the migration completed successfully and to test the quality of the previously created analyzer.
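As a rough illustration of the file mapping described above, here is a minimal sketch. The suffixes are taken from the description above and the function is purely illustrative; the tool's real conversion logic lives in the scripts linked above.

```python
from pathlib import Path

# Hypothetical illustration of the DI -> CU file-name mapping; the actual
# conversion (including content rewriting) is done by di_to_cu_converter.py.
FILE_MAPPING = {
    "fields.json": "analyzer.json",  # field schema -> CU analyzer definition
    "labels.json": "labels.json",    # per-document labels, converted in place
    "ocr.json": "result.json",       # raw OCR -> CU analyze result
}

def target_name(di_file: Path) -> str:
    """Return the CU-format file name for a DI dataset file (illustrative only)."""
    for di_suffix, cu_suffix in FILE_MAPPING.items():
        if di_file.name.endswith(di_suffix):
            return di_file.name[: -len(di_suffix)] + cu_suffix
    return di_file.name  # original documents (e.g., PDFs) keep their names
```
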
## Setup

To set up this tool, complete the following steps:

1. Install the needed dependencies via **pip install -r ./requirements.txt**
2. Rename the file **.sample_env** to **.env**
3. Replace the following values in the **.env** file:
   - **HOST:** Update this to your Azure AI service endpoint.
     - Ex: "https://sample-azure-ai-resource.services.ai.azure.com"
     - Avoid the "/" at the end.

     ![Alt text](assets/sample-azure-resource.png "Azure AI Service")
     ![Alt text](assets/endpoint.png "Azure AI Service Endpoints")
   - **SUBSCRIPTION_KEY:** Update this to your Azure AI service's API Key or Subscription ID to identify and authenticate the API request.
     - You can locate your API Key here: ![Alt text](assets/endpoint-with-keys.png "Azure AI Service Endpoints With Keys")
     - If you are using AAD, please refer to your Subscription ID: ![Alt text](assets/subscription-id.png "Azure AI Service Subscription ID")
   - **API_VERSION:** This version ensures that you are converting the dataset to CU Preview.2. No changes are needed here.
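If you want to sanity-check your configuration before running the tools, here is a minimal sketch using python-dotenv. Whether the scripts load .env exactly this way is an assumption, so treat it purely as a local check:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads HOST, API_VERSION, SUBSCRIPTION_KEY from .env

host = os.environ["HOST"]
assert not host.endswith("/"), "HOST should not end with a trailing slash"
print(f"Endpoint: {host}, API version: {os.environ['API_VERSION']}")
```
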
## How to Locate Your Document Field Extraction Dataset for Migration

To migrate your Document Field Extraction dataset from AI Foundry, please follow the steps below:

1. On the bottom left of your Document Field Extraction project page, select "Management Center."

   ![Alt text](assets/management-center.png "Management Center")
2. On the Management Center page, select "View All" from the Connected Resources section.

   ![Alt text](assets/connected-resources.png "Connected Resources")
3. Within these resources, look for the resource with type "Azure Blob Storage." This resource's target URL contains the location of your dataset's storage account (in yellow) and blob container (in blue).

   ![Alt text](assets/manage-connections.png "Manage Connections")

Using these values, navigate to your blob container. Then, select the "labelingProjects" folder. From there, select the folder with the same name as the blob container. You'll find all the contents of your project in the "data" folder.

For example, the sample Document Field Extraction project is stored at the following location:

![Alt text](assets/azure-portal.png "Azure Portal")
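If you'd rather confirm this layout programmatically, here is a minimal sketch that lists the project contents with the azure-storage-blob SDK, assuming a container SAS URL with Read & List permissions (see the next section); the folder names come from the description above:

```python
from azure.storage.blob import ContainerClient

# Container SAS URL with Read & List permissions (placeholder values).
container = ContainerClient.from_container_url(
    "https://storageAccount.blob.core.windows.net/container?SASToken"
)

# Blob storage has no real folders; filter by name prefix instead.
for blob in container.list_blobs(name_starts_with="labelingProjects/"):
    print(blob.name)  # e.g. labelingProjects/<container-name>/data/...
```
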
## How to Find Your Source and Target SAS URLs

To run migration, you will need to specify the source SAS URL (the location of your Document Intelligence dataset) and the target SAS URL (the location for your Content Understanding dataset).

To locate the SAS URL for a file or folder for any container URL arguments, please follow these steps:

1. Navigate to your storage account in the Azure Portal, and from the left pane, select "Storage Browser."

   ![Alt text](assets/storage-browser.png "Storage Browser")
2. Select the source/target blob container, i.e., where your DI dataset is present or where your CU dataset will be. Click on the extended menu on the side and select "Generate SAS."

   ![Alt text](assets/generate-sas.png "Generate SAS")
3. Configure the permissions and expiry for your SAS URL accordingly.

   For the DI source dataset, please select these permissions: _**Read & List**_

   For the CU target dataset, please select these permissions: _**Read, Add, Create, & Write**_

Once configured, select "Generate SAS Token and URL" and copy the URL shown under "Blob SAS URL."

![Alt text](assets/generate-sas-pop-up.png "Generate SAS Pop-Up")

Notes:

- Since the SAS URL does not point to a specific folder, to ensure the correct path for source and target, please specify the correct dataset folder as --source-blob-folder or --target-blob-folder.
- To get the SAS URL for a single file, navigate to the specific file and repeat the steps above, as shown here:

  ![Alt text](assets/individual-file-generate-sas.png "Generate SAS for Individual File")
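If you prefer to script this instead of clicking through the portal, container SAS URLs can also be generated with the azure-storage-blob SDK. A minimal sketch, assuming you have the storage account key at hand (the account, container, and key values below are placeholders):

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerSasPermissions, generate_container_sas

account = "sourceStorageAccount"   # placeholder
container = "sourceContainer"      # placeholder
account_key = "<storage account key>"

# Read & List is enough for the DI source container; use
# ContainerSasPermissions(read=True, add=True, create=True, write=True)
# for the CU target container, per the permissions listed above.
sas_token = generate_container_sas(
    account_name=account,
    container_name=container,
    account_key=account_key,
    permission=ContainerSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=8),
)
source_sas_url = f"https://{account}.blob.core.windows.net/{container}?{sas_token}"
print(source_sas_url)
```
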
## How to Run

To run the three tools, please refer to the following commands. For better readability, they are split across multiple lines; please remove this extra spacing before execution.

_**NOTE:** Wrap URLs in double quotes ("")._

### 1. Converting Document Intelligence to Content Understanding Dataset

If you are migrating a _DI 3.1/4.0 GA Custom Extraction_ dataset, please run this command:

```
python ./di_to_cu_converter.py --DI-version neural --analyzer-prefix mySampleAnalyzer
    --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName
    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
```

For migration of Custom Extraction DI 3.1/4.0 GA, specifying an analyzer prefix is required for creating a CU analyzer: since the fields.json defines no "doc_type" for identification, the created analyzer's ID will be the specified analyzer prefix.

If you are migrating a _DI 4.0 Preview Document Field Extraction_ dataset, please run this command:

```
python ./di_to_cu_converter.py --DI-version generative --analyzer-prefix mySampleAnalyzer
    --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName
    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
```

For migration of Document Field Extraction DI 4.0 Preview, specifying an analyzer prefix is optional. However, if you wish to create multiple analyzers from the same analyzer.json, please add an analyzer prefix. If provided, the analyzer ID becomes analyzer-prefix_doc-type; otherwise, it remains the doc_type from the fields.json.

_**NOTE:** You can only create one analyzer per analyzer ID._
### 2. Creating an Analyzer

To create an analyzer using the converted CU analyzer.json, please run this command:

```
python ./create_analyzer.py
    --analyzer-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer/cuDatasetFolderName/analyzer.json?targetSASToken"
    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken"
    --target-blob-folder cuDatasetFolderName
```

The analyzer.json file is stored in the specified target blob container and folder; please get the SAS URL for the analyzer.json file from there.

Additionally, please use the analyzer ID from this tool's output when running call_analyze.py.

Ex:

![Alt text](assets/analyzer.png "Sample Analyzer Creation")
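Under the hood, creating an analyzer is a REST call against the endpoint shown in the Possible Issues section below. Here is a minimal sketch of that request, assuming the converted analyzer.json is sent as the request body and the API key goes in the standard Ocp-Apim-Subscription-Key header; create_analyzer.py remains the authoritative implementation:

```python
import json
import os

import requests

host = os.environ["HOST"]                # e.g. https://sample-azure-ai-resource.services.ai.azure.com
api_version = os.environ["API_VERSION"]  # 2025-05-01-preview
analyzer_id = "mySampleAnalyzer"         # must satisfy the naming rules below

with open("analyzer.json") as f:         # the converted analyzer definition
    analyzer_definition = json.load(f)

# PUT https://{host}/contentunderstanding/analyzers/{analyzerId}?api-version=...
resp = requests.put(
    f"{host}/contentunderstanding/analyzers/{analyzer_id}",
    params={"api-version": api_version},
    headers={"Ocp-Apim-Subscription-Key": os.environ["SUBSCRIPTION_KEY"]},
    json=analyzer_definition,
)
resp.raise_for_status()  # a 409 here means the analyzer ID is already taken
```
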
### 3. Running Analyze

To analyze a specific PDF or original file, please run this command:

```
python ./call_analyze.py --analyzer-id mySampleAnalyzer
    --pdf-sas-url "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken"
    --output-json "./desired-path-to-analyzer-results.json"
```

For the --analyzer-id argument, please use the analyzer ID created in the previous step. Specifying --output-json is optional; the default output location is "./sample_documents/analyzer_result.json".
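For reference, the analyze call targets the :analyze route shown under Possible Issues. A minimal sketch, assuming the document SAS URL is passed in the request body as {"url": ...} and that the service follows the usual Azure asynchronous pattern of an Operation-Location header to poll; call_analyze.py is the authoritative implementation:

```python
import os
import time

import requests

host = os.environ["HOST"]
headers = {"Ocp-Apim-Subscription-Key": os.environ["SUBSCRIPTION_KEY"]}
pdf_sas_url = "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken"

# POST https://{host}/contentunderstanding/analyzers/{analyzerId}:analyze?api-version=...
resp = requests.post(
    f"{host}/contentunderstanding/analyzers/mySampleAnalyzer:analyze",
    params={"api-version": os.environ["API_VERSION"]},
    headers=headers,
    json={"url": pdf_sas_url},  # assumed body shape; see call_analyze.py
)
resp.raise_for_status()

# Poll the operation until it completes (assumed async pattern).
operation_url = resp.headers["Operation-Location"]
while True:
    result = requests.get(operation_url, headers=headers).json()
    if result.get("status", "").lower() not in ("running", "notstarted"):
        break
    time.sleep(2)
print(result)
```
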
## Possible Issues

These are some issues that you might run into when creating an analyzer or running analyze.

### Creating an Analyzer

For any **400** error, please validate the following:

- You are using a valid endpoint. Example: _https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview_
- Your converted CU dataset meets the latest naming constraints. Please ensure that all the fields in your analyzer.json file meet these requirements; if not, please make the changes manually. (See the validation sketch after this list.)
  - Field name starts with a letter or an underscore
  - Field name length is between 1 and 64 characters
  - Field name only uses letters, numbers, and underscores
- Your analyzer ID meets these naming requirements:
  - ID is between 1 and 64 characters long
  - ID only uses letters, numbers, dots, underscores, and hyphens
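A quick way to pre-check names before creating the analyzer; the two patterns below simply encode the constraints listed above:

```python
import re

# Starts with a letter or underscore; total length 1-64;
# letters, numbers, and underscores only.
FIELD_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,63}$")

# 1-64 characters: letters, numbers, dots, underscores, and hyphens.
ANALYZER_ID = re.compile(r"^[A-Za-z0-9._-]{1,64}$")

assert FIELD_NAME.match("invoice_total")
assert not FIELD_NAME.match("1st_field")  # may not start with a number
assert ANALYZER_ID.match("mySampleAnalyzer_receipt-v1.0")
```
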
A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the specified endpoint.

A **409** error implies that the analyzer ID has already been used to create an analyzer. Please try using another ID.

### Calling Analyze

- A **400** error implies a potentially incorrect endpoint or SAS URL. Ensure that your endpoint is valid _(https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview)_ and that you are using the correct SAS URL for the document under analysis.
- A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the specified endpoint.
- A **404** error implies that no analyzer exists with the analyzer ID you specified. Mitigate it by calling with the correct ID or by creating an analyzer with that ID.
## Points to Note

1. Make sure to use Python version 3.9 or above.
2. Signature field types (such as in the previous versions of DI) are not supported in Content Understanding yet. Thus, during migration, these signature fields will be ignored when creating the analyzer.
3. The content of training documents will be retained in Content Understanding model metadata, specifically under storage. Additional explanation can be found here: https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
4. All data conversion targets the Content Understanding Preview.2 version only.
8 more files changed (3 additions & 0 deletions each); diffs not shown.
