|
| 1 | +# Document Intelligence to Content Understanding Migration Tool (Python) |
| 2 | + |
| 3 | +Welcome! We've created this tool to help convert your Document Intelligence (DI) datasets to Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as seen in AI Foundry. The following DI versions are supported: |
| 4 | +- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (seen in Document Intelligence Studio) --> DI-version = neural |
| 5 | +- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (seen in AI Foundry/AI Services/Vision + Document/Document Field Extraction) --> DI-version = generative |
| 6 | + |
| 7 | +To determine which version of Document Intelligence your dataset uses, compare it against the sample documents provided under this folder and see which format matches yours. You can also identify the version through your DI project's UX. For instance, Custom Extraction DI 3.1/4.0 GA is part of Document Intelligence Studio (i.e., https://documentintelligence.ai.azure.com/studio), and Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry as a preview service (i.e., https://ai.azure.com/explore/aiservices/vision/document/extraction). |
| 8 | + |
| 9 | +For migration from these DI versions to Content Understanding Preview.2, this tool first needs to convert the DI dataset to a CU compatible format. Once converted, you have the option to create a Content Understanding Analyzer, which will be trained on the converted CU dataset. Additionally, you can further test this model to ensure its quality. |
| 10 | + |
| 11 | +## Details About the Tools |
| 12 | +Here is a more detailed breakdown of each of the 3 CLI tools and their capabilities: |
| 13 | +* **di_to_cu_converter.py**: |
| 14 | + * This CLI tool performs the first step of the migration. It reads your labelled Document Intelligence dataset and converts it into a CU-compatible dataset. In the process, it maps the following files: fields.json to analyzer.json, DI labels.json to CU labels.json, and ocr.json to result.json. |
| 15 | + * Depending on the DI version you are migrating from, the tool uses either [cu_converter_neural.py](cu_converter_neural.py) or [cu_converter_generative.py](cu_converter_generative.py) to convert your fields.json and labels.json files. |
| 16 | + * For OCR conversion, the tool creates a sample CU analyzer to gather raw OCR results via an Analyze request for each original file in the DI dataset. Since the sample analyzer contains no fields, the resulting result.json files contain no fields either. For more details, please refer to [get_ocr.py](get_ocr.py). |
| 17 | +* **create_analyzer.py**: |
| 18 | + * Once the dataset is converted to CU format, this CLI tool creates a CU analyzer from the converted dataset. |
| 19 | +* **call_analyze.py**: |
| 20 | + * This CLI tool can be used to verify that the migration completed successfully and to test the quality of the previously created analyzer. |
| 21 | + |
| 22 | +## Setup |
| 23 | +To set up this tool, you will need to do the following steps: |
| 24 | +1. Install the required dependencies listed in requirements.txt via **pip install -r ./requirements.txt** |
| 25 | +2. Rename the file **.sample_env** to **.env** |
| 26 | +3. Replace the following values in the **.env** file: |
| 27 | + - **HOST:** Update this to your Azure AI service endpoint. |
| 28 | + - Ex: "https://sample-azure-ai-resource.services.ai.azure.com" |
| 29 | + - Do not include a trailing "/". |
| 30 | +  |
| 31 | +  |
| 32 | + - **SUBSCRIPTION_KEY:** Update this to your Azure AI Service's API Key or Subscription ID to identify and authenticate the API request. |
| 33 | + - You can locate your API KEY here:  |
| 34 | + - If you are using AAD, please refer to your Subscription ID:  |
| 35 | + - **API_VERSION:** This version ensures that you are converting the dataset to CU Preview.2. No changes are needed here. |
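To illustrate why the trailing "/" on **HOST** matters, here is a minimal sketch of how an analyzer endpoint could be composed from these settings. The function name `build_analyzer_url` is hypothetical and not part of the tool; the URL pattern is the one shown in the Possible Issues section, and the default API version is assumed to match the value shipped in **.sample_env**.

```python
def build_analyzer_url(host: str, analyzer_id: str,
                       api_version: str = "2025-05-01-preview") -> str:
    """Compose an analyzer endpoint from .env-style settings (illustrative only)."""
    # Strip whitespace and any trailing "/" so the path does not start with "//".
    base = host.strip().rstrip("/")
    return (
        f"{base}/contentunderstanding/analyzers/{analyzer_id}"
        f"?api-version={api_version}"
    )


# Example values taken from this README; the analyzer ID is a placeholder.
url = build_analyzer_url(
    "https://sample-azure-ai-resource.services.ai.azure.com/",
    "mySampleAnalyzer",
)
print(url)
```

With the trailing slash stripped, the same URL results whether or not **HOST** was entered with a "/" at the end.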
| 36 | + |
| 37 | +## How to Locate Your Document Field Extraction Dataset for Migration |
| 38 | +To migrate your Document Field Extraction dataset from AI Foundry, please follow the steps below: |
| 39 | +1. On the bottom left of your Document Field Extraction project page, please select "Management Center." |
| 40 | +  |
| 41 | +2. Now on the Management Center page, please select "View All" from the Connected Resources section. |
| 42 | +  |
| 43 | +3. Within these resources, look for the resource with type "Azure Blob Storage." This resource's target URL contains the location of your dataset's storage account (in yellow) and blob container (in blue). |
| 44 | +  |
| 45 | + Using these values, navigate to your blob container. Then, select the "labelingProjects" folder. From there, select the folder with the same name as the blob container. Here, you'll locate all the contents of your project in the "data" folder. |
| 46 | + |
| 47 | + For example, the sample Document Field Extraction project is stored at |
| 48 | +  |
| 49 | + |
| 50 | +## How to Find Your Source and Target SAS URLs |
| 51 | +To run migration, you will need to specify the source SAS URL (location of your Document Intelligence dataset) and target SAS URL (location for your Content Understanding dataset). |
| 52 | + |
| 53 | +To obtain the SAS URL for any of the container URL arguments, please follow these steps: |
| 54 | + |
| 55 | +1. Navigate to your storage account in Azure Portal, and from the left pane, select "Storage Browser." |
| 56 | +  |
| 57 | +2. Select the source blob container (where your DI dataset is stored) or the target blob container (where your CU dataset will be written). Click on the extended menu on the side and select "Generate SAS." |
| 58 | +  |
| 59 | +3. Configure the permissions and expiry for your SAS URL accordingly. |
| 60 | + |
| 61 | + For the DI source dataset, please select these permissions: _**Read & List**_ |
| 62 | + |
| 63 | + For the CU target dataset, please select these permissions: _**Read, Add, Create, & Write**_ |
| 64 | + |
| 65 | + Once configured, please select "Generate SAS Token and URL" & copy the URL shown under "Blob SAS URL." |
| 66 | + |
| 67 | +  |
| 68 | + |
| 69 | +Notes: |
| 70 | + |
| 71 | +- Since the container SAS URL does not point to a specific folder, specify the dataset folder via --source-blob-folder or --target-blob-folder to ensure the correct source and target paths. |
| 72 | +- To get the SAS URL for a single file, navigate to the specific file and repeat the steps above, such as: |
| 73 | +  |
| 74 | + |
| 75 | +## How to Run |
| 76 | +To run the 3 tools, please refer to the following commands. For readability, each command is split across several lines. Please join it onto a single line before execution. |
| 77 | + |
| 78 | +_**NOTE:** Enclose each URL in double quotes ("")._ |
| 79 | + |
| 80 | +### 1. Converting Document Intelligence to Content Understanding Dataset |
| 81 | + |
| 82 | +If you are migrating a _DI 3.1/4.0 GA Custom Extraction_ dataset, please run this command: |
| 83 | + |
| 84 | + python ./di_to_cu_converter.py --DI-version neural --analyzer-prefix mySampleAnalyzer |
| 85 | + --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName |
| 86 | + --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName |
| 87 | + |
| 88 | +For migration of Custom Extraction DI 3.1/4.0 GA, specifying an analyzer prefix is required for creating a CU analyzer. Since fields.json does not define a "doc_type" to identify the model, the created analyzer's ID will be the specified analyzer prefix. |
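As an alternative to joining the command by hand, the argument list can be assembled in Python. This is only a sketch: `build_converter_command` is a hypothetical helper, the flags are the ones shown in the command above, and all values are the README's placeholders.

```python
import shlex


def build_converter_command(di_version, analyzer_prefix,
                            source_sas_url, source_folder,
                            target_sas_url, target_folder):
    """Return the converter invocation as an argument list (illustrative only)."""
    return [
        "python", "./di_to_cu_converter.py",
        "--DI-version", di_version,
        "--analyzer-prefix", analyzer_prefix,
        "--source-container-sas-url", source_sas_url,
        "--source-blob-folder", source_folder,
        "--target-container-sas-url", target_sas_url,
        "--target-blob-folder", target_folder,
    ]


command = build_converter_command(
    "neural",
    "mySampleAnalyzer",
    "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken",
    "diDatasetFolderName",
    "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken",
    "cuDatasetFolderName",
)

# shlex.join produces a single line quoted for POSIX shells.
print(shlex.join(command))
```

Passing the list directly to `subprocess.run(command)` avoids shell quoting altogether, which may be preferable on Windows.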
| 89 | + |
| 90 | +If you are migrating a _DI 4.0 Preview Document Field Extraction_ dataset, please run this command: |
| 91 | + |
| 92 | + python ./di_to_cu_converter.py --DI-version generative --analyzer-prefix mySampleAnalyzer |
| 93 | + --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName |
| 94 | + --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName |
| 95 | + |
| 96 | +For migration of Document Field Extraction DI 4.0 Preview, specifying an analyzer prefix is optional. However, if you wish to create multiple analyzers from the same analyzer.json, please add an analyzer prefix. If provided, the analyzer ID will become analyzer-prefix_doc-type. Otherwise, it will simply remain as the doc_type in the fields.json. |
| 97 | + |
| 98 | +_**NOTE:** You are only allowed to create one analyzer per analyzer ID._ |
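The analyzer ID rules described above for the two DI versions can be summarized in a short sketch. The function `resolve_analyzer_id` and the doc type `invoice` are hypothetical, chosen for illustration; the converter's actual implementation may differ.

```python
from typing import Optional


def resolve_analyzer_id(di_version: str,
                        doc_type: Optional[str],
                        analyzer_prefix: Optional[str]) -> str:
    """Derive the analyzer ID as described in this README (illustrative only)."""
    if di_version == "neural":
        # fields.json has no doc_type, so the prefix is the whole ID.
        if not analyzer_prefix:
            raise ValueError("an analyzer prefix is required for neural datasets")
        return analyzer_prefix
    if di_version == "generative":
        if not doc_type:
            raise ValueError("fields.json is expected to define a doc_type")
        # The prefix is optional; when present it is joined with an underscore.
        return f"{analyzer_prefix}_{doc_type}" if analyzer_prefix else doc_type
    raise ValueError(f"unknown DI version: {di_version!r}")


print(resolve_analyzer_id("neural", None, "mySampleAnalyzer"))
print(resolve_analyzer_id("generative", "invoice", "mySampleAnalyzer"))
print(resolve_analyzer_id("generative", "invoice", None))
```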
| 99 | + |
| 100 | +### 2. Creating an Analyzer |
| 101 | + |
| 102 | +To create an analyzer using the converted CU analyzer.json, please run this command: |
| 103 | + |
| 104 | + python ./create_analyzer.py |
| 105 | + --analyzer-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer/cuDatasetFolderName/analyzer.json?targetSASToken" |
| 106 | + --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" |
| 107 | + --target-blob-folder cuDatasetFolderName |
| 108 | + |
| 109 | +The analyzer.json file is stored in the specified target blob container and folder. Please get the SAS URL for the analyzer.json file from there. |
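If your container SAS token grants read access, a URL for analyzer.json can also be derived from the container SAS URL rather than generated separately in the portal. The sketch below shows the idea using only the standard library; `blob_sas_url` is a hypothetical helper, and whether a container-scoped token is appropriate for your setup is an assumption you should verify.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_sas_url(container_sas_url: str, blob_folder: str, blob_name: str) -> str:
    """Insert folder and file name into a container SAS URL (illustrative only)."""
    parts = urlsplit(container_sas_url)
    # Append "<folder>/<file>" to the container path and keep the SAS token
    # (the query string) unchanged.
    path = "/".join([parts.path.rstrip("/"), blob_folder.strip("/"), blob_name])
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


analyzer_url = blob_sas_url(
    "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken",
    "cuDatasetFolderName",
    "analyzer.json",
)
print(analyzer_url)
```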
| 110 | + |
| 111 | +Additionally, please use the analyzer ID from this output when running the call_analyze.py tool. |
| 112 | + |
| 113 | +Ex: |
| 114 | + |
| 115 | + |
| 116 | + |
| 117 | +### 3. Running Analyze |
| 118 | + |
| 119 | +To analyze a specific PDF or original file, please run this command: |
| 120 | + |
| 121 | + python ./call_analyze.py --analyzer-id mySampleAnalyzer |
| 122 | + --pdf-sas-url "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken" |
| 123 | + --output-json "./desired-path-to-analyzer-results.json" |
| 124 | + |
| 125 | +For the --analyzer-id argument, please refer to the analyzer ID created in the previous step. |
| 126 | +Additionally, specifying --output-json is optional. The default location for the output is "./sample_documents/analyzer_result.json". |
| 127 | + |
| 128 | +## Possible Issues |
| 129 | +These are some issues that you might run into when creating an analyzer or running analyze. |
| 130 | +### Creating an Analyzer |
| 131 | +For any **400** error, please validate the following: |
| 132 | +- You are using a valid endpoint. Example: _https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview_ |
| 133 | +- Your converted CU dataset meets the latest naming constraints. Please ensure that all the fields in your analyzer.json file meet the requirements below. If not, please make the changes manually. |
| 134 | + |
| 135 | + - Field name starts with a letter or an underscore |
| 136 | + - Field name length is between 1 and 64 characters |
| 137 | + - Field name only uses letters, numbers, and underscores |
| 138 | +- Your analyzer ID meets these naming requirements: |
| 139 | + - ID is between 1 and 64 characters long |
| 140 | + - Only uses letters, numbers, dots, underscores, and hyphens |
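The naming rules above can be checked locally before creating an analyzer. The following is a minimal sketch, assuming "letters" means ASCII letters (this README does not say otherwise); the pattern and function names are hypothetical and not part of the migration tool.

```python
import re

# Field name: starts with a letter or underscore, then letters, digits,
# or underscores, 1 to 64 characters in total.
FIELD_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")

# Analyzer ID: letters, digits, dots, underscores, and hyphens,
# 1 to 64 characters in total.
ANALYZER_ID_RE = re.compile(r"[A-Za-z0-9._-]{1,64}")


def is_valid_field_name(name: str) -> bool:
    return FIELD_NAME_RE.fullmatch(name) is not None


def is_valid_analyzer_id(analyzer_id: str) -> bool:
    return ANALYZER_ID_RE.fullmatch(analyzer_id) is not None


for candidate in ["invoice_total", "_notes", "1st_item", "unit price"]:
    print(candidate, is_valid_field_name(candidate))
```

Running a check like this over the field names in analyzer.json can help pinpoint which field triggers a 400 error.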
| 141 | + |
| 142 | +A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified. |
| 143 | + |
| 144 | +A **409** error implies that the analyzer ID has already been used to create an analyzer. Please try using another ID. |
| 145 | +### Calling Analyze |
| 146 | +- A **400** error implies a potentially incorrect endpoint or SAS URL. Ensure that your endpoint is valid _(https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview)_ and that you are using the correct SAS URL for the document under analysis. |
| 147 | +- A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified. |
| 148 | +- A **404** error implies that no analyzer exists with the analyzer ID you have specified. Resolve it by using the correct ID or by creating an analyzer with that ID. |
| 149 | + |
| 150 | +## Points to Note |
| 151 | +1. Make sure to use Python version 3.9 or above. |
| 152 | +2. Signature field types (such as in the previous versions of DI) are not supported in Content Understanding yet. Thus, during migration, these signature fields will be ignored when creating the analyzer. |
| 153 | +3. The content of training documents will be retained in Content Understanding model metadata, under storage specifically. Additional explanation can be found here: https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json |
| 154 | +4. All data conversion targets the Content Understanding Preview.2 version only. |
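For the Python version requirement in point 1, a quick check could look like the sketch below. The helper `meets_minimum_python` is hypothetical and is not something the tools are known to run themselves.

```python
import sys


def meets_minimum_python(version_info=sys.version_info, minimum=(3, 9)) -> bool:
    """Return True if the interpreter version is at least `minimum`."""
    # Compare only the (major, minor) pair.
    return tuple(version_info[:2]) >= minimum


if not meets_minimum_python():
    print("Python 3.9 or above is required; found", sys.version.split()[0])
```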