diff --git a/python/di_to_cu_migration_tool/README.md b/python/di_to_cu_migration_tool/README.md
index 48a5d64..ddbc652 100644
--- a/python/di_to_cu_migration_tool/README.md
+++ b/python/di_to_cu_migration_tool/README.md
@@ -1,154 +1,192 @@
 # Document Intelligence to Content Understanding Migration Tool (Python)
-Welcome! We've created this tool to help convert your Document Intelligence (DI) datasets to Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as seen in AI Foundry. The following DI versions are supported:
-- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (seen in Document Intelligence Studio) --> DI-version = neural
-- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (seen in AI Foundry/AI Services/Vision + Document/Document Field Extraction) --> DI-version = generative
+Welcome! This tool helps convert your Document Intelligence (DI) datasets to the Content Understanding (CU) **Preview.2** 2025-05-01-preview format, as used in AI Foundry. The following DI versions are supported:
+
+- Custom Extraction Model DI 3.1 GA (2023-07-31) to DI 4.0 GA (2024-11-30) (Document Intelligence Studio) --> DI-version = neural
+- Document Field Extraction Model 4.0 Preview (2024-07-31-preview) (AI Foundry / AI Services / Vision + Document / Document Field Extraction) --> DI-version = generative

-To help you identify which version of Document Intelligence your dataset is in, please consult the sample documents provided under this folder to determine which format matches that of yours. Additionally, you can also identify the version through your DI project's UX as well. For instance, Custom Extraction DI 3.1/4.0 GA is a part of Document Intelligence Studio (i.e., https://documentintelligence.ai.azure.com/studio) and Document Field Extraction DI 4.0 Preview is only available on Azure AI Foundry as a preview service (i.e., https://ai.azure.com/explore/aiservices/vision/document/extraction).
+To identify which version your Document Intelligence dataset uses, consult the sample documents provided under this folder and match their format against yours. You can also verify the version through your DI project's user experience: for example, Custom Extraction DI 3.1/4.0 GA appears in Document Intelligence Studio (https://documentintelligence.ai.azure.com/studio), whereas Document Field Extraction DI 4.0 Preview is available only on Azure AI Foundry as a preview service (https://ai.azure.com/explore/aiservices/vision/document/extraction).

-For migration from these DI versions to Content Understanding Preview.2, this tool first needs to convert the DI dataset to a CU compatible format. Once converted, you have the option to create a Content Understanding Analyzer, which will be trained on the converted CU dataset. Additionally, you can further test this model to ensure its quality.
+For migrating from these DI versions to Content Understanding Preview.2, this tool first converts the DI dataset into a CU-compatible format. After conversion, you can create a Content Understanding Analyzer trained on the converted CU dataset and test it to validate its quality.

 ## Details About the Tools
-To provide you with some further details, here is a more intricate breakdown of each of the 3 CLI tools and their capabilities:
-* **di_to_cu_converter.py**:
-  * This CLI tool conducts your first step of migration. The tool refers to your labelled Document Intelligence dataset and converts it into a CU format compatible dataset. Through this tool, we map the following files accordingly: fields.json to analyzer.json, DI labels.json to CU labels.json, and ocr.json to result.json.
-  * Depending on the DI version you wish to migrate from, we use [cu_converter_neural.py](cu_converter_neural.py) and [cu_converter_generative.py](cu_converter_generative.py) accordingly to convert your fields.json and labels.json files.
-  * For OCR conversion, the tool creates a sample CU analyzer to gather raw OCR results via an Analyze request for each original file in the DI dataset. Additionally, since the sample analyzer contains no fields, we get the results.json files without any fields as well. For more details, please refer to [get_ocr.py](get_ocr.py).
-* **create_analyzer.py**:
-  * Once the dataset is converted to CU format, this CLI tool creates a CU analyzer while referring to the converted dataset.
-* **call_analyze.py**:
-  * This CLI tool can be used to ensure that the migration has successfully completed and to test the quality of the previously created analyzer.
+
+Here is a detailed breakdown of the three CLI tools and their functionality:
+
+* **di_to_cu_converter.py**
+  * This CLI tool performs the first migration step. It converts your labeled Document Intelligence dataset into a CU-compatible dataset, mapping the following files:
+    - fields.json → analyzer.json
+    - DI labels.json → CU labels.json
+    - ocr.json → result.json
+  * Depending on the DI version, the tool uses either [cu_converter_neural.py](cu_converter_neural.py) or [cu_converter_generative.py](cu_converter_generative.py) to convert your fields.json and labels.json files.
+  * For OCR data conversion, it creates a sample CU analyzer to gather raw OCR results via an Analyze request for each original file in the DI dataset. Since this sample analyzer defines no fields, the resulting result.json files contain no fields either. Refer to [get_ocr.py](get_ocr.py) for more details.
+
+* **create_analyzer.py**
+  * After the dataset is converted to CU format, this CLI tool creates a CU analyzer from the converted dataset.
+
+* **call_analyze.py**
+  * This CLI tool verifies that the migration completed successfully and lets you test the quality of the created analyzer.

 ## Setup
-To set up this tool, you will need to do the following steps:
-1. Run the requirements.txt file to install the needed dependencies via **pip install -r ./requirements.txt**
+
+Follow these steps to set up the tool:
+
+1. Install dependencies by running:
+   `pip install -r ./requirements.txt`
 2. Rename the file **.sample_env** to **.env**
-3. Replace the following values in the **.env** file:
-   - **HOST:** Update this to your Azure AI service endpoint.
-     - Ex: "https://sample-azure-ai-resource.services.ai.azure.com"
-     - Avoid the "/" at the end.
-   ![Alt text](assets/sample-azure-resource.png "Azure AI Service")
-   ![Alt text](assets/endpoint.png "Azure AI Service Endpoints")
-   - **SUBSCRIPTION_KEY:** Update this to your Azure AI Service's API Key or Subscription ID to identify and authenticate the API request.
-     - You can locate your API KEY here: ![Alt text](assets/endpoint-with-keys.png "Azure AI Service Endpoints With Keys")
-     - If you are using AAD, please refer to your Subscription ID: ![Alt text](assets/subscription-id.png "Azure AI Service Subscription ID")
-   - **API_VERSION:** This version ensures that you are converting the dataset to CU Preview.2. No changes are needed here.
+3. Edit the **.env** file to update the following values (a completed example follows this list):
+   - **HOST:** Update to your Azure AI service endpoint.
+     - Example: `"https://sample-azure-ai-resource.services.ai.azure.com"`
+     - Do not include a trailing slash (`/`).
+   ![Azure AI Service](assets/sample-azure-resource.png)
+   ![Azure AI Service Endpoints](assets/endpoint.png)
+   - **SUBSCRIPTION_KEY:** Update to your Azure AI Service API Key or Subscription ID to authenticate API requests.
+     - Locate your API Key here: ![Azure AI Service Endpoints With Keys](assets/endpoint-with-keys.png)
+     - If using Azure Active Directory (AAD), refer to your Subscription ID: ![Azure AI Service Subscription ID](assets/subscription-id.png)
+   - **API_VERSION:** This is preset to the CU Preview.2 version; no changes are needed.
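+
+For reference, here is a minimal sketch of a completed **.env**. The variable names follow the list above; the values are placeholders, and the exact quoting in **.sample_env** may differ:
+
+```
+HOST="https://sample-azure-ai-resource.services.ai.azure.com"
+SUBSCRIPTION_KEY="<your-api-key-or-subscription-id>"
+API_VERSION="2025-05-01-preview"
+```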

 ## How to Locate Your Document Field Extraction Dataset for Migration
-To migrate your Document Field Extraction dataset from AI Foundry, please follow the steps below:
-1. On the bottom left of your Document Field Extraction project page, please select "Management Center."
-   ![Alt text](assets/management-center.png "Management Center")
-2. Now on the Management Center page, please select "View All" from the Connected Resources section.
-   ![Alt text](assets/connected-resources.png "Connected Resources")
-3. Within these resources, look for the resource with type "Azure Blob Storage." This resource's target URL contains the location of your dataset's storage account (in yellow) and blob container (in blue).
-   ![Alt text](assets/manage-connections.png "Manage Connections")
-   Using these values, navigate to your blob container. Then, select the "labelingProjects" folder. From there, select the folder with the same name as the blob container. Here, you'll locate all the contents of your project in the "data" folder.
-   - For example, the sample Document Field Extraction project is stored at
-   ![Alt text](assets/azure-portal.png "Azure Portal")
+
+To migrate your Document Field Extraction dataset from AI Foundry, follow these steps:
+
+1. On the bottom-left of your Document Field Extraction project page, select **Management Center**.
+   ![Management Center](assets/management-center.png)
+2. On the Management Center page, select **View All** in the Connected Resources section.
+   ![Connected Resources](assets/connected-resources.png)
+3. Locate the resource with type **Azure Blob Storage**. The resource's target URL contains your dataset's storage account (highlighted in yellow) and blob container (highlighted in blue).
+   ![Manage Connections](assets/manage-connections.png)
+   Using these values, navigate to your blob container, then select the **labelingProjects** folder. Next, select the folder named after the blob container. You will find your project contents in its **data** folder.
+
+For example, the sample Document Field Extraction project is stored at:
+![Azure Portal](assets/azure-portal.png)
+
 ## How to Find Your Source and Target SAS URLs
-To run migration, you will need to specify the source SAS URL (location of your Document Intelligence dataset) and target SAS URL (location for your Content Understanding dataset).
-To locate the SAS URL for a file or folder for any container URL arguments, please follow these steps:
-1. Navigate to your storage account in Azure Portal, and from the left pane, select "Storage Browser."
-   ![Alt text](assets/storage-browser.png "Storage Browser")
-2. Select the source/target blob container for either where your DI dataset is present or where your CU dataset will be. Click on the extended menu on the side and select "Generate SAS."
-   ![Alt text](assets/generate-sas.png "Generate SAS")
-3. Configure the permissions and expiry for your SAS URL accordingly.
-   For the DI source dataset, please select these permissions: _**Read & List**_
-   For the CU target dataset, please select these permissions: _**Read, Add, Create, & Write**_
-   Once configured, please select "Generate SAS Token and URL" & copy the URL shown under "Blob SAS URL."
-   ![Alt text](assets/generate-sas-pop-up.png "Generate SAS Pop-Up")
-Notes:
-- Since SAS URL does not point to a specific folder, to ensure the correct path for source and target, please specify the correct dataset folder as --source-blob-folder or --target-blob-folder.
-- To get the SAS URL for a single file, navigate to the specific file and repeat the steps above, such as:
-   ![Alt text](assets/individual-file-generate-sas.png "Generate SAS for Individual File")
+
+To run migration, you need to specify the source SAS URL (the location of your Document Intelligence dataset) and the target SAS URL (the location for your Content Understanding dataset).
+
+To obtain the SAS URL for a file or folder for any container URL argument, follow these steps:
+
+1. In the Azure Portal, navigate to your storage account and select **Storage Browser** from the left pane.
+   ![Storage Browser](assets/storage-browser.png)
+2. Select the blob container where your DI dataset resides (source) or where your CU dataset will be stored (target). Open the container's extended menu and select **Generate SAS**.
+   ![Generate SAS](assets/generate-sas.png)
+3. Configure the permissions and expiry for your SAS URL as follows:
+
+   - For the **DI source dataset**, select permissions: _**Read & List**_
+   - For the **CU target dataset**, select permissions: _**Read, Add, Create, & Write**_
+
+   After configuring, click **Generate SAS Token and URL** and copy the URL shown under **Blob SAS URL**.
+
+   ![Generate SAS Pop-Up](assets/generate-sas-pop-up.png)
+
+**Notes:**
+
+- A SAS URL does not point to a specific folder. To ensure the correct paths for the source and target datasets, specify the dataset folder via `--source-blob-folder` and `--target-blob-folder`.
+- To generate a SAS URL for a single file, navigate directly to that file and repeat the steps above, for example:
+  ![Generate SAS for Individual File](assets/individual-file-generate-sas.png)
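+
+If you prefer scripting this step over the portal, SAS URLs can also be generated programmatically. Below is a minimal sketch using the `azure-storage-blob` package with an account key; the package and the helper shown are assumptions for illustration (they are not part of this tool), while the permissions mirror the ones listed above:
+
+```python
+# Sketch: generate source/target container SAS URLs with azure-storage-blob.
+from datetime import datetime, timedelta, timezone
+
+from azure.storage.blob import ContainerSasPermissions, generate_container_sas
+
+def container_sas_url(account: str, container: str, key: str, perms: ContainerSasPermissions) -> str:
+    token = generate_container_sas(
+        account_name=account,
+        container_name=container,
+        account_key=key,
+        permission=perms,
+        expiry=datetime.now(timezone.utc) + timedelta(hours=8),  # pick an expiry that covers the migration run
+    )
+    return f"https://{account}.blob.core.windows.net/{container}?{token}"
+
+# DI source dataset: Read & List
+source_url = container_sas_url("sourceStorageAccount", "sourceContainer", "<account-key>",
+                               ContainerSasPermissions(read=True, list=True))
+# CU target dataset: Read, Add, Create & Write
+target_url = container_sas_url("targetStorageAccount", "targetContainer", "<account-key>",
+                               ContainerSasPermissions(read=True, add=True, create=True, write=True))
+```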

 ## How to Run
-To run the 3 tools, please refer to the following commands. For better readability, they are split across lines. Please remove this extra spacing before execution.
-_**NOTE:** Use "" when entering in a URL._
+
+Below are example commands to run the three tools. For readability, each command is split across multiple lines with `\` continuations; if your shell does not support them, join the command into a single line before execution.
+
+_**NOTE:** Always enclose URLs in double quotes (`""`)._

-### 1. Converting Document Intelligence to Content Understanding Dataset
-If you are migrating a _DI 3.1/4.0 GA Custom Extraction_ dataset, please run this command:
-    python ./di_to_cu_converter.py --DI-version neural --analyzer-prefix mySampleAnalyzer
-    --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName
-    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
-For migration of Custom Extraction DI 3.1/4.0 GA, specifying an analyzer prefix is crucial for creating a CU analyzer. Since there's no "doc_type" defined for any identification in the fields.json, the created analyzer will have an analyzer ID of the specified analyzer prefix.
-If you are migrating a _DI 4.0 Preview Document Field Extraction_ dataset, please run this command:
-    python ./di_to_cu_converter.py --DI-version generative --analyzer-prefix mySampleAnalyzer
-    --source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName
-    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
-For migration of Document Field Extraction DI 4.0 Preview, specifying an analyzer prefix is optional. However, if you wish to create multiple analyzers from the same analyzer.json, please add an analyzer prefix. If provided, the analyzer ID will become analyzer-prefix_doc-type. Otherwise, it will simply remain as the doc_type in the fields.json.
-_**NOTE:** You are only allowed to create one analyzer per analyzer ID._
+### 1. Convert Document Intelligence to Content Understanding Dataset
+
+If migrating a _DI 3.1/4.0 GA Custom Extraction_ dataset, run:
+
+```
+python ./di_to_cu_converter.py --DI-version neural --analyzer-prefix mySampleAnalyzer \
+--source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName \
+--target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
+```
+
+For this migration, specifying an analyzer prefix is crucial for creating a CU analyzer. Since the fields.json does not define a "doc_type" for identification, the created analyzer's ID will be the specified analyzer prefix.
+
+If migrating a _DI 4.0 Preview Document Field Extraction_ dataset, run:
+
+```
+python ./di_to_cu_converter.py --DI-version generative --analyzer-prefix mySampleAnalyzer \
+--source-container-sas-url "https://sourceStorageAccount.blob.core.windows.net/sourceContainer?sourceSASToken" --source-blob-folder diDatasetFolderName \
+--target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" --target-blob-folder cuDatasetFolderName
+```
+
+For this migration, specifying an analyzer prefix is optional. However, to create multiple analyzers from the same analyzer.json, add an analyzer prefix. If provided, the analyzer ID becomes `analyzer-prefix_doc-type` (for example, the prefix `mySampleAnalyzer` with a doc_type of `receipt` would yield `mySampleAnalyzer_receipt`); otherwise, it remains the `doc_type` from fields.json.
+
+_**NOTE:** Only one analyzer can be created per analyzer ID._
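+
+Before moving on to analyzer creation, you can optionally confirm that the converted files landed in the target folder. A minimal sketch, again assuming the `azure-storage-blob` package (not necessarily in requirements.txt):
+
+```python
+# Sketch: list the converted CU dataset files in the target container.
+from azure.storage.blob import ContainerClient
+
+# Reuse the same target container SAS URL and folder passed to di_to_cu_converter.py.
+target_container_sas_url = "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken"
+target_blob_folder = "cuDatasetFolderName"
+
+container = ContainerClient.from_container_url(target_container_sas_url)
+for blob in container.list_blobs(name_starts_with=target_blob_folder + "/"):
+    print(blob.name)  # expect analyzer.json plus the converted labels.json and result.json files
+```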

-### 2. Creating an Analyzer
-To create an analyzer using the converted CU analyzer.json, please run this command:
-    python ./create_analyzer.py
-    --analyzer-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer/cuDatasetFolderName/analyzer.json?targetSASToken"
-    --target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken"
-    --target-blob-folder cuDatasetFolderName
-The analyzer.json file is stored in the specified target blob container and folder. Please get the SAS URL for the analyzer.json file from there.
-Additionally, please use the analyzer ID from this output when running the call_analyze.py tool.
-Ex:
-![Alt text](assets/analyzer.png "Sample Analyzer Creation")
+### 2. Create an Analyzer
+
+To create an analyzer using the converted CU analyzer.json, run:
+
+```
+python ./create_analyzer.py \
+--analyzer-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer/cuDatasetFolderName/analyzer.json?targetSASToken" \
+--target-container-sas-url "https://targetStorageAccount.blob.core.windows.net/targetContainer?targetSASToken" \
+--target-blob-folder cuDatasetFolderName
+```
+
+The `analyzer.json` file is stored in the specified target blob container and folder; obtain the SAS URL for `analyzer.json` from there.
+
+Use the analyzer ID from this tool's output when running `call_analyze.py` in the next step.
+
+Example:
+
+![Sample Analyzer Creation](assets/analyzer.png)

-### 3. Running Analyze
-To analyze a specific PDF or original file, please run this command:
-    python ./call_analyze.py --analyzer-id mySampleAnalyzer
-    --pdf-sas-url "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken"
-    --output-json "./desired-path-to-analyzer-results.json"
-For the --analyzer-id argument, please refer to the analyzer ID created in the previous step.
-Additionally, specifying --output-json isn't necessary. The default location for the output is "./sample_documents/analyzer_result.json."
+### 3. Run Analyze
+
+To analyze a specific PDF or original file, run:
+
+```
+python ./call_analyze.py --analyzer-id mySampleAnalyzer \
+--pdf-sas-url "https://storageAccount.blob.core.windows.net/container/folder/sample.pdf?SASToken" \
+--output-json "./desired-path-to-analyzer-results.json"
+```
+
+For `--analyzer-id`, use the analyzer ID created in the previous step.
+
+Specifying `--output-json` is optional; if omitted, the default output location is `./sample_documents/analyzer_result.json`.
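+
+To spot-check the extraction quality, you can load the saved result JSON and print the extracted fields. This is a minimal sketch only; the `result` → `contents` → `fields` layout is an assumption about the Preview.2 response shape, so adjust the keys to match your actual output file:
+
+```python
+# Sketch: print extracted fields from a saved analyze result.
+import json
+
+with open("./sample_documents/analyzer_result.json") as f:
+    response = json.load(f)
+
+# NOTE: the key names below are assumptions; inspect your file to confirm them.
+for content in response.get("result", {}).get("contents", []):
+    for name, field in content.get("fields", {}).items():
+        # Each field typically carries a type-specific value key, e.g. "valueString".
+        values = {k: v for k, v in field.items() if k.startswith("value")}
+        print(name, "->", values)
+```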

 ## Possible Issues
-These are some issues that you might run into when creating an analyzer or running analyze.
+
+Below are common issues you might encounter when creating an analyzer or running analyze.
+
 ### Creating an Analyzer
-For any **400** error, please validate the following:
-- You are using a valid endpoint. Example: _https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview_
-- Your converted CU dataset may not meet the latest naming constraints. Please ensure that all the fields in your analyzer.json file meet these requirements. If not, please make the changes manually.
-  - Field name only starts with a letter or an underscore
-  - Field name length is between 1 and 64 characters
-  - Only uses letters, numbers, and underscores
-- Your analyzer ID meets these naming requirements
-  - ID is between 1 and 64 characters long
-  - Only uses letters, numbers, dots, underscores, and hyphens
-A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified.
-A **409** error implies that the analyzer ID has already been used to create an analyzer. Please try using another ID.
+
+- **400 Bad Request**:
+  Validate the following (a sketch for checking the naming rules follows this list):
+  - The endpoint URL is valid. Example:
+    `https://yourEndpoint/contentunderstanding/analyzers/yourAnalyzerID?api-version=2025-05-01-preview`
+  - Your converted CU dataset respects the naming constraints below. If needed, manually correct the fields in `analyzer.json`:
+    - Field names start with a letter or underscore
+    - Field names are between 1 and 64 characters long
+    - Only letters, numbers, and underscores are allowed
+  - Your analyzer ID meets these naming requirements:
+    - Length between 1 and 64 characters
+    - Only letters, numbers, dots, underscores, and hyphens allowed
+
+- **401 Unauthorized**:
+  Authentication failure. Verify that your API key and/or subscription ID are correct and that you have access to the specified endpoint.
+
+- **409 Conflict**:
+  The analyzer ID has already been used to create an analyzer. Try a different analyzer ID.
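+
+The naming rules above translate directly into two regular expressions, so you can pre-validate your converted dataset before creating the analyzer. A minimal sketch; the `fieldSchema.fields` path is an assumption about analyzer.json's layout, so adjust it to match your converted file:
+
+```python
+# Sketch: pre-validate field names and the analyzer ID against the rules above.
+import json
+import re
+
+FIELD_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,63}$")  # starts with letter/underscore, 1-64 chars
+ANALYZER_ID = re.compile(r"^[A-Za-z0-9._-]{1,64}$")        # letters, numbers, dots, underscores, hyphens
+
+def check_analyzer(path: str, analyzer_id: str) -> None:
+    with open(path) as f:
+        analyzer = json.load(f)
+    if not ANALYZER_ID.match(analyzer_id):
+        print(f"Invalid analyzer ID: {analyzer_id}")
+    # NOTE: the fieldSchema.fields path is an assumption; adjust to your file.
+    for name in analyzer.get("fieldSchema", {}).get("fields", {}):
+        if not FIELD_NAME.match(name):
+            print(f"Invalid field name: {name}")
+
+check_analyzer("./analyzer.json", "mySampleAnalyzer")
+```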

 ### Calling Analyze
-- A **400** error implies a potentially incorrect endpoint or SAS URL. Ensure that your endpoint is valid _(https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview)_ and that you are using the correct SAS URL for the document under analysis.
-- A **401** error implies a failure in authentication. Please ensure that your API key and/or subscription ID are correct and that you have access to the endpoint specified.
-- A **404** error implies that no analyzer exists with the analyzer ID you have specified. Mitigate it by calling the correct ID or creating an analyzer with such an ID.
-
-## Points to Note:
-1. Make sure to use Python version 3.9 or above.
-2. Signature field types (such as in the previous versions of DI) are not supported in Content Understanding yet. Thus, during migration, these signature fields will be ignored when creating the analyzer.
-3. The content of training documents will be retained in Content Understanding model metadata, under storage specifically. Additional explanation can be found here: https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
-5. All the data conversion will be for Content Understanding preview.2 version only.
+
+- **400 Bad Request**:
+  Possibly an incorrect endpoint or SAS URL. Ensure that your endpoint is valid:
+  `https://yourendpoint/contentunderstanding/analyzers/yourAnalyzerID:analyze?api-version=2025-05-01-preview`
+  Also confirm that you are using the correct SAS URL for the document under analysis.
+
+- **401 Unauthorized**:
+  Authentication failure. Verify that your API key and/or subscription ID are correct and that you have access to the specified endpoint.
+
+- **404 Not Found**:
+  No analyzer exists with the specified analyzer ID. Use the correct analyzer ID, or create an analyzer with that ID.
+
+## Points to Note
+
+1. Use Python version 3.9 or higher.
+2. Signature field types (as in previous DI versions) are not yet supported in Content Understanding; signature fields are therefore ignored during migration when creating the analyzer.
+3. The content of training documents is retained in Content Understanding model metadata, specifically under storage. More details: https://learn.microsoft.com/en-us/legal/cognitive-services/content-understanding/transparency-note?toc=%2Fazure%2Fai-services%2Fcontent-understanding%2Ftoc.json&bc=%2Fazure%2Fai-services%2Fcontent-understanding%2Fbreadcrumb%2Ftoc.json
+4. All data conversion targets the Content Understanding Preview.2 version only.
\ No newline at end of file