Azure-Samples
diff --git a/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea b/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea
-496 KB
diff --git a/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.labels.json b/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.jpg.labels.json
rename from data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.labels.json
rename to data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.jpg.labels.json
diff --git a/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.result.json b/data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.jpg.result.json
rename from data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.result.json
rename to data/document_training/17a84146-e910-460c-bf80-a625e6f64fea.jpg.result.json
diff --git a/data/document_training/29d60394-3da1-4714-abdc-ff0993009872 b/data/document_training/29d60394-3da1-4714-abdc-ff0993009872
-804 KB
diff --git a/data/document_training/29d60394-3da1-4714-abdc-ff0993009872.labels.json b/data/document_training/29d60394-3da1-4714-abdc-ff0993009872.jpg.labels.json
rename from data/document_training/29d60394-3da1-4714-abdc-ff0993009872.labels.json
rename to data/document_training/29d60394-3da1-4714-abdc-ff0993009872.jpg.labels.json
diff --git a/data/document_training/29d60394-3da1-4714-abdc-ff0993009872.result.json b/data/document_training/29d60394-3da1-4714-abdc-ff0993009872.jpg.result.json
rename from data/document_training/29d60394-3da1-4714-abdc-ff0993009872.result.json
rename to data/document_training/29d60394-3da1-4714-abdc-ff0993009872.jpg.result.json
diff --git a/notebooks/analyzer_training.ipynb b/notebooks/analyzer_training.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/notebooks/field_extraction_pro_mode.ipynb b/notebooks/field_extraction_pro_mode.ipynb
Lines changed: 2 additions & 2 deletions
diff --git a/python/content_understanding_client.py b/python/content_understanding_client.py
Lines changed: 36 additions & 7 deletions
@@ -23,7 +23,7 @@
     "\n",
     "## Prerequisites\n",
     "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
-    "1. Follow steps in [Set env for trainging data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related env variables `TRAINING_DATA_SAS_URL` and `TRAINING_DATA_PATH` into the `.env` file.\n",
+    "1. Follow steps in [Set env for training data](../docs/set_env_for_training_data_and_reference_doc.md) to add training data related env variables `TRAINING_DATA_SAS_URL` and `TRAINING_DATA_PATH` into the [.env](./.env) file.\n",
     "    - `TRAINING_DATA_SAS_URL`: SAS URL for your Azure Blob container. \n",
     "    - `TRAINING_DATA_PATH`: Folder path within the container to upload training data. \n",
     "1. Install packages needed to run the sample\n",
@@ -145,7 +145,7 @@
     "## Create analyzer with defined schema\n",
     "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
     "\n",
-    "We use **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** that's set up in the `.env` file and used in the previous step."
+    "We use the **TRAINING_DATA_SAS_URL** and **TRAINING_DATA_PATH** values that are set up in the [.env](./.env) file and were used in the previous step."
    ]
   },
   {
 
@@ -28,7 +28,7 @@
    "source": [
     "## Prerequisites\n",
     "1. Ensure Azure AI service is configured following [steps](../README.md#configure-azure-ai-service-resource)\n",
-    "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the `.env` file.\n",
+    "1. If using reference documents, please follow [Set env for reference doc](../docs/set_env_for_training_data_and_reference_doc.md) to set up `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH` in the [.env](./.env) file.\n",
     "    - `REFERENCE_DOC_SAS_URL`: SAS URL for your Azure Blob container. \n",
     "    - `REFERENCE_DOC_PATH`: Folder path within the container for uploading reference docs. \n",
     "1. Install the required packages to run the sample."
@@ -181,7 +181,7 @@
     "## Create analyzer with defined schema for Pro mode\n",
     "Before creating the analyzer, you should fill in the constant ANALYZER_ID with a relevant name to your task. Here, we generate a unique suffix so this cell can be run multiple times to create different analyzers.\n",
     "\n",
-    "We use **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** that's set up in the `.env` file and used in the previous step."
+    "We use the **REFERENCE_DOC_SAS_URL** and **REFERENCE_DOC_PATH** values that are set up in the [.env](./.env) file and were used in the previous step."
    ]
   },
   {
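The notebook prerequisites above rely on two pairs of environment variables. A minimal sketch of reading one such pair and failing early when a value is missing follows. The helper name `read_blob_settings` is hypothetical and not part of the sample; the trailing-slash normalization is an assumption, based only on the client concatenating the path prefix directly with the file name.

```python
import os
from typing import Mapping, Optional, Tuple


def read_blob_settings(
    url_var: str, path_var: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[str, str]:
    """Return (sas_url, folder_prefix) read from the environment.

    Hypothetical helper: raises KeyError naming every missing variable,
    so a misconfigured .env file is reported before any upload starts.
    """
    env = os.environ if env is None else env
    missing = [name for name in (url_var, path_var) if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    prefix = env[path_var]
    # Assumption: the prefix is joined to file names by plain concatenation,
    # so it should end with a slash.
    if not prefix.endswith("/"):
        prefix += "/"
    return env[url_var], prefix
```

For the training notebook this would be called as `read_blob_settings("TRAINING_DATA_SAS_URL", "TRAINING_DATA_PATH")`, and for pro mode with `REFERENCE_DOC_SAS_URL` and `REFERENCE_DOC_PATH`.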
 
@@ -422,8 +422,10 @@ async def generate_training_data_on_blob(
                         await self._upload_file_to_blob(container_client, ocr_result_path, ocr_result_blob_path)
                         self._logger.info(f"Uploaded training data for {filename}")
                     else:
-                        self._logger.warning(
-                            f"Label file {label_filename} or OCR result file {ocr_result_filename} does not exist for {filename}, skipping."
+                        raise FileNotFoundError(
+                            f"Label file '{label_filename}' or OCR result file '{ocr_result_filename}' "
+                            f"does not exist in '{training_docs_folder}'. "
+                            f"Please ensure both files exist for '{filename}'."
                         )
 
     async def generate_knowledge_base_on_blob(
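The hunk above turns a missing label or OCR result file from a logged warning into a hard failure. A standalone sketch of that check is shown here, assuming the companion files are named by appending a suffix to the full document file name, as the renamed sample files in this change suggest (`<name>.jpg.labels.json`). The function name and the two suffix constants are illustrative assumptions, not the client's actual definitions.

```python
import os
from typing import Tuple

# Assumed suffixes, inferred from the renamed sample files in this change;
# the real constants on the client class may differ.
LABEL_FILE_SUFFIX = ".labels.json"
OCR_RESULT_FILE_SUFFIX = ".result.json"


def require_training_pair(training_docs_folder: str, filename: str) -> Tuple[str, str]:
    """Return (label_path, ocr_result_path) for a training document.

    Raises FileNotFoundError listing every missing companion file
    instead of silently skipping the document.
    """
    label_path = os.path.join(training_docs_folder, filename + LABEL_FILE_SUFFIX)
    ocr_path = os.path.join(training_docs_folder, filename + OCR_RESULT_FILE_SUFFIX)
    missing = [os.path.basename(p) for p in (label_path, ocr_path) if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(
            f"Missing {', '.join(missing)} in '{training_docs_folder}'. "
            f"Please ensure both files exist for '{filename}'."
        )
    return label_path, ocr_path
```

Failing fast here means an incomplete training set is noticed before the upload runs, rather than producing an analyzer trained on fewer documents than intended.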
@@ -451,20 +453,47 @@ async def generate_knowledge_base_on_blob(
                             try:
                                 analyze_result = self.get_prebuilt_document_analyze_result(file_path)
                             except Exception as e:
-                                self._logger.error(f"Error of getting analyze result of {filename}: {e}")
-                                continue
+                                self._logger.error(
+                                    f"Failed to get the analyze result for '{filename}'. "
+                                    "Please check the error message and consider retrying or removing this file."
+                                )
+                                raise e
                             await self._upload_json_to_blob(container_client, analyze_result, result_file_blob_path)
                         else:
-                            self._logger.info(f"Using existing result.json for {filename}")
+                            self._logger.info(f"Using existing result.json for '{filename}'")
                             result_file_path = os.path.join(dirpath, result_file_name)
                             if not os.path.exists(result_file_path):
-                                self._logger.warning(f"Result file {result_file_name} does not exist, skipping.")
-                                continue
+                                raise FileNotFoundError(
+                                    f"Result file '{result_file_name}' does not exist in '{dirpath}'. "
+                                    f"Please run analyze first or remove this file from the folder."
+                                )
                             await self._upload_file_to_blob(container_client, result_file_path, result_file_blob_path)
                         # Upload the original file
                         file_blob_path = storage_container_path_prefix + filename
                         await self._upload_file_to_blob(container_client, file_path, file_blob_path)
                         resources.append({"file": filename, "resultFile": result_file_name})
+                    elif filename.endswith(self.OCR_RESULT_FILE_SUFFIX) and skip_analyze:
+                        original_filename = filename[: -len(self.OCR_RESULT_FILE_SUFFIX)]
+                        if original_filename in filenames:
+                            # Skip result.json files that belong to a file with a supported document type
+                            _, original_file_ext = os.path.splitext(original_filename)
+                            if self.is_supported_type_by_file_ext(original_file_ext, is_document=True):
+                                continue
+                            else:
+                                raise ValueError(
+                                    f"The original file of '{filename}' is not a supported document type; "
+                                    f"please remove the result file '{filename}' and '{original_filename}'."
+                                )
+                        else:
+                            raise ValueError(
+                                f"Result file '{filename}' does not correspond to an original file; "
+                                "please remove it."
+                            )
+                    else:
+                        raise ValueError(
+                            f"File '{filename}' is not a supported document type, "
+                            f"please remove it or convert it to a supported type."
+                        )
             # Upload sources.jsonl
             await self.upload_jsonl_to_blob(
                 container_client, resources, storage_container_path_prefix + self.KNOWLEDGE_SOURCE_LIST_FILE_NAME)
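The branching added to `generate_knowledge_base_on_blob` can be read as a three-way classification of every file in the folder: upload it, skip it, or reject it. The sketch below restates that decision as a pure function so it can be exercised without blob storage. The function name, the returned strings, and the `SUPPORTED_DOC_EXTS` set are assumptions for illustration; the client itself delegates the type check to `is_supported_type_by_file_ext`, whose accepted extensions are not shown in this diff.

```python
import os
from typing import Collection

OCR_RESULT_FILE_SUFFIX = ".result.json"  # assumed value of the client constant
SUPPORTED_DOC_EXTS = {".pdf", ".jpg", ".jpeg", ".png"}  # hypothetical subset


def classify_knowledge_file(filename: str, filenames: Collection[str], skip_analyze: bool) -> str:
    """Return "upload" or "skip" for a file, or raise ValueError if it must be removed."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in SUPPORTED_DOC_EXTS:
        return "upload"
    if filename.endswith(OCR_RESULT_FILE_SUFFIX) and skip_analyze:
        # Strip only the trailing suffix to recover the original file name.
        original = filename[: -len(OCR_RESULT_FILE_SUFFIX)]
        if original not in filenames:
            raise ValueError(f"Result file '{filename}' does not correspond to an original file.")
        if os.path.splitext(original)[1].lower() not in SUPPORTED_DOC_EXTS:
            raise ValueError(f"The original file of '{filename}' is not a supported document type.")
        # The result file is uploaded together with its original document.
        return "skip"
    raise ValueError(f"File '{filename}' is not a supported document type.")
```

With `skip_analyze=False`, a stray `.result.json` file falls through to the final `ValueError`, which matches the `elif ... and skip_analyze` condition in the hunk above.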