Amazon Comprehend Update: Amazon Comprehend now lets you use PDF and Word documents for custom entity recognition training and inference. With PDF and Word formats, you can extract information from documents containing headers, lists, and tables.
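
As a rough illustration of the shapes this commit adds, the sketch below builds a `DocumentReaderConfig` as a plain Python dict and checks it against the enum values and list bounds from `service-2.json`. It does not call the SDK. The helper names (`build_document_reader_config`, `is_valid_entity_recognizer_arn`), the bucket name, and the account ID are hypothetical, and treating the ARN pattern as a whole-string match is an assumption, since the model does not anchor it.

```python
import re

# Enum values copied from the shapes added in this commit's service-2.json.
DOCUMENT_READ_ACTIONS = {"TEXTRACT_DETECT_DOCUMENT_TEXT", "TEXTRACT_ANALYZE_DOCUMENT"}
DOCUMENT_READ_MODES = {"SERVICE_DEFAULT", "FORCE_DOCUMENT_READ_ACTION"}
FEATURE_TYPES = {"TABLES", "FORMS"}

# EntityRecognizerArn pattern from this commit, with the optional /version/ suffix.
ENTITY_RECOGNIZER_ARN = re.compile(
    r"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:[0-9]{12}:entity-recognizer/"
    r"[a-zA-Z0-9](-*[a-zA-Z0-9])*(/version/[a-zA-Z0-9](-*[a-zA-Z0-9])*)?"
)


def build_document_reader_config(read_action, read_mode=None, feature_types=None):
    """Hypothetical helper: return a DocumentReaderConfig dict after basic checks."""
    if read_action not in DOCUMENT_READ_ACTIONS:
        raise ValueError(f"unknown DocumentReadAction: {read_action!r}")
    config = {"DocumentReadAction": read_action}  # the only required member
    if read_mode is not None:
        if read_mode not in DOCUMENT_READ_MODES:
            raise ValueError(f"unknown DocumentReadMode: {read_mode!r}")
        config["DocumentReadMode"] = read_mode
    if feature_types is not None:
        # ListOfDocumentReadFeatureTypes is bounded: min 1, max 2 entries.
        if not 1 <= len(feature_types) <= 2:
            raise ValueError("FeatureTypes must contain 1 or 2 entries")
        unknown = set(feature_types) - FEATURE_TYPES
        if unknown:
            raise ValueError(f"unknown FeatureTypes: {sorted(unknown)}")
        config["FeatureTypes"] = list(feature_types)
    return config


def is_valid_entity_recognizer_arn(arn):
    """Hypothetical helper: assumes the pattern must match the whole string (max 256 chars)."""
    return len(arn) <= 256 and ENTITY_RECOGNIZER_ARN.fullmatch(arn) is not None


# Example InputDataConfig for a custom entity detection job over PDFs.
input_data_config = {
    "S3Uri": "s3://example-bucket/input/",  # hypothetical bucket
    "InputFormat": "ONE_DOC_PER_FILE",
    "DocumentReaderConfig": build_document_reader_config(
        "TEXTRACT_ANALYZE_DOCUMENT",
        read_mode="FORCE_DOCUMENT_READ_ACTION",
        feature_types=["TABLES", "FORMS"],
    ),
}
```

Validating on the client side like this only mirrors the constraints visible in the model; the service remains the authority on what it accepts.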

AWS · AWS · commit e3ef4dc27d8f · 2021-09-14T18:09:27.000Z
diff --git a/.changes/next-release/feature-AmazonComprehend-589f314.json b/.changes/next-release/feature-AmazonComprehend-589f314.json
@@ -0,0 +1,6 @@
+{
+    "type": "feature",
+    "category": "Amazon Comprehend",
+    "contributor": "",
+    "description": "Amazon Comprehend now lets you use PDF and Word documents for custom entity recognition training and inference. With PDF and Word formats, you can extract information from documents containing headers, lists, and tables."
+}
diff --git a/services/comprehend/src/main/resources/codegen-resources/service-2.json b/services/comprehend/src/main/resources/codegen-resources/service-2.json
@@ -1032,6 +1032,13 @@
       "min":1,
       "pattern":"^[a-zA-Z0-9](-*[a-zA-Z0-9])*"
     },
+    "AugmentedManifestsDocumentTypeFormat":{
+      "type":"string",
+      "enum":[
+        "PLAIN_TEXT_DOCUMENT",
+        "SEMI_STRUCTURED_DOCUMENT"
+      ]
+    },
     "AugmentedManifestsListItem":{
       "type":"structure",
       "required":[
@@ -1046,6 +1053,18 @@
         "AttributeNames":{
           "shape":"AttributeNamesList",
           "documentation":"<p>The JSON attribute that contains the annotations for your training documents. The number of attribute names that you specify depends on whether your augmented manifest file is the output of a single labeling job or a chained labeling job.</p> <p>If your file is the output of a single labeling job, specify the LabelAttributeName key that was used when the job was created in Ground Truth.</p> <p>If your file is the output of a chained labeling job, specify the LabelAttributeName key for one or more jobs in the chain. Each LabelAttributeName key provides the annotations from an individual job.</p>"
+        },
+        "AnnotationDataS3Uri":{
+          "shape":"S3Uri",
+          "documentation":"<p>The S3 prefix to the annotation files that are referred to in the augmented manifest file.</p>"
+        },
+        "SourceDocumentsS3Uri":{
+          "shape":"S3Uri",
+          "documentation":"<p>The S3 prefix to the source files (PDFs) that are referred to in the augmented manifest file.</p>"
+        },
+        "DocumentType":{
+          "shape":"AugmentedManifestsDocumentTypeFormat",
+          "documentation":"<p>The type of augmented manifest: <code>PLAIN_TEXT_DOCUMENT</code> or <code>SEMI_STRUCTURED_DOCUMENT</code>. If you don't specify a type, the default is <code>PLAIN_TEXT_DOCUMENT</code>.</p> <ul> <li> <p> <code>PLAIN_TEXT_DOCUMENT</code> A document type that represents any Unicode text that is encoded in UTF-8.</p> </li> <li> <p> <code>SEMI_STRUCTURED_DOCUMENT</code> A document type with positional and structural context, like a PDF. For training with Amazon Comprehend, only PDFs are supported. For inference, Amazon Comprehend supports PDF, DOCX, and TXT files.</p> </li> </ul>"
         }
       },
       "documentation":"<p>An augmented manifest file that provides training data for your custom model. An augmented manifest file is a labeled dataset that is produced by Amazon SageMaker Ground Truth.</p>"
@@ -2166,7 +2185,7 @@
     "DocumentClassifierArn":{
       "type":"string",
       "max":256,
-      "pattern":"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:[0-9]{12}:document-classifier/[a-zA-Z0-9](-*[a-zA-Z0-9])*"
+      "pattern":"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:[0-9]{12}:document-classifier/[a-zA-Z0-9](-*[a-zA-Z0-9])*(/version/[a-zA-Z0-9](-*[a-zA-Z0-9])*)?"
     },
     "DocumentClassifierAugmentedManifestsList":{
       "type":"list",
@@ -2333,6 +2352,47 @@
       },
       "documentation":"<p>Specifies one of the label or labels that categorize the document being analyzed.</p>"
     },
+    "DocumentReadAction":{
+      "type":"string",
+      "enum":[
+        "TEXTRACT_DETECT_DOCUMENT_TEXT",
+        "TEXTRACT_ANALYZE_DOCUMENT"
+      ]
+    },
+    "DocumentReadFeatureTypes":{
+      "type":"string",
+      "documentation":"<p>A list of the types of analyses to perform. This field specifies what feature types need to be extracted from the document where entity recognition is expected.</p> <ul> <li> <p> <code>TABLES</code> - Add TABLES to the list to return information about the tables that are detected in the input document. </p> </li> <li> <p> <code>FORMS</code> - Add FORMS to return detected form data. </p> </li> </ul>",
+      "enum":[
+        "TABLES",
+        "FORMS"
+      ]
+    },
+    "DocumentReadMode":{
+      "type":"string",
+      "enum":[
+        "SERVICE_DEFAULT",
+        "FORCE_DOCUMENT_READ_ACTION"
+      ]
+    },
+    "DocumentReaderConfig":{
+      "type":"structure",
+      "required":["DocumentReadAction"],
+      "members":{
+        "DocumentReadAction":{
+          "shape":"DocumentReadAction",
+          "documentation":"<p>This enum field currently has two values, both of which apply to PDFs:</p> <ul> <li> <p> <code>TEXTRACT_DETECT_DOCUMENT_TEXT</code> - The service calls DetectDocumentText for each page of a PDF document.</p> </li> <li> <p> <code>TEXTRACT_ANALYZE_DOCUMENT</code> - The service calls AnalyzeDocument for each page of a PDF document.</p> </li> </ul>"
+        },
+        "DocumentReadMode":{
+          "shape":"DocumentReadMode",
+          "documentation":"<p>This enum field provides two values:</p> <ul> <li> <p> <code>SERVICE_DEFAULT</code> - Use the service defaults for document reading. For digital PDFs, this means using an internal parser instead of the Textract APIs.</p> </li> <li> <p> <code>FORCE_DOCUMENT_READ_ACTION</code> - Always use the action specified in DocumentReadAction, including for digital PDFs.</p> </li> </ul>"
+        },
+        "FeatureTypes":{
+          "shape":"ListOfDocumentReadFeatureTypes",
+          "documentation":"<p>Specifies the types of analyses to perform on the input document, such as <code>TABLES</code> or <code>FORMS</code>.</p>"
+        }
+      },
+      "documentation":"<p>Specifies how inference documents are read. Currently it applies to PDF documents in StartEntitiesDetectionJob custom inference.</p>"
+    },
     "DominantLanguage":{
       "type":"structure",
       "members":{
@@ -2648,7 +2708,7 @@
     "EntityRecognizerArn":{
       "type":"string",
       "max":256,
-      "pattern":"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:[0-9]{12}:entity-recognizer/[a-zA-Z0-9](-*[a-zA-Z0-9])*"
+      "pattern":"arn:aws(-[^:]+)?:comprehend:[a-zA-Z0-9-]*:[0-9]{12}:entity-recognizer/[a-zA-Z0-9](-*[a-zA-Z0-9])*(/version/[a-zA-Z0-9](-*[a-zA-Z0-9])*)?"
     },
     "EntityRecognizerAugmentedManifestsList":{
       "type":"list",
@@ -3026,9 +3086,13 @@
         "InputFormat":{
           "shape":"InputFormat",
           "documentation":"<p>Specifies how the text in an input file should be processed:</p> <ul> <li> <p> <code>ONE_DOC_PER_FILE</code> - Each file is considered a separate document. Use this option when you are processing large documents, such as newspaper articles or scientific papers.</p> </li> <li> <p> <code>ONE_DOC_PER_LINE</code> - Each line in a file is considered a separate document. Use this option when you are processing many short documents, such as text messages.</p> </li> </ul>"
+        },
+        "DocumentReaderConfig":{
+          "shape":"DocumentReaderConfig",
+          "documentation":"<p>The document reader config field applies only to the InputDataConfig of StartEntitiesDetectionJob.</p> <p>Use DocumentReaderConfig to specify how you want your inference documents read. Currently it applies to PDF documents in StartEntitiesDetectionJob custom inference.</p>"
         }
       },
-      "documentation":"<p>The input properties for a topic detection job.</p>"
+      "documentation":"<p>The input properties for an inference job.</p>"
     },
     "InputFormat":{
       "type":"string",
@@ -3202,7 +3266,8 @@
     },
     "KmsKeyId":{
       "type":"string",
-      "max":2048
+      "max":2048,
+      "pattern":".*"
     },
     "KmsKeyValidationException":{
       "type":"structure",
@@ -3499,6 +3564,12 @@
       "type":"list",
       "member":{"shape":"BatchDetectSyntaxItemResult"}
     },
+    "ListOfDocumentReadFeatureTypes":{
+      "type":"list",
+      "member":{"shape":"DocumentReadFeatureTypes"},
+      "max":2,
+      "min":1
+    },
     "ListOfDominantLanguages":{
       "type":"list",
       "member":{"shape":"DominantLanguage"}