
Commit 09e5968

Author: Bob Strahan
Merge branch 'develop' into feature/s3-vectorstore
2 parents: c9c2202 + 118a6df

9 files changed (+636, -1 lines)


CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -14,10 +14,18 @@ SPDX-License-Identifier: MIT-0
 - Custom resource Lambda implementation for S3 vector bucket and index management (using boto3 s3vectors client) with proper IAM permissions and resource cleanup
 - Unified Knowledge Base interface supporting both vector store types with automatic resource provisioning based on user selection

+- **CloudFormation Service Role for Delegated Deployment Access**
+  - Added an example CloudFormation service role template that enables non-administrator users to deploy and maintain IDP stacks without requiring ongoing administrator permissions
+  - Administrators can provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
+  - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
+
 ### Fixed
 - Fixed an issue where CloudFront policy statements were still appearing in generated GovCloud templates despite CloudFront resources being removed
 - Fixed duplicate Glue tables being created when using a document class that contains a dash (-). Resolved by replacing the dash in section types with an underscore when creating the table, to align with the table name generated later by the Glue crawler - resolves #57
 - Fixed occasional UI error 'Failed to get document details - please try again later' - resolves #58
+- Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
+- Added a security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments to prevent logging of sensitive information including PII data, document contents, and S3 presigned URLs

 ## [0.3.15]
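
As context for the S3 vector store entries above, a custom resource handler built on the boto3 `s3vectors` client might look roughly like the sketch below; the property names, index settings, and omitted response signaling are assumptions for illustration, not the accelerator's actual implementation:

```python
# Hypothetical sketch of a CloudFormation custom resource handler that manages an
# S3 vector bucket and index via the boto3 s3vectors client. Property names and
# index settings are assumptions; response signaling back to CloudFormation
# (e.g., cfnresponse.send) is omitted for brevity.
import boto3

s3vectors = boto3.client("s3vectors")

def handler(event, context):
    props = event["ResourceProperties"]
    bucket = props["VectorBucketName"]   # assumed property name
    index = props["IndexName"]           # assumed property name

    if event["RequestType"] == "Create":
        s3vectors.create_vector_bucket(vectorBucketName=bucket)
        s3vectors.create_index(
            vectorBucketName=bucket,
            indexName=index,
            dataType="float32",          # assumed defaults for illustration
            dimension=1024,
            distanceMetric="cosine",
        )
    elif event["RequestType"] == "Delete":
        # Clean up in reverse order so the bucket can be removed.
        s3vectors.delete_index(vectorBucketName=bucket, indexName=index)
        s3vectors.delete_vector_bucket(vectorBucketName=bucket)

    return {"PhysicalResourceId": f"{bucket}/{index}"}
```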

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.16-wip1
+0.3.16-wip3

docs/deployment.md

Lines changed: 9 additions & 0 deletions
@@ -7,6 +7,15 @@ This guide covers how to deploy, build, publish, and test the GenAI Intelligent
 ## Deployment Options

+### Administrator Access Requirements
+
+**Important**: Deploying the GenAI IDP Accelerator requires administrator access to your AWS account. However, for organizations that want to enable non-administrator users to deploy and manage IDP stacks, we provide an optional CloudFormation service role approach:
+
+- **For Administrators**: Use the deployment options below with your existing administrator privileges
+- **For Delegated Access**: See [iam-roles/cloudformation-management/README.md](../iam-roles/cloudformation-management/README.md) for instructions on provisioning a CloudFormation service role that allows non-administrator users to deploy and maintain IDP stacks without requiring administrator permissions
+
+### One-Click Deployment
+
 | US East (N.Virginia) | us-east-1 | [![Launch Stack](https://cdn.rawgit.com/buildkite/cloudformation-launch-stack-button-svg/master/launch-stack.svg)](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main.yaml&stackName=IDP) |

 3. Review the template parameters and provide values as needed
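
To make the delegated-access option above concrete, a user who has been granted `iam:PassRole` on the provisioned service role can launch the stack by passing the role ARN to CloudFormation, so the service role (rather than the caller) performs the privileged resource operations. A minimal sketch, with the account ID as a placeholder and the published template URL from the table above:

```python
# Minimal sketch: launch the IDP stack using the CloudFormation service role.
# The account ID and stack name are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="IDP",
    TemplateURL=(
        "https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/"
        "artifacts/genai-idp/idp-main.yaml"
    ),
    # CloudFormation assumes this role for all resource operations, so the
    # calling user only needs CloudFormation access plus iam:PassRole on it.
    RoleARN="arn:aws:iam::123456789012:role/IDPAcceleratorCloudFormationServiceRole",
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)
```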

docs/languages.md

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Language Support

When implementing Intelligent Document Processing solutions, language support is a crucial factor to consider. The approach you take depends on whether the language of your documents is supported by the components leveraged in the workflow, such as Amazon Bedrock Data Automation (BDA) or LLMs.

## Decision Process

Below is the decision tree illustrating the suggested decision process:

```mermaid
flowchart TD
    Start[Documents] --> Q1{Language supported by<br/>Bedrock Data Automation - BDA?}

    Q1 -->|Yes| BDACheck{Document quality<br/>and structure<br/>suitable for BDA?}
    Q1 -->|No| Pattern2Direct[Pattern 2<br/>Bedrock FMs]

    BDACheck -->|Yes| Pattern1[Pattern 1<br/>Bedrock Data Automation - BDA]
    BDACheck -->|No| Pattern2Alt1[Pattern 2<br/>Bedrock FMs]

    Pattern1 --> Accuracy1{Accuracy meets<br/>requirements?}
    Pattern2Direct --> Accuracy2{Accuracy meets<br/>requirements?}
    Pattern2Alt1 --> Accuracy3{Accuracy meets<br/>requirements?}

    Accuracy1 -->|No| Pattern2Fallback[Pattern 2<br/>Bedrock FMs]
    Accuracy1 -->|Yes| Deploy1[Deploy]

    Accuracy2 -->|No| OptimizePath2{Issue source:<br/>Classification or Extraction?}
    Accuracy2 -->|Yes| Deploy2[Deploy]

    Accuracy3 -->|No| OptimizePath3{Issue source:<br/>Classification or Extraction?}
    Accuracy3 -->|Yes| Deploy3[Deploy]

    OptimizePath2 -->|Classification| Pattern3A[Pattern 3<br/>UDOP model for classification]
    OptimizePath2 -->|Extraction| FineTuning2[Pattern 2<br/>And model fine-tuning]

    OptimizePath3 -->|Classification| Pattern3B[Pattern 3<br/>UDOP model for classification]
    OptimizePath3 -->|Extraction| FineTuning3[Pattern 2<br/>And model fine-tuning]

    Pattern2Fallback --> Accuracy4{Accuracy meets<br/>requirements?}
    Accuracy4 -->|Yes| Deploy4[Deploy]
    Accuracy4 -->|No| OptimizePath4{Issue source:<br/>Classification or Extraction?}

    OptimizePath4 -->|Classification| Pattern3C[Pattern 3<br/>UDOP model for classification]
    OptimizePath4 -->|Extraction| FineTuning4[Pattern 2<br/>And model fine-tuning]
```
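
For teams that want to codify this routing (for example, in a document intake script), the decision tree can be expressed roughly as the following Python sketch; it is illustrative only and not part of the accelerator:

```python
# Illustrative encoding of the decision tree above; not part of the accelerator code.
BDA_LANGUAGES = {"english", "portuguese", "french", "italian", "spanish", "german"}

def starting_pattern(language: str, suitable_for_bda: bool) -> str:
    """Pick the pattern to prototype first."""
    if language.lower() in BDA_LANGUAGES and suitable_for_bda:
        return "Pattern 1 (Bedrock Data Automation)"
    return "Pattern 2 (Bedrock FMs)"

def fallback(current_pattern: str, issue: str) -> str:
    """Route after an accuracy miss; issue is 'classification' or 'extraction'."""
    if current_pattern.startswith("Pattern 1"):
        return "Pattern 2 (Bedrock FMs)"            # BDA accuracy insufficient
    if issue == "classification":
        return "Pattern 3 (UDOP model for classification)"
    return "Pattern 2 with model fine-tuning"       # extraction accuracy issue
```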

## Pattern 1

> Pattern 1: Packet or Media processing with Bedrock Data Automation (BDA)

First, verify whether your documents' language is supported by Amazon Bedrock Data Automation (BDA). If it is, begin with Pattern 1 (BDA).

At the time of writing (Sep 19, 2025) BDA supports the following languages:

- English
- Portuguese
- French
- Italian
- Spanish
- German

> Important Note: BDA currently does not support vertical text orientation (commonly found in Japanese and Chinese documents). For the most up-to-date information, please consult the [BDA documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-limits.html).

If BDA's accuracy doesn't meet your requirements for your specific scenario or language, proceed to Pattern 2.

## Pattern 2

> Pattern 2: OCR → Bedrock Classification (page-level or holistic) → Bedrock Extraction

For this pattern, follow this structured implementation approach:

```mermaid
flowchart TD
    Start[Pattern 2] --> Q1{Is full OCR transcription required<br/>for your use case?}

    Q1 -->|Yes| RequiredOCR[Step 1A: Select required OCR backend]
    Q1 -->|No| OptionalOCR[Step 1B: Optional OCR path]

    RequiredOCR --> Q2{Document language<br/>supported by Textract?}
    Q2 -->|Yes| TextractReq[Use Textract backend]
    Q2 -->|No| BedrockReq[Use Bedrock backend]

    OptionalOCR --> Q3{Consider OCR for<br/>potential accuracy boost?}
    Q3 -->|Yes| Q4{Document language<br/>supported by Textract?}
    Q3 -->|No| NoOCR[Disable OCR backend]

    Q4 -->|Yes| TextractOpt[Use Textract backend]
    Q4 -->|No| BedrockOpt[Use Bedrock backend]

    TextractReq --> ClassStep[Step 2: Classification and Extraction Models]
    BedrockReq --> ClassStep
    TextractOpt --> ClassStep
    BedrockOpt --> ClassStep
    NoOCR --> ClassStep

    ClassStep --> Q6{Document language:<br/>high-resource?}

    Q6 -->|Yes| StandardApproach[Select and test any model]
    Q6 -->|No| EnhancedApproach[Test multiple models<br/>Extend testing to 50+ docs]

    StandardApproach --> Q7{Classification and Extraction<br/>accuracy meet requirements?}
    EnhancedApproach --> Q7

    Q7 -->|Yes| AssessStep[Step 3: Assessment Strategy]
    Q7 -->|No| Optimize[Consider fine-tuning]

    Optimize --> AssessStep
    AssessStep --> Deploy[Deploy]
```

While comprehensive model selection guidance for different languages could constitute an entire documentation suite, understanding the fundamental challenges is essential for production deployments. Modern language models present a significant transparency gap: providers rarely publish detailed statements about language-specific performance characteristics or the distribution of training data across their model portfolios.

### The High-Resource vs Low-Resource Language Divide

The concept of language resources refers to the availability of training data, linguistic tools, and computational research investment for a given language. This divide creates a performance gap that persists across virtually all foundation models, regardless of their stated multilingual capabilities.

**High-resource languages** such as English, Mandarin Chinese, Spanish, French, and German typically benefit from extensive training data representation, resulting in more reliable extraction accuracy, better understanding of domain-specific terminology, and stronger performance on complex document structures.

**Low-resource languages** encompass a broad spectrum of languages with limited digital representation in training corpora. These languages require significantly more extensive testing and validation to achieve production-ready accuracy levels. The performance degradation can manifest in several ways: reduced accuracy in named entity recognition, challenges with domain-specific terminology, difficulty processing complex document layouts, and inconsistent handling of linguistic nuances such as morphological complexity or non-Latin scripts.

### Practical Implementation Approach

The absence of public performance statements from model providers necessitates an empirical approach to model selection. For high-resource languages, initial testing with 50-100 representative documents typically provides sufficient confidence in model performance. However, low-resource languages require substantially more comprehensive validation, often demanding 5-10 times the testing volume to achieve comparable confidence levels.

When working with low-resource languages, consider implementing a cascade approach where multiple models are evaluated in parallel during the pilot phase. This strategy helps identify which foundation models demonstrate the most consistent performance for your specific document types and linguistic characteristics. Additionally, establishing clear performance thresholds early in the process prevents costly iteration cycles later in deployment.
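
As an illustration of that cascade evaluation — the model IDs, prompt, and exact-match scoring below are assumptions, not accelerator defaults — a pilot harness built on the Bedrock Converse API might look like:

```python
# Sketch: evaluate several candidate models on the same labeled sample set.
# Model IDs, the prompt, and the scoring rule are placeholders for illustration.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

CANDIDATE_MODELS = [
    "us.anthropic.claude-3-5-sonnet-20241022-v2:0",   # example IDs only
    "us.amazon.nova-pro-v1:0",
]

def extract_fields(model_id: str, document_text: str) -> dict:
    """Ask one model to return the target fields as JSON."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": f"Extract invoice_number and total as JSON only:\n{document_text}"}],
        }],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

def accuracy(model_id: str, samples: list[tuple[str, dict]]) -> float:
    """Fraction of documents where every expected field matches exactly."""
    hits = 0
    for text, expected in samples:
        predicted = extract_fields(model_id, text)
        hits += all(predicted.get(k) == v for k, v in expected.items())
    return hits / len(samples)

# samples = [(document_text, {"invoice_number": "...", "total": "..."}), ...]
# scores = {m: accuracy(m, samples) for m in CANDIDATE_MODELS}
```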

### OCR Backend Considerations for Language Support

The choice of OCR backend significantly impacts performance for different languages, particularly when working with low-resource languages or specialized document types. The IDP Accelerator supports three distinct OCR approaches, each with specific language capabilities and use cases.

#### Textract Backend Language Limitations

Amazon Textract provides robust OCR capabilities with confidence scoring, but has explicit language constraints that must be considered during backend selection. Textract can detect printed text and handwriting from the Standard English alphabet and ASCII symbols. At the time of writing (Sep 19, 2025) Textract supports English, German, French, Spanish, Italian, and Portuguese.

For languages outside this supported set, Textract's accuracy degrades significantly, making it unsuitable for production workloads.

#### Bedrock Backend for Low-Resource Languages

When working with languages not supported by Textract, the Bedrock OCR backend offers a compelling alternative using foundation models for text extraction. This approach leverages the multilingual capabilities of models like Claude and Nova, which can process text in hundreds of languages with varying degrees of accuracy.

The Bedrock backend demonstrates particular value when the extracted text will be included alongside document images in subsequent classification and extraction prompts. This multi-turn approach often compensates for OCR inaccuracies by allowing the downstream models to cross-reference the text transcription against the visual content.
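
A minimal sketch of this backend routing, assuming per-document language codes and an example Bedrock model ID (neither taken from the accelerator's configuration):

```python
# Illustrative OCR routing: Textract for its supported languages, otherwise a
# multimodal Bedrock model. Language codes and the model ID are assumptions.
import boto3

TEXTRACT_LANGUAGES = {"en", "de", "fr", "es", "it", "pt"}

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def ocr_page(image_bytes: bytes, language_code: str) -> str:
    if language_code in TEXTRACT_LANGUAGES:
        result = textract.detect_document_text(Document={"Bytes": image_bytes})
        lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        return "\n".join(lines)

    # Fall back to a multimodal foundation model for low-resource languages.
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # example model ID only
        messages=[{
            "role": "user",
            "content": [
                {"text": "Transcribe all text on this page. Return plain text only."},
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```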

#### Strategic OCR Disabling

In scenarios where full text transcription provides minimal value to downstream processing, disabling OCR entirely can improve cost efficiency. This approach works particularly well when document images contain sufficient visual information for direct image-only processing, or when the document structure is highly standardized and predictable.

The decision to disable OCR should be based on empirical testing with representative document samples. If classification and extraction accuracy remains acceptable using only document images, eliminating OCR processing can significantly reduce both latency and operational costs.

### Model Families Mixing

Using different model families for OCR versus classification and extraction can yield significant performance improvements, particularly for challenging language scenarios. For example, a deployment might use Claude for OCR text extraction while employing Nova models for subsequent classification and extraction tasks, optimizing for each model's particular strengths.

This approach allows teams to leverage the best multilingual OCR capabilities for text transcription while utilizing different models optimized for reasoning and structured data extraction. The key consideration is ensuring that the combined approach maintains acceptable accuracy while managing the complexity of multi-model workflows.

Other considerations:

- For documents with poor quality (e.g., handwritten text), consider the Bedrock OCR backend as an alternative to Textract
- If accuracy requirements aren't met, explore model fine-tuning options
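
To illustrate the model-family mixing described above (model IDs are examples only, not the accelerator's defaults), a two-stage call can pass one model's transcription, together with the page image, to a different model for extraction:

```python
# Sketch: one model family for OCR, another for extraction; IDs are examples only.
import boto3

bedrock = boto3.client("bedrock-runtime")
OCR_MODEL = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"   # example: multilingual OCR
EXTRACTION_MODEL = "us.amazon.nova-pro-v1:0"                 # example: structured extraction

def converse_text(model_id: str, content: list) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": content}],
    )
    return response["output"]["message"]["content"][0]["text"]

def extract(image_bytes: bytes) -> str:
    image = {"image": {"format": "png", "source": {"bytes": image_bytes}}}
    transcript = converse_text(OCR_MODEL, [
        {"text": "Transcribe all text on this page verbatim."}, image,
    ])
    # Give the extraction model both the transcript and the image so it can
    # cross-check the OCR output against the visual layout.
    return converse_text(EXTRACTION_MODEL, [
        {"text": f"Using this transcript:\n{transcript}\nExtract the key fields as JSON."},
        image,
    ])
```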

## Pattern 3

> Pattern 3: OCR → UDOP Classification (SageMaker) → Bedrock Extraction

If Bedrock-based classification doesn't meet your requirements, implement Pattern 3 using Universal Document Processing (UDOP) classification.

docs/well-architected.md

Lines changed: 7 additions & 0 deletions
@@ -40,6 +40,13 @@ The GenAI Intelligent Document Processing (GenAIIDP) Accelerator demonstrates st
 ### Recommendations

+- **Production Logging Security**:
+  - **Set the `LogLevel` parameter to WARN or ERROR (not INFO) for production deployments** to prevent sensitive information from being logged
+  - The `LogLevel` parameter in template.yaml automatically configures logging levels across all Lambda functions, AppSync APIs, and other components
+  - INFO level logging can inadvertently capture sensitive document contents, PII data (SSN, addresses, names), and S3 presigned URLs
+  - For production environments, use `LogLevel: WARN` or `LogLevel: ERROR` in your CloudFormation deployment parameters
+  - Implement log filtering and masking for any essential INFO-level logs that must be retained
+  - Regularly audit CloudWatch log groups to ensure no sensitive information is being captured
 - **CloudFront Security Enhancement**:
   - Create a custom domain with a custom ACM certificate for the CloudFront distribution
   - Enforce TLS 1.2 or greater protocol in the CloudFront security policy
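
As a hedged illustration of applying the logging recommendation to an already-deployed stack (the stack name is a placeholder, and other parameters would normally be carried over with `UsePreviousValue`):

```python
# Sketch: switch an existing stack to WARN-level logging without changing the
# template. Stack name is a placeholder; list the stack's other parameters with
# UsePreviousValue=True so they keep their current values.
import boto3

cfn = boto3.client("cloudformation")

cfn.update_stack(
    StackName="IDP",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "LogLevel", "ParameterValue": "WARN"},
        # {"ParameterKey": "<OtherParam>", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)
```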
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This template creates a CloudFormation Service Role for the IDP Accelerator solution.
  This role grants permissions to create, update, and delete IDP CloudFormation
  stacks and their resources. It follows the principle of least privilege
  by allowing only the necessary actions for stack management. This template also
  creates a user permission policy that allows users to pass the CloudFormation
  service role to CloudFormation. The iam:PassRole policy must be attached to
  the user or role that will be using the CloudFormation Service Role in order
  to successfully pass the role.

Resources:
  CloudFormationServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: IDPAcceleratorCloudFormationServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub 'cloudformation.${AWS::URLSuffix}'
            Action: sts:AssumeRole
      Policies:
        - PolicyName: CloudFormationPermissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:*
                Resource: '*'
              - Effect: Allow
                Action:
                  - iam:CreateRole
                  - iam:DeleteRole
                  - iam:UpdateRole
                  - iam:GetRole
                  - iam:ListRoles
                  - iam:CreatePolicy
                  - iam:DeletePolicy
                  - iam:GetPolicy
                  - iam:ListPolicies
                  - iam:AttachRolePolicy
                  - iam:DetachRolePolicy
                  - iam:PutRolePolicy
                  - iam:DeleteRolePolicy
                  - iam:GetRolePolicy
                  - iam:ListRolePolicies
                  - iam:ListAttachedRolePolicies
                  - iam:CreateServiceLinkedRole
                  - iam:DeleteServiceLinkedRole
                  - iam:TagRole
                  - iam:UntagRole
                  - iam:ListRoleTags
                  - iam:PassRole
                Resource: '*'
        - PolicyName: IDPAcceleratorPermissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - lambda:*
                  - kms:*
                  - logs:*
                  - cloudwatch:*
                  - events:*
                  - s3:*
                  - dynamodb:*
                  - bedrock:*
                  - textract:*
                  - sagemaker:*
                  - states:*
                  - apigateway:*
                  - appsync:*
                  - cognito-idp:*
                  - cognito-identity:*
                  - glue:*
                  - aoss:*
                  - cloudfront:*
                  - wafv2:*
                  - sns:*
                  - sqs:*
                  - ssm:*
                  - secretsmanager:*
                  - codebuild:*
                  - application-autoscaling:*
                  - scheduler:*
                  - ec2:CreateVpc
                  - ec2:DeleteVpc
                  - ec2:DescribeVpcs
                  - ec2:CreateSubnet
                  - ec2:DeleteSubnet
                  - ec2:DescribeSubnets
                  - ec2:CreateSecurityGroup
                  - ec2:DeleteSecurityGroup
                  - ec2:DescribeSecurityGroups
                  - ec2:AuthorizeSecurityGroupIngress
                  - ec2:AuthorizeSecurityGroupEgress
                  - ec2:RevokeSecurityGroupIngress
                  - ec2:RevokeSecurityGroupEgress
                  - ec2:CreateTags
                  - ec2:DeleteTags
                  - ec2:DescribeTags
                  - ec2:DescribeAvailabilityZones
                Resource: '*'

  PassRolePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: IDP-PassRolePolicy
      Description: Policy to allow passing the IDP CloudFormation service role
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !GetAtt CloudFormationServiceRole.Arn

Outputs:
  ServiceRoleArn:
    Description: ARN of the CloudFormation service role
    Value: !GetAtt CloudFormationServiceRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-ServiceRoleArn'
  PassRolePolicyArn:
    Description: ARN of the PassRole policy for admins to assign to users
    Value: !Ref PassRolePolicy
    Export:
      Name: !Sub '${AWS::StackName}-PassRolePolicyArn'
```
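
Once this stack is deployed, an administrator can grant delegated access by attaching the exported PassRole managed policy to the intended users or groups. A minimal sketch, with the stack name and group name as placeholders:

```python
# Sketch: attach the exported PassRole managed policy to a delegated group.
# Stack name and group name are placeholders; the policy ARN is read from the
# stack's PassRolePolicyArn output rather than hard-coded.
import boto3

cfn = boto3.client("cloudformation")
iam = boto3.client("iam")

outputs = cfn.describe_stacks(StackName="idp-cfn-service-role")["Stacks"][0]["Outputs"]
pass_role_policy_arn = next(
    o["OutputValue"] for o in outputs if o["OutputKey"] == "PassRolePolicyArn"
)

iam.attach_group_policy(
    GroupName="idp-deployers",   # placeholder group of delegated deployers
    PolicyArn=pass_role_policy_arn,
)
```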
