
Commit 09e5968

Author: Bob Strahan
Merge branch 'develop' into feature/s3-vectorstore
2 parents: c9c2202 + 118a6df

9 files changed (+636, -1 lines)


CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -14,10 +14,18 @@ SPDX-License-Identifier: MIT-0
 - Custom resource Lambda implementation for S3 vector bucket and index management (using boto3 s3vectors client) with proper IAM permissions and resource cleanup
 - Unified Knowledge Base interface supporting both vector store types with automatic resource provisioning based on user selection

+- **CloudFormation Service Role for Delegated Deployment Access**
+  - Added an example CloudFormation service role template that enables non-administrator users to deploy and maintain IDP stacks without requiring ongoing administrator permissions
+  - Administrators can provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
+  - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
+
 ### Fixed
 - Fixed an issue where CloudFront policy statements were still appearing in generated GovCloud templates despite CloudFront resources being removed
 - Fixed duplicate Glue tables being created when using a document class that contains a dash (-). Resolved by replacing the dash in section types with an underscore when creating the table, to align with the table name generated later by the Glue crawler - resolves #57
 - Fixed occasional UI error 'Failed to get document details - please try again later' - resolves #58
+- Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
+- Added a security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments to prevent logging of sensitive information including PII data, document contents, and S3 presigned URLs

 ## [0.3.15]
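
As context for the S3 vector store entries above, a custom resource handler built on the boto3 `s3vectors` client might look roughly like the sketch below; the property names, index settings, and omitted response signaling are assumptions for illustration, not the accelerator's actual implementation:

```python
# Hypothetical sketch of a CloudFormation custom resource handler that manages an
# S3 vector bucket and index via the boto3 s3vectors client. Property names and
# index settings are assumptions; response signaling back to CloudFormation
# (e.g., cfnresponse.send) is omitted for brevity.
import boto3

s3vectors = boto3.client("s3vectors")

def handler(event, context):
    props = event["ResourceProperties"]
    bucket = props["VectorBucketName"]   # assumed property name
    index = props["IndexName"]           # assumed property name

    if event["RequestType"] == "Create":
        s3vectors.create_vector_bucket(vectorBucketName=bucket)
        s3vectors.create_index(
            vectorBucketName=bucket,
            indexName=index,
            dataType="float32",          # assumed defaults for illustration
            dimension=1024,
            distanceMetric="cosine",
        )
    elif event["RequestType"] == "Delete":
        # Clean up in reverse order so the bucket can be removed.
        s3vectors.delete_index(vectorBucketName=bucket, indexName=index)
        s3vectors.delete_vector_bucket(vectorBucketName=bucket)

    return {"PhysicalResourceId": f"{bucket}/{index}"}
```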

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.16-wip1
+0.3.16-wip3

docs/deployment.md

Lines changed: 9 additions & 0 deletions
@@ -7,6 +7,15 @@ This guide covers how to deploy, build, publish, and test the GenAI Intelligent
 ## Deployment Options

+### Administrator Access Requirements
+
+**Important**: Deploying the GenAI IDP Accelerator requires administrator access to your AWS account. However, for organizations that want to enable non-administrator users to deploy and manage IDP stacks, we provide an optional CloudFormation service role approach:
+
+- **For Administrators**: Use the deployment options below with your existing administrator privileges
+- **For Delegated Access**: See [iam-roles/cloudformation-management/README.md](../iam-roles/cloudformation-management/README.md) for instructions on provisioning a CloudFormation service role that allows non-administrator users to deploy and maintain IDP stacks without requiring administrator permissions
+
+### One-Click Deployment
+
 | US East (N.Virginia) | us-east-1 | [![Launch Stack](https://cdn.rawgit.com/buildkite/cloudformation-launch-stack-button-svg/master/launch-stack.svg)](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main.yaml&stackName=IDP) |

 3. Review the template parameters and provide values as needed
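
To make the delegated-access option above concrete, a user who has been granted `iam:PassRole` on the provisioned service role can launch the stack by passing the role ARN to CloudFormation, so the service role (rather than the caller) performs the privileged resource operations. A minimal sketch, with the account ID as a placeholder and the published template URL from the table above:

```python
# Minimal sketch: launch the IDP stack using the CloudFormation service role.
# The account ID and stack name are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="IDP",
    TemplateURL=(
        "https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/"
        "artifacts/genai-idp/idp-main.yaml"
    ),
    # CloudFormation assumes this role for all resource operations, so the
    # calling user only needs CloudFormation access plus iam:PassRole on it.
    RoleARN="arn:aws:iam::123456789012:role/IDPAcceleratorCloudFormationServiceRole",
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)
```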

docs/languages.md

Lines changed: 167 additions & 0 deletions
@@ -0,0 +1,167 @@
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Language Support

When implementing Intelligent Document Processing solutions, language support is a crucial factor to consider. The approach you take depends on whether the language of your documents is supported by the components leveraged in the workflow, such as Amazon Bedrock Data Automation (BDA) or LLMs.

## Decision Process

Below is the decision tree illustrating the suggested decision process:

```mermaid
flowchart TD
    Start[Documents] --> Q1{Language supported by<br/>Bedrock Data Automation - BDA?}

    Q1 -->|Yes| BDACheck{Document quality<br/>and structure<br/>suitable for BDA?}
    Q1 -->|No| Pattern2Direct[Pattern 2<br/>Bedrock FMs]

    BDACheck -->|Yes| Pattern1[Pattern 1<br/>Bedrock Data Automation - BDA]
    BDACheck -->|No| Pattern2Alt1[Pattern 2<br/>Bedrock FMs]

    Pattern1 --> Accuracy1{Accuracy meets<br/>requirements?}
    Pattern2Direct --> Accuracy2{Accuracy meets<br/>requirements?}
    Pattern2Alt1 --> Accuracy3{Accuracy meets<br/>requirements?}

    Accuracy1 -->|No| Pattern2Fallback[Pattern 2<br/>Bedrock FMs]
    Accuracy1 -->|Yes| Deploy1[Deploy]

    Accuracy2 -->|No| OptimizePath2{Issue source:<br/>Classification or Extraction?}
    Accuracy2 -->|Yes| Deploy2[Deploy]

    Accuracy3 -->|No| OptimizePath3{Issue source:<br/>Classification or Extraction?}
    Accuracy3 -->|Yes| Deploy3[Deploy]

    OptimizePath2 -->|Classification| Pattern3A[Pattern 3<br/>UDOP model for classification]
    OptimizePath2 -->|Extraction| FineTuning2[Pattern 2<br/>And model fine-tuning]

    OptimizePath3 -->|Classification| Pattern3B[Pattern 3<br/>UDOP model for classification]
    OptimizePath3 -->|Extraction| FineTuning3[Pattern 2<br/>And model fine-tuning]

    Pattern2Fallback --> Accuracy4{Accuracy meets<br/>requirements?}
    Accuracy4 -->|Yes| Deploy4[Deploy]
    Accuracy4 -->|No| OptimizePath4{Issue source:<br/>Classification or Extraction?}

    OptimizePath4 -->|Classification| Pattern3C[Pattern 3<br/>UDOP model for classification]
    OptimizePath4 -->|Extraction| FineTuning4[Pattern 2<br/>And model fine-tuning]
```
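
For teams that want to codify this routing (for example, in a document intake script), the decision tree can be expressed roughly as the following Python sketch; it is illustrative only and not part of the accelerator:

```python
# Illustrative encoding of the decision tree above; not part of the accelerator code.
BDA_LANGUAGES = {"english", "portuguese", "french", "italian", "spanish", "german"}

def starting_pattern(language: str, suitable_for_bda: bool) -> str:
    """Pick the pattern to prototype first."""
    if language.lower() in BDA_LANGUAGES and suitable_for_bda:
        return "Pattern 1 (Bedrock Data Automation)"
    return "Pattern 2 (Bedrock FMs)"

def fallback(current_pattern: str, issue: str) -> str:
    """Route after an accuracy miss; issue is 'classification' or 'extraction'."""
    if current_pattern.startswith("Pattern 1"):
        return "Pattern 2 (Bedrock FMs)"            # BDA accuracy insufficient
    if issue == "classification":
        return "Pattern 3 (UDOP model for classification)"
    return "Pattern 2 with model fine-tuning"       # extraction accuracy issue
```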

## Pattern 1

> Pattern 1: Packet or Media processing with Bedrock Data Automation (BDA)

First, verify whether your documents' language is supported by Amazon Bedrock Data Automation (BDA). If it is, begin with Pattern 1 (BDA).

At the time of writing (Sep 19, 2025) BDA supports the following languages:

- English
- Portuguese
- French
- Italian
- Spanish
- German

> Important Note: BDA currently does not support vertical text orientation (commonly found in Japanese and Chinese documents). For the most up-to-date information, please consult the [BDA documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-limits.html).

If BDA's accuracy doesn't meet your requirements for your specific scenario or language, proceed to Pattern 2.

## Pattern 2

> Pattern 2: OCR → Bedrock Classification (page-level or holistic) → Bedrock Extraction

For this pattern, follow this structured implementation approach:

```mermaid
flowchart TD
    Start[Pattern 2] --> Q1{Is full OCR transcription required<br/>for your use case?}

    Q1 -->|Yes| RequiredOCR[Step 1A: Select required OCR backend]
    Q1 -->|No| OptionalOCR[Step 1B: Optional OCR path]

    RequiredOCR --> Q2{Document language<br/>supported by Textract?}
    Q2 -->|Yes| TextractReq[Use Textract backend]
    Q2 -->|No| BedrockReq[Use Bedrock backend]

    OptionalOCR --> Q3{Consider OCR for<br/>potential accuracy boost?}
    Q3 -->|Yes| Q4{Document language<br/>supported by Textract?}
    Q3 -->|No| NoOCR[Disable OCR backend]

    Q4 -->|Yes| TextractOpt[Use Textract backend]
    Q4 -->|No| BedrockOpt[Use Bedrock backend]

    TextractReq --> ClassStep[Step 2: Classification and Extraction Models]
    BedrockReq --> ClassStep
    TextractOpt --> ClassStep
    BedrockOpt --> ClassStep
    NoOCR --> ClassStep

    ClassStep --> Q6{Document language:<br/>high-resource?}

    Q6 -->|Yes| StandardApproach[Select and test any model]
    Q6 -->|No| EnhancedApproach[Test multiple models<br/>Extend testing to 50+ docs]

    StandardApproach --> Q7{Classification and Extraction<br/>accuracy meet requirements?}
    EnhancedApproach --> Q7

    Q7 -->|Yes| AssessStep[Step 3: Assessment Strategy]
    Q7 -->|No| Optimize[Consider fine-tuning]

    Optimize --> AssessStep
    AssessStep --> Deploy[Deploy]
```

While comprehensive model selection guidance for different languages could constitute an entire documentation suite, understanding the fundamental challenges is essential for production deployments. Modern language models present a significant transparency gap: providers rarely publish detailed statements about language-specific performance characteristics or the distribution of training data across their model portfolios.

### The High-Resource vs Low-Resource Language Divide

The concept of language resources refers to the availability of training data, linguistic tools, and computational research investment for a given language. This divide creates a performance gap that persists across virtually all foundation models, regardless of their stated multilingual capabilities.

**High-resource languages** such as English, Mandarin Chinese, Spanish, French, and German typically benefit from extensive training data representation, resulting in more reliable extraction accuracy, better understanding of domain-specific terminology, and stronger performance on complex document structures.

**Low-resource languages** encompass a broad spectrum of languages with limited digital representation in training corpora. These languages require significantly more extensive testing and validation to achieve production-ready accuracy levels. The performance degradation can manifest in several ways: reduced accuracy in named entity recognition, challenges with domain-specific terminology, difficulty processing complex document layouts, and inconsistent handling of linguistic nuances such as morphological complexity or non-Latin scripts.

### Practical Implementation Approach

The absence of public performance statements from model providers necessitates an empirical approach to model selection. For high-resource languages, initial testing with 50-100 representative documents typically provides sufficient confidence in model performance. However, low-resource languages require substantially more comprehensive validation, often demanding 5-10 times the testing volume to achieve comparable confidence levels.

When working with low-resource languages, consider implementing a cascade approach where multiple models are evaluated in parallel during the pilot phase. This strategy helps identify which foundation models demonstrate the most consistent performance for your specific document types and linguistic characteristics. Additionally, establishing clear performance thresholds early in the process prevents costly iteration cycles later in deployment.
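
As an illustration of that cascade evaluation — the model IDs, prompt, and exact-match scoring below are assumptions, not accelerator defaults — a pilot harness built on the Bedrock Converse API might look like:

```python
# Sketch: evaluate several candidate models on the same labeled sample set.
# Model IDs, the prompt, and the scoring rule are placeholders for illustration.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

CANDIDATE_MODELS = [
    "us.anthropic.claude-3-5-sonnet-20241022-v2:0",   # example IDs only
    "us.amazon.nova-pro-v1:0",
]

def extract_fields(model_id: str, document_text: str) -> dict:
    """Ask one model to return the target fields as JSON."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": f"Extract invoice_number and total as JSON only:\n{document_text}"}],
        }],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

def accuracy(model_id: str, samples: list[tuple[str, dict]]) -> float:
    """Fraction of documents where every expected field matches exactly."""
    hits = 0
    for text, expected in samples:
        predicted = extract_fields(model_id, text)
        hits += all(predicted.get(k) == v for k, v in expected.items())
    return hits / len(samples)

# samples = [(document_text, {"invoice_number": "...", "total": "..."}), ...]
# scores = {m: accuracy(m, samples) for m in CANDIDATE_MODELS}
```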

### OCR Backend Considerations for Language Support

The choice of OCR backend significantly impacts performance for different languages, particularly when working with low-resource languages or specialized document types. The IDP Accelerator supports three distinct OCR approaches, each with specific language capabilities and use cases.

#### Textract Backend Language Limitations

Amazon Textract provides robust OCR capabilities with confidence scoring, but has explicit language constraints that must be considered during backend selection. Textract can detect printed text and handwriting from the Standard English alphabet and ASCII symbols. At the time of writing (Sep 19, 2025) Textract supports English, German, French, Spanish, Italian, and Portuguese.

For languages outside this supported set, Textract's accuracy degrades significantly, making it unsuitable for production workloads.

#### Bedrock Backend for Low-Resource Languages

When working with languages not supported by Textract, the Bedrock OCR backend offers a compelling alternative using foundation models for text extraction. This approach leverages the multilingual capabilities of models like Claude and Nova, which can process text in hundreds of languages with varying degrees of accuracy.

The Bedrock backend demonstrates particular value when the extracted text will be included alongside document images in subsequent classification and extraction prompts. This multi-turn approach often compensates for OCR inaccuracies by allowing the downstream models to cross-reference the text transcription against the visual content.
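
A minimal sketch of this backend routing, assuming per-document language codes and an example Bedrock model ID (neither taken from the accelerator's configuration):

```python
# Illustrative OCR routing: Textract for its supported languages, otherwise a
# multimodal Bedrock model. Language codes and the model ID are assumptions.
import boto3

TEXTRACT_LANGUAGES = {"en", "de", "fr", "es", "it", "pt"}

textract = boto3.client("textract")
bedrock = boto3.client("bedrock-runtime")

def ocr_page(image_bytes: bytes, language_code: str) -> str:
    if language_code in TEXTRACT_LANGUAGES:
        result = textract.detect_document_text(Document={"Bytes": image_bytes})
        lines = [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        return "\n".join(lines)

    # Fall back to a multimodal foundation model for low-resource languages.
    response = bedrock.converse(
        modelId="us.amazon.nova-pro-v1:0",  # example model ID only
        messages=[{
            "role": "user",
            "content": [
                {"text": "Transcribe all text on this page. Return plain text only."},
                {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            ],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```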

#### Strategic OCR Disabling

In scenarios where full text transcription provides minimal value to downstream processing, disabling OCR entirely can improve cost efficiency. This approach works particularly well when document images contain sufficient visual information for direct image-only processing, or when the document structure is highly standardized and predictable.

The decision to disable OCR should be based on empirical testing with representative document samples. If classification and extraction accuracy remains acceptable using only document images, eliminating OCR processing can significantly reduce both latency and operational costs.

### Model Families Mixing

Using different model families for OCR versus classification and extraction can yield significant performance improvements, particularly for challenging language scenarios. For example, a deployment might use Claude for OCR text extraction while employing Nova models for subsequent classification and extraction tasks, optimizing for each model's particular strengths.

This approach allows teams to leverage the best multilingual OCR capabilities for text transcription while utilizing different models optimized for reasoning and structured data extraction. The key consideration is ensuring that the combined approach maintains acceptable accuracy while managing the complexity of multi-model workflows.

Other considerations:

- For documents with poor quality (e.g., handwritten text), consider the Bedrock OCR backend as an alternative to Textract
- If accuracy requirements aren't met, explore model fine-tuning options
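
To illustrate the model-family mixing described above (model IDs are examples only, not the accelerator's defaults), a two-stage call can pass one model's transcription, together with the page image, to a different model for extraction:

```python
# Sketch: one model family for OCR, another for extraction; IDs are examples only.
import boto3

bedrock = boto3.client("bedrock-runtime")
OCR_MODEL = "us.anthropic.claude-3-5-sonnet-20241022-v2:0"   # example: multilingual OCR
EXTRACTION_MODEL = "us.amazon.nova-pro-v1:0"                 # example: structured extraction

def converse_text(model_id: str, content: list) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": content}],
    )
    return response["output"]["message"]["content"][0]["text"]

def extract(image_bytes: bytes) -> str:
    image = {"image": {"format": "png", "source": {"bytes": image_bytes}}}
    transcript = converse_text(OCR_MODEL, [
        {"text": "Transcribe all text on this page verbatim."}, image,
    ])
    # Give the extraction model both the transcript and the image so it can
    # cross-check the OCR output against the visual layout.
    return converse_text(EXTRACTION_MODEL, [
        {"text": f"Using this transcript:\n{transcript}\nExtract the key fields as JSON."},
        image,
    ])
```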

## Pattern 3

> Pattern 3: OCR → UDOP Classification (SageMaker) → Bedrock Extraction

If Bedrock-based classification doesn't meet your requirements, implement Pattern 3 using Universal Document Processing (UDOP) classification.

docs/well-architected.md

Lines changed: 7 additions & 0 deletions
@@ -40,6 +40,13 @@ The GenAI Intelligent Document Processing (GenAIIDP) Accelerator demonstrates st
 ### Recommendations

+- **Production Logging Security**:
+  - **Set the `LogLevel` parameter to WARN or ERROR (not INFO) for production deployments** to prevent sensitive information from being logged
+  - The `LogLevel` parameter in template.yaml automatically configures logging levels across all Lambda functions, AppSync APIs, and other components
+  - INFO level logging can inadvertently capture sensitive document contents, PII data (SSN, addresses, names), and S3 presigned URLs
+  - For production environments, use `LogLevel: WARN` or `LogLevel: ERROR` in your CloudFormation deployment parameters
+  - Implement log filtering and masking for any essential INFO-level logs that must be retained
+  - Regularly audit CloudWatch log groups to ensure no sensitive information is being captured
 - **CloudFront Security Enhancement**:
   - Create a custom domain with a custom ACM certificate for the CloudFront distribution
   - Enforce TLS 1.2 or greater protocol in the CloudFront security policy
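
As a hedged illustration of applying the logging recommendation to an already-deployed stack (the stack name is a placeholder, and other parameters would normally be carried over with `UsePreviousValue`):

```python
# Sketch: switch an existing stack to WARN-level logging without changing the
# template. Stack name is a placeholder; list the stack's other parameters with
# UsePreviousValue=True so they keep their current values.
import boto3

cfn = boto3.client("cloudformation")

cfn.update_stack(
    StackName="IDP",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "LogLevel", "ParameterValue": "WARN"},
        # {"ParameterKey": "<OtherParam>", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND"],
)
```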
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  This template creates a CloudFormation Service Role for the IDP Accelerator solution.
  This role grants permissions to create, update, and delete IDP CloudFormation
  stacks and their resources. It follows the principle of least privilege
  by allowing only the necessary actions for stack management. This template also
  creates a user permission policy that allows users to pass the CloudFormation
  service role to CloudFormation. The iam:PassRole policy must be attached to
  the user or role that will be using the CloudFormation Service Role in order
  to successfully pass the role.

Resources:
  CloudFormationServiceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: IDPAcceleratorCloudFormationServiceRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: !Sub 'cloudformation.${AWS::URLSuffix}'
            Action: sts:AssumeRole
      Policies:
        - PolicyName: CloudFormationPermissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - cloudformation:*
                Resource: '*'
              - Effect: Allow
                Action:
                  - iam:CreateRole
                  - iam:DeleteRole
                  - iam:UpdateRole
                  - iam:GetRole
                  - iam:ListRoles
                  - iam:CreatePolicy
                  - iam:DeletePolicy
                  - iam:GetPolicy
                  - iam:ListPolicies
                  - iam:AttachRolePolicy
                  - iam:DetachRolePolicy
                  - iam:PutRolePolicy
                  - iam:DeleteRolePolicy
                  - iam:GetRolePolicy
                  - iam:ListRolePolicies
                  - iam:ListAttachedRolePolicies
                  - iam:CreateServiceLinkedRole
                  - iam:DeleteServiceLinkedRole
                  - iam:TagRole
                  - iam:UntagRole
                  - iam:ListRoleTags
                  - iam:PassRole
                Resource: '*'
        - PolicyName: IDPAcceleratorPermissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - lambda:*
                  - kms:*
                  - logs:*
                  - cloudwatch:*
                  - events:*
                  - s3:*
                  - dynamodb:*
                  - bedrock:*
                  - textract:*
                  - sagemaker:*
                  - states:*
                  - apigateway:*
                  - appsync:*
                  - cognito-idp:*
                  - cognito-identity:*
                  - glue:*
                  - aoss:*
                  - cloudfront:*
                  - wafv2:*
                  - sns:*
                  - sqs:*
                  - ssm:*
                  - secretsmanager:*
                  - codebuild:*
                  - application-autoscaling:*
                  - scheduler:*
                  - ec2:CreateVpc
                  - ec2:DeleteVpc
                  - ec2:DescribeVpcs
                  - ec2:CreateSubnet
                  - ec2:DeleteSubnet
                  - ec2:DescribeSubnets
                  - ec2:CreateSecurityGroup
                  - ec2:DeleteSecurityGroup
                  - ec2:DescribeSecurityGroups
                  - ec2:AuthorizeSecurityGroupIngress
                  - ec2:AuthorizeSecurityGroupEgress
                  - ec2:RevokeSecurityGroupIngress
                  - ec2:RevokeSecurityGroupEgress
                  - ec2:CreateTags
                  - ec2:DeleteTags
                  - ec2:DescribeTags
                  - ec2:DescribeAvailabilityZones
                Resource: '*'

  PassRolePolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: IDP-PassRolePolicy
      Description: Policy to allow passing the IDP CloudFormation service role
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - iam:PassRole
            Resource: !GetAtt CloudFormationServiceRole.Arn

Outputs:
  ServiceRoleArn:
    Description: ARN of the CloudFormation service role
    Value: !GetAtt CloudFormationServiceRole.Arn
    Export:
      Name: !Sub '${AWS::StackName}-ServiceRoleArn'
  PassRolePolicyArn:
    Description: ARN of the PassRole policy for admins to assign to users
    Value: !Ref PassRolePolicy
    Export:
      Name: !Sub '${AWS::StackName}-PassRolePolicyArn'
```
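
Once this stack is deployed, an administrator can grant delegated access by attaching the exported PassRole managed policy to the intended users or groups. A minimal sketch, with the stack name and group name as placeholders:

```python
# Sketch: attach the exported PassRole managed policy to a delegated group.
# Stack name and group name are placeholders; the policy ARN is read from the
# stack's PassRolePolicyArn output rather than hard-coded.
import boto3

cfn = boto3.client("cloudformation")
iam = boto3.client("iam")

outputs = cfn.describe_stacks(StackName="idp-cfn-service-role")["Stacks"][0]["Outputs"]
pass_role_policy_arn = next(
    o["OutputValue"] for o in outputs if o["OutputKey"] == "PassRolePolicyArn"
)

iam.attach_group_policy(
    GroupName="idp-deployers",   # placeholder group of delegated deployers
    PolicyArn=pass_role_policy_arn,
)
```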
