Commit 9de645a

Merge pull request #61 from aws-solutions-library-samples/feature/s3-vectorstore

Add S3 Vectors Support for Cost-Optimized Bedrock Knowledge Base Storage

2 parents bda74dc + af8c0a3, commit 9de645a
File tree: 18 files changed, +2311 −312 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions

```diff
@@ -5,17 +5,27 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.16]
+
 ### Added
+
+- **S3 Vectors Support for Cost-Optimized Knowledge Base Storage**
+  - Added S3 Vectors as an alternative vector store option to OpenSearch Serverless for the Bedrock Knowledge Base, with lower storage costs
+  - Custom resource Lambda implementation for S3 vector bucket and index management (using the boto3 s3vectors client) with proper IAM permissions and resource cleanup
+  - Unified Knowledge Base interface supporting both vector store types, with automatic resource provisioning based on user selection
+
 - **CloudFormation Service Role for Delegated Deployment Access**
   - Added an example CloudFormation service role template that enables non-administrator users to deploy and maintain IDP stacks without requiring ongoing administrator permissions
   - Administrators can provision the service role once with elevated privileges, then delegate deployment capabilities to developer/DevOps teams
   - Includes comprehensive documentation and cross-referenced deployment guides explaining the security model and setup process
+
 ### Fixed
 - Fixed issue where CloudFront policy statements were still appearing in generated GovCloud templates despite CloudFront resources being removed
 - Fixed duplicate Glue tables being created when using a document class that contains a dash (-); resolved by replacing dashes in section types with underscores when creating the table, to align with the table name generated later by the Glue crawler - resolves #57
 - Fixed occasional UI error 'Failed to get document details - please try again later' - resolves #58
 - Fixed UI zipfile creation to exclude .aws-sam directories and .env files from the deployment package
+- Added security recommendation to set the LogLevel parameter to WARN or ERROR (not INFO) for production deployments, to prevent logging of sensitive information including PII data, document contents, and S3 presigned URLs
 
 ## [0.3.15]
```
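The Glue table fix in the changelog above comes down to normalizing section types the same way the Glue crawler does. A minimal sketch of that idea (the function name is hypothetical, not the solution's actual code):

```python
def normalize_section_type(section_type: str) -> str:
    """Replace dashes with underscores so the table name created up front
    matches the name the Glue crawler generates later. Illustrative helper
    for the fix described in the changelog, not the repository's code."""
    return section_type.replace("-", "_")

# A document class containing a dash no longer yields two divergent table names:
print(normalize_section_type("bank-statement"))  # bank_statement
```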

VERSION

Lines changed: 1 addition & 1 deletion

```diff
@@ -1 +1 @@
-0.3.16-wip1
+0.3.16-wip3
```

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -68,6 +68,7 @@ classes:
   description: List of all transactions in the statement period
   attributeType: list
 classification:
+  maxPagesForClassification: "ALL"
 image:
   target_height: ''
   target_width: ''
```

config_library/pattern-2/lending-package-sample/config.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -914,6 +914,7 @@ classes:
   attributeType: group
 classification:
   classificationMethod: multimodalPageLevelClassification
+  maxPagesForClassification: "ALL"
 image:
   target_height: ''
   target_width: ''
```

config_library/pattern-2/rvl-cdip-package-sample-with-few-shot-examples/config.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -647,6 +647,7 @@ classes:
   imagePath: config_library/pattern-2/few_shot_example_with_multimodal_page_classification/example-images/bank-statement-pages/
 
 classification:
+  maxPagesForClassification: "ALL"
 image:
   target_height: ''
   target_width: ''
```

config_library/pattern-2/rvl-cdip-package-sample/config.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -307,6 +307,7 @@ classes:
 - name: comments
   description: Additional notes or remarks about the document. Look for sections labeled 'notes', 'remarks', or 'comments'.
 classification:
+  maxPagesForClassification: "ALL"
 image:
   target_height: ''
   target_width: ''
```

docs/classification.md

Lines changed: 16 additions & 0 deletions

````diff
@@ -181,6 +181,22 @@ When deciding between Text-Based Holistic Classification and MultiModal Page-Level
 
 ## Customizing Classification in Pattern 2
 
+### Configuration Settings
+
+#### Page Limit Configuration
+
+Control how many pages are used for classification:
+
+```yaml
+classification:
+  maxPagesForClassification: "ALL"  # Default: use all pages
+  # Or: "1", "2", "3", etc. - use only the first N pages
+```
+
+**Important**: When set to a number (e.g., `"3"`), only the first N pages are classified, but the result is applied to ALL pages in the document. This forces the entire document to be assigned a single class with one section.
+
+### Prompt Components
+
 In Pattern 2, you can customize classification behavior through various prompt components:
 
 ### System Prompts
````
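The page-limit semantics in the diff above can be sketched in a few lines of Python: when a number is set, only the first N pages go to the classifier, and the single resulting class is applied to every page. The `classify` callable here is a stand-in for the real Bedrock call, not the solution's actual code:

```python
from typing import Callable

def classify_document(pages: list[str],
                      max_pages: str,
                      classify: Callable[[list[str]], str]) -> list[str]:
    """Illustrative sketch of maxPagesForClassification: "ALL" classifies
    every page (page-level style); a numeric string classifies only the
    first N pages and applies that one class to the whole document."""
    if max_pages == "ALL":
        return [classify([p]) for p in pages]
    sample = pages[: int(max_pages)]
    doc_class = classify(sample)      # one class from the first N pages
    return [doc_class] * len(pages)   # applied to ALL pages

# Stub classifier: pretend each page's text is its class label.
label_first = lambda ps: ps[0]
print(classify_document(["invoice", "letter", "memo"], "1", label_first))
# ['invoice', 'invoice', 'invoice']
```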

docs/knowledge-base.md

Lines changed: 48 additions & 2 deletions

````diff
@@ -7,10 +7,11 @@ The GenAIIDP solution includes an integrated Document Knowledge Base query feature
 
 ## How It Works
 
-1. **Document Indexing**
+1. **Document Processing & Indexing**
    - Processed documents are automatically indexed in a vector database
    - Documents are chunked into semantic segments for efficient retrieval
    - Each chunk maintains a reference to its source document
+   - **Ingestion Schedule**: Documents are ingested into the knowledge base every 30 minutes, so newly processed documents may not be immediately available for querying
 
 2. **Interactive Query Interface**
    - Access through the Web UI via the "Knowledge Base" section
@@ -33,6 +34,25 @@ The GenAIIDP solution includes an integrated Document Knowledge Base query feature
 - **Markdown Formatting**: Responses support rich formatting for better readability
 - **Real-time Processing**: Get answers in seconds, even across large document collections
 
+## Architecture & Vector Storage Options
+
+The Knowledge Base feature supports two vector storage backends to optimize for different performance and cost requirements:
+
+### Vector Store Comparison
+
+| Aspect | OpenSearch Serverless | S3 Vectors |
+|--------|----------------------|------------|
+| **Query Latency** | Sub-millisecond | Sub-second |
+| **Pricing Model** | Always on (continuous capacity costs) | On demand (pay-per-query) |
+| **Storage Cost** | Higher | 40-60% lower |
+| **Best For** | Real-time applications | Cost-sensitive deployments |
+| **Features** | Full-text search, advanced filtering | Native S3 integration |
+
+### Choosing Your Vector Store
+
+- **OpenSearch Serverless** (default): Choose for applications requiring ultra-fast retrieval and real-time performance
+- **S3 Vectors**: Choose for cost optimization when sub-second query latency is acceptable
+
 ## Configuration
 
 The Document Knowledge Base Query feature can be configured during stack deployment:
@@ -46,14 +66,29 @@ ShouldUseDocumentKnowledgeBase:
     - "false"
   Description: Enable/disable the Document Knowledge Base feature
 
+KnowledgeBaseVectorStore:
+  Type: String
+  Default: "OPENSEARCH_SERVERLESS"
+  AllowedValues:
+    - "OPENSEARCH_SERVERLESS"
+    - "S3_VECTORS"
+  Description: Vector storage backend for the knowledge base
+
 DocumentKnowledgeBaseModel:
   Type: String
   Default: "us.amazon.nova-pro-v1:0"
   Description: Bedrock model to use for knowledge base queries (e.g., "us.anthropic.claude-3-7-sonnet-20250219-v1:0")
 ```
 
+### Supported Embedding Models
+
+Both vector store types support the same embedding models:
+- `amazon.titan-embed-text-v2:0` (default)
+- `cohere.embed-english-v3` (disabled by default)
+- `cohere.embed-multilingual-v3` (disabled by default)
+
 When the feature is enabled, the solution:
-- Creates necessary OpenSearch resources for document indexing
+- Creates the selected vector storage resources (OpenSearch or S3 Vectors)
 - Configures API endpoints for querying the knowledge base
 - Adds the query interface to the Web UI
@@ -111,3 +146,14 @@ The Knowledge Base feature maintains the security controls of the overall solution:
 - Document visibility respects user permissions
 - Questions and answers are processed securely within your AWS account
 - No data is sent to external services beyond the configured Bedrock models
+
+## Future Enhancements
+
+### Potential Improvements & Community Contributions
+- **CloudFormation Support**: When S3 Vectors gains native CloudFormation support
+- **Migration Tools**: Utilities to migrate between vector store types
+- **Hybrid Deployment**: Support for multiple Knowledge Bases with different vector stores
+- **Document Chunking Options**: The system currently uses default chunking strategies; additional chunking methods are available for optimization based on document types and use cases
+- Performance optimization suggestions
+- Additional embedding model support
+- Enhanced monitoring and alerting
````
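The changelog mentions that the S3 Vectors path provisions its vector bucket and index from a custom resource Lambda using the boto3 `s3vectors` client. A hedged sketch of what building the index-creation request might look like; the parameter names below reflect my reading of the S3 Vectors CreateIndex API and should be verified against the current boto3 documentation, and the bucket/index names are hypothetical:

```python
def build_index_request(bucket: str, index: str, dimension: int) -> dict:
    """Assemble CreateIndex parameters for the S3 Vectors API (parameter
    names are an assumption to be checked against current boto3 docs).
    Titan Text Embeddings V2, the default embedding model, uses 1024
    dimensions by default."""
    return {
        "vectorBucketName": bucket,
        "indexName": index,
        "dataType": "float32",
        "dimension": dimension,
        "distanceMetric": "cosine",
    }

params = build_index_request("idp-kb-vectors", "documents", 1024)
# A custom-resource Lambda would then call something like:
#   boto3.client("s3vectors").create_index(**params)
print(params["dimension"])  # 1024
```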

docs/languages.md

Lines changed: 167 additions & 0 deletions (new file)

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Language Support

When implementing Intelligent Document Processing solutions, language support is a crucial factor to consider. The approach you take depends on whether the language of your documents is supported by the components leveraged in the workflow, such as Amazon Bedrock Data Automation (BDA) or LLMs.

## Decision Process

Below is a decision tree illustrating the suggested decision process:

```mermaid
flowchart TD
    Start[Documents] --> Q1{Language supported by<br/>Bedrock Data Automation - BDA?}

    Q1 -->|Yes| BDACheck{Document quality<br/>and structure<br/>suitable for BDA?}
    Q1 -->|No| Pattern2Direct[Pattern 2<br/>Bedrock FMs]

    BDACheck -->|Yes| Pattern1[Pattern 1<br/>Bedrock Data Automation - BDA]
    BDACheck -->|No| Pattern2Alt1[Pattern 2<br/>Bedrock FMs]

    Pattern1 --> Accuracy1{Accuracy meets<br/>requirements?}
    Pattern2Direct --> Accuracy2{Accuracy meets<br/>requirements?}
    Pattern2Alt1 --> Accuracy3{Accuracy meets<br/>requirements?}

    Accuracy1 -->|No| Pattern2Fallback[Pattern 2<br/>Bedrock FMs]
    Accuracy1 -->|Yes| Deploy1[Deploy]

    Accuracy2 -->|No| OptimizePath2{Issue source:<br/>Classification or Extraction?}
    Accuracy2 -->|Yes| Deploy2[Deploy]

    Accuracy3 -->|No| OptimizePath3{Issue source:<br/>Classification or Extraction?}
    Accuracy3 -->|Yes| Deploy3[Deploy]

    OptimizePath2 -->|Classification| Pattern3A[Pattern 3<br/>UDOP model for classification]
    OptimizePath2 -->|Extraction| FineTuning2[Pattern 2<br/>And model fine-tuning]

    OptimizePath3 -->|Classification| Pattern3B[Pattern 3<br/>UDOP model for classification]
    OptimizePath3 -->|Extraction| FineTuning3[Pattern 2<br/>And model fine-tuning]

    Pattern2Fallback --> Accuracy4{Accuracy meets<br/>requirements?}
    Accuracy4 -->|Yes| Deploy4[Deploy]
    Accuracy4 -->|No| OptimizePath4{Issue source:<br/>Classification or Extraction?}

    OptimizePath4 -->|Classification| Pattern3C[Pattern 3<br/>UDOP model for classification]
    OptimizePath4 -->|Extraction| FineTuning4[Pattern 2<br/>And model fine-tuning]
```

## Pattern 1

> Pattern 1: Packet or Media processing with Bedrock Data Automation (BDA)

First, verify that your documents' language is supported by Amazon Bedrock Data Automation (BDA). If it is, begin with Pattern 1 (BDA).

At the time of writing (Sep 19, 2025), BDA supports the following languages:

- English
- Portuguese
- French
- Italian
- Spanish
- German

> Important note: BDA currently does not support vertical text orientation (commonly found in Japanese and Chinese documents). For the most up-to-date information, please consult the [BDA documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/bda-limits.html).

If BDA's accuracy doesn't meet your requirements for your specific scenario or language, proceed to Pattern 2.

## Pattern 2

> Pattern 2: OCR → Bedrock Classification (page-level or holistic) → Bedrock Extraction

For this pattern, follow this structured implementation approach:

```mermaid
flowchart TD
    Start[Pattern 2] --> Q1{Is full OCR transcription required<br/>for your use case?}

    Q1 -->|Yes| RequiredOCR[Step 1A: Select required OCR backend]
    Q1 -->|No| OptionalOCR[Step 1B: Optional OCR path]

    RequiredOCR --> Q2{Document language<br/>supported by Textract?}
    Q2 -->|Yes| TextractReq[Use Textract backend]
    Q2 -->|No| BedrockReq[Use Bedrock backend]

    OptionalOCR --> Q3{Consider OCR for<br/>potential accuracy boost?}
    Q3 -->|Yes| Q4{Document language<br/>supported by Textract?}
    Q3 -->|No| NoOCR[Disable OCR backend]

    Q4 -->|Yes| TextractOpt[Use Textract backend]
    Q4 -->|No| BedrockOpt[Use Bedrock backend]

    TextractReq --> ClassStep[Step 2: Classification and Extraction Models]
    BedrockReq --> ClassStep
    TextractOpt --> ClassStep
    BedrockOpt --> ClassStep
    NoOCR --> ClassStep

    ClassStep --> Q6{Document language:<br/>high-resource?}

    Q6 -->|Yes| StandardApproach[Select and test any model]
    Q6 -->|No| EnhancedApproach[Test multiple models<br/>Extend testing to 50+ docs]

    StandardApproach --> Q7{Classification and Extraction<br/>accuracy meet requirements?}
    EnhancedApproach --> Q7

    Q7 -->|Yes| AssessStep[Step 3: Assessment Strategy]
    Q7 -->|No| Optimize[Consider fine-tuning]

    Optimize --> AssessStep
    AssessStep --> Deploy[Deploy]
```

Comprehensive model selection guidance for different languages could fill an entire documentation suite, but understanding the fundamental challenges is essential for production deployments. Modern language models present a significant transparency gap: providers rarely publish detailed statements about language-specific performance characteristics or the distribution of training data across their model portfolios.

### The High-Resource vs Low-Resource Language Divide

The concept of language resources refers to the availability of training data, linguistic tools, and computational research investment for a given language. This divide creates a performance gap that persists across virtually all foundation models, regardless of their stated multilingual capabilities.

**High-resource languages** such as English, Mandarin Chinese, Spanish, French, and German typically benefit from extensive training data representation, resulting in more reliable extraction accuracy, better understanding of domain-specific terminology, and stronger performance on complex document structures.

**Low-resource languages** encompass a broad spectrum of languages with limited digital representation in training corpora. These languages require significantly more extensive testing and validation to achieve production-ready accuracy levels. The performance degradation can manifest in several ways: reduced accuracy in named entity recognition, challenges with domain-specific terminology, difficulty processing complex document layouts, and inconsistent handling of linguistic nuances such as morphological complexity or non-Latin scripts.

### Practical Implementation Approach

The absence of public performance statements from model providers necessitates an empirical approach to model selection. For high-resource languages, initial testing with 50-100 representative documents typically provides sufficient confidence in model performance. Low-resource languages, however, require substantially more comprehensive validation, often demanding 5-10 times the testing volume to achieve comparable confidence levels.

When working with low-resource languages, consider implementing a cascade approach where multiple models are evaluated in parallel during the pilot phase. This strategy helps identify which foundation models demonstrate the most consistent performance for your specific document types and linguistic characteristics. Additionally, establishing clear performance thresholds early in the process prevents costly iteration cycles later in deployment.
### OCR Backend Considerations for Language Support
130+
131+
The choice of OCR backend significantly impacts performance for different languages, particularly when working with low-resource languages or specialized document types. The IDP Accelerator supports three distinct OCR approaches, each with specific language capabilities and use cases.
132+
133+
#### Textract Backend Language Limitations
134+
135+
Amazon Textract provides robust OCR capabilities with confidence scoring, but has explicit language constraints that must be considered during backend selection. Textract can detect printed text and handwriting from the Standard English alphabet and ASCII symbols.
136+
At the time of writing (Sep 19, 2025) Textract supports English, German, French, Spanish, Italian, and Portuguese.
137+
138+
For languages outside this supported set, Textract's accuracy degrades significantly, making it unsuitable for production workloads.
139+
140+
#### Bedrock Backend for Low-Resource Languages
141+
142+
When working with languages not supported by Textract, the Bedrock OCR backend offers a compelling alternative using foundation models for text extraction. This approach leverages the multilingual capabilities of models like Claude and Nova, which can process text in hundreds of languages with varying degrees of accuracy.
143+
144+
The Bedrock backend demonstrates particular value when the extracted text will be included alongside document images in subsequent classification and extraction prompts. This multi-turn approach often compensates for OCR inaccuracies by allowing the downstream models to cross-reference the text transcription against the visual content.
145+
146+
#### Strategic OCR Disabling
147+
148+
In scenarios where full text transcription provides minimal value to downstream processing, disabling OCR entirely can improve cost efficiency. This approach works particularly well when document images contain sufficient visual information for direct image-based only processing, or when the document structure is highly standardized and predictable.
149+
150+
The decision to disable OCR should be based on empirical testing with representative document samples. If classification and extraction accuracy remains acceptable using only document images, the elimination of OCR processing can significantly reduce both latency and operational costs.
151+
152+
### Model Families Mixing
153+
154+
Using different model families for OCR versus classification and extraction can yield significant performance improvements, particularly for challenging language scenarios. For example, a deployment might use Claude for OCR text extraction while employing Nova models for subsequent classification and extraction tasks, optimizing for each model's particular strengths.
155+
156+
This approach allows teams to leverage the best multilingual OCR capabilities for text transcription while utilizing different models optimized for reasoning and structured data extraction. The key consideration is ensuring that the combined approach maintains acceptable accuracy while managing the complexity of multi-model workflows.
157+
158+
Other considerations:
159+
160+
- For documents with poor quality (e.g., handwritten text) consider alternative Bedrock Backend instead of Textract
161+
- If accuracy requirements aren't met, explore model fine-tuning options
162+
163+
## Pattern 3
164+
165+
> Pattern 3: OCR → UDOP Classification (SageMaker) → Bedrock Extraction
166+
167+
If Bedrock-based classification doesn't meet your requirements, implement Pattern 3 using Unified Document Processing (UDOP) classification.
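The first routing step of the decision tree at the top of this page can be condensed into a small helper (a sketch of the documented logic, not shipped code; the BDA language list reflects the Sep 19, 2025 snapshot above):

```python
# BDA-supported languages at the time of writing (Sep 19, 2025).
BDA_LANGUAGES = {"English", "Portuguese", "French", "Italian", "Spanish", "German"}

def initial_pattern(language: str, suitable_for_bda: bool) -> str:
    """First routing step from the decision tree: a BDA-supported language
    with BDA-suitable document quality/structure starts on Pattern 1;
    everything else starts on Pattern 2. Later accuracy checks may still
    move a workload to Pattern 2 or Pattern 3."""
    if language in BDA_LANGUAGES and suitable_for_bda:
        return "Pattern 1"
    return "Pattern 2"

print(initial_pattern("German", True))    # Pattern 1
print(initial_pattern("Japanese", True))  # Pattern 2
```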
