Commit a58204f

Author: Bob Strahan
Merged branch 'develop' (2 parents: 53a6544 + b9d02f5)

File tree

145 files changed (+25476 / -1725 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -15,4 +15,5 @@ build/
 __pycache__
 *.code-workspace
 .ruff_cache
+.kiro
 rvl_cdip_*

.gitlab-ci.yml

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ developer_tests:
     - apt-get update -y
     - apt-get install make -y
    - pip install ruff
+    # Install test dependencies
+    - cd lib/idp_common_pkg && pip install -e ".[test]" && cd ../..

   script:
     - make lint-cicd

1751146101381_classification_state.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

CHANGELOG.md

Lines changed: 98 additions & 2 deletions
@@ -5,14 +5,110 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.6]
+
+### Fixed
+- Update Athena/Glue table configuration to use Parquet format instead of JSON #20
+- CloudFormation Error when Changing Evaluation Bucket Name #19
+
+### Added
+- **Extended Document Format Support in OCR Service**
+  - Added support for processing additional document formats beyond PDF and images:
+    - Plain text (.txt) files with automatic pagination for large documents
+    - CSV (.csv) files with table visualization and structured output
+    - Excel workbooks (.xlsx, .xls) with multi-sheet support (each sheet as a page)
+    - Word documents (.docx, .doc) with text extraction and visual representation
+  - **Key Features**:
+    - Consistent processing model across all document formats
+    - Standard page image generation for all formats
+    - Structured text output in formats compatible with existing extraction pipelines
+    - Confidence metrics for all document types
+    - Automatic format detection from file content and extension
+  - **Implementation Details**:
+    - Format-specific processing strategies for optimal results
+    - Enhanced text rendering for plain text documents
+    - Table visualization for CSV and Excel data
+    - Word document paragraph extraction with formatting preservation
+    - S3 storage integration matching the existing PDF processing workflow
+
+## [0.3.5]
+
+### Added
+- **Human-in-the-Loop (HITL) Support - Pattern 1**
+  - Added comprehensive Human-in-the-Loop review capabilities using Amazon SageMaker Augmented AI (A2I)
+  - **Key Features**:
+    - Automatic triggering when extraction confidence falls below a configurable threshold
+    - Integration with the SageMaker A2I Review Portal for human validation and correction
+    - Configurable confidence threshold through the Web UI Portal Configuration tab (0.0-1.0 range)
+    - Seamless result integration, with human-verified data automatically updating source results
+  - **Workflow Integration**:
+    - HITL tasks created automatically when confidence thresholds are not met
+    - Reviewers can validate correct extractions or make necessary corrections through the Review Portal
+    - Document processing continues with human-verified data after review completion
+  - **Configuration Management**:
+    - `EnableHITL` parameter for feature toggle
+    - Confidence threshold configurable via the Web UI without stack redeployment
+    - Support for existing private workforce work teams via input parameter
+  - **CloudFormation Output**: Added `SageMakerA2IReviewPortalURL` for easy access to the review portal
+  - **Known Limitations**: The current A2I version cannot provide direct hyperlinks to specific document tasks; template updates require resource recreation
+- **Document Compression for Large Documents - all patterns**
+  - Added automatic compression support to handle large documents and avoid exceeding the Step Functions payload limit (256KB)
+  - **Key Features**:
+    - Automatic compression (the default trigger threshold of 0KB enables compression by default)
+    - Transparent handling of both compressed and uncompressed documents in Lambda functions
+    - Temporary S3 storage for compressed document state, with automatic cleanup via lifecycle policies
+  - **New Utility Methods**:
+    - `Document.load_document()`: Automatically detects and decompresses document input from Lambda events
+    - `Document.serialize_document()`: Automatically compresses large documents for Lambda responses
+    - `Document.compress()` and `Document.decompress()`: Compression/decompression methods
+  - **Lambda Function Integration**: All relevant Lambda functions updated to use the compression utilities
+  - **Resolves Step Functions Errors**: Eliminates "result with a size exceeding the maximum number of bytes service limit" errors for large multi-page documents
+- **Multi-Backend OCR Support - Patterns 2 and 3**
+  - Textract backend (default): existing AWS Textract functionality
+  - Bedrock backend: new LLM-based OCR using Claude/Nova models
+  - None backend: image-only processing without OCR
+- **Bedrock OCR Integration - Patterns 2 and 3**
+  - Customizable system and task prompts for OCR optimization
+  - Better handling of complex documents, tables, and forms
+  - Layout preservation capabilities
+- **Image Preprocessing - Pattern 2**
+  - Adaptive binarization: improves OCR accuracy on documents with:
+    - Uneven lighting or shadows
+    - Low-contrast text
+    - Background noise or gradients
+  - Optional feature that can be enabled or disabled via configuration
+- **YAML Parsing Support for LLM Responses - Patterns 2 and 3**
+  - Added comprehensive YAML parsing capabilities to complement the existing JSON parsing functionality
+  - New `extract_yaml_from_text()` function with robust multi-strategy YAML extraction:
+    - YAML in ```yaml and ```yml code blocks
+    - YAML with document markers (---)
+    - Pattern-based YAML detection using indentation and key indicators
+  - New `detect_format()` function for automatic format detection, returning 'json', 'yaml', or 'unknown'
+  - New unified `extract_structured_data_from_text()` wrapper function that automatically detects and parses both JSON and YAML formats
+  - **Token Efficiency**: YAML typically uses 10-30% fewer tokens than equivalent JSON due to its more compact syntax
+  - **Service Integration**: Updated the classification service to use the new unified parsing function with automatic fallback between formats
+  - **Comprehensive Testing**: Added 39 new unit tests covering all YAML extraction strategies, format detection, and edge cases
+  - **Backward Compatibility**: All existing JSON functionality preserved unchanged; the new functionality is purely additive
+  - **Intelligent Fallback**: Robust fallback mechanism handles cases where the preferred format fails (e.g., JSON requested as YAML falls back to JSON)
+  - **Production Ready**: Handles malformed content gracefully, with comprehensive error handling and logging
+  - **Example Notebook**: Added `notebooks/examples/step3_extraction_using_yaml.ipynb` demonstrating YAML-based extraction with automatic format detection and token-efficiency benefits
+
+### Fixed
+- **Enhanced JSON Extraction from LLM Responses (Issue #16)**
+  - Modularized the duplicate `_extract_json()` functions across the classification, extraction, summarization, and assessment services into a common `extract_json_from_text()` utility function
+  - Improved multi-line JSON handling for literal newlines in string values that previously caused parsing failures
+  - Added robust JSON validation and multiple fallback strategies for better extraction reliability
+  - Enhanced string parsing with proper escape-sequence handling for quotes and newlines
+  - Added comprehensive unit tests covering various JSON formats, including multi-line scenarios
+
 ## [0.3.4]
 
 ### Added
 - **Configurable Image Processing and Enhanced Resizing Logic**
   - **Improved Image Resizing Algorithm**: Enhanced aspect-ratio-preserving scaling that only downsizes when necessary (scale factor < 1.0) to prevent image distortion
   - **Configurable Image Dimensions**: All processing services (Assessment, Classification, Extraction, OCR) now support configurable image dimensions through configuration, with a default 951×1268 resolution
   - **Service-Specific Image Optimization**: Each service can use optimal image dimensions for performance and quality tuning
-  - **Enhanced OCR Service**: Added configurable DPI for PDF-to-image conversion (default: 300) and optional image resizing with dual image strategy (stores original high-DPI images while using resized images for processing)
+  - **Enhanced OCR Service**: Added configurable DPI for PDF-to-image conversion and optional image resizing with dual image strategy (stores original high-DPI images while using resized images for processing)
   - **Runtime Configuration**: No code changes needed to adjust image processing - all configurable through service configuration
   - **Backward Compatibility**: Default values maintain existing behavior with no immediate action required for existing deployments
 - **Enhanced Configuration Management**

@@ -308,7 +404,7 @@ The `idp_common_pkg` introduces a unified Document model approach for consistent
 - **Section**: Represents logical document sections with classification and extraction results
 
 #### Service Classes
-- **OcrService**: Processes documents with AWS Textract and updates the Document with OCR results
+- **OcrService**: Processes documents with AWS Textract or Amazon Bedrock and updates the Document with OCR results
 - **ClassificationService**: Classifies document pages/sections using Bedrock or SageMaker backends
 - **ExtractionService**: Extracts structured information from document sections using Bedrock
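The compression flow added in 0.3.5 can be sketched in a few lines. This is an illustrative stand-in, not the actual `idp_common` implementation: the function names mirror the changelog's `Document.load_document()` / `Document.serialize_document()`, but the dict envelope and threshold handling below are assumptions.

```python
import base64
import gzip
import json

# Step Functions caps state payloads at 256KB; per the changelog, the shipped
# default trigger threshold (0KB) compresses everything. The envelope shape
# here is illustrative, not the real idp_common API.
COMPRESSION_THRESHOLD_BYTES = 256 * 1024

def serialize_document(doc: dict) -> dict:
    """Gzip+base64 a document dict when it would exceed the payload limit."""
    raw = json.dumps(doc).encode("utf-8")
    if len(raw) <= COMPRESSION_THRESHOLD_BYTES:
        return {"compressed": False, "document": doc}
    return {"compressed": True,
            "payload": base64.b64encode(gzip.compress(raw)).decode("ascii")}

def load_document(event: dict) -> dict:
    """Transparently accept either a compressed or an uncompressed envelope."""
    if not event.get("compressed"):
        return event["document"]
    return json.loads(gzip.decompress(base64.b64decode(event["payload"])))

# Round trip: a large multi-page document survives compression unchanged.
big_doc = {"pages": [{"id": i, "text": "lorem ipsum " * 200} for i in range(200)]}
assert load_document(serialize_document(big_doc)) == big_doc
```

Per the changelog, the real utilities write compressed state to temporary S3 storage (cleaned up by lifecycle policies) rather than inlining it in the payload; the inline base64 envelope above is a simplification.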

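The automatic JSON/YAML format detection the changelog describes for `detect_format()` can also be illustrated with a much-simplified sketch; the heuristics below (code fences, document markers, key indicators) follow the changelog's wording, but the actual multi-strategy implementation is not shown in this commit.

```python
import re

def detect_format(text: str) -> str:
    """Return 'json', 'yaml', or 'unknown' for an LLM response.

    A much-simplified stand-in for the multi-strategy detection the
    changelog describes (code fences, document markers, key indicators).
    """
    fence = re.search(r"```(\w+)", text)
    if fence:
        lang = fence.group(1).lower()
        if lang == "json":
            return "json"
        if lang in ("yaml", "yml"):
            return "yaml"
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        return "json"
    if stripped.startswith("---") or re.search(r"^\w[\w ]*:\s", stripped, re.M):
        return "yaml"
    return "unknown"

print(detect_format('{"doc_type": "invoice"}'))          # json
print(detect_format("---\ndoc_type: invoice\n"))         # yaml
print(detect_format("no structured content here"))       # unknown
```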
Makefile

Lines changed: 1 addition & 1 deletion
@@ -46,4 +46,4 @@ commit: lint test
 	export COMMIT_MESSAGE="$(shell q chat --no-interactive --trust-all-tools "Understand pending local git change and changes to be committed, then infer a commit message. Return this commit message only" | tail -n 1 | sed 's/\x1b\[[0-9;]*m//g')" && \
 	git add . && \
 	git commit -am "$${COMMIT_MESSAGE}" && \
-	git push
+	git push

README.md

Lines changed: 4 additions & 2 deletions
@@ -74,7 +74,7 @@ After deployment, you can quickly process a document and view results:
 
 2. **Use Sample Documents**:
    - For Pattern 1 (BDA): Use [samples/lending_package.pdf](./samples/lending_package.pdf)
-   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
+   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
 
 3. **Monitor Processing**:
    - **Via Web UI**: Track document status on the dashboard

@@ -98,7 +98,9 @@ To update an existing GenAIIDP stack to a new version:
 2. Select your existing stack
 3. Click "Update"
 4. Select "Replace current template"
-5. Enter the template URL: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main.yaml`
+5. Enter the template URL:
+   - us-west-2: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main.yaml`
+   - us-east-1: `https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main.yaml`
 6. Follow the prompts to update your stack, reviewing any parameter changes
 7. For detailed instructions, see the [Deployment Guide](./docs/deployment.md#updating-an-existing-stack)

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.4
+0.3.6

config_library/pattern-1/default/config.yaml

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 # SPDX-License-Identifier: MIT-0
 
 notes: Processing configuration in BDA project.
+assessment:
+  default_confidence_threshold: '0.8'
 summarization:
   top_p: '0.1'
   max_tokens: '4096'

config_library/pattern-2/README.md

Lines changed: 57 additions & 3 deletions
@@ -10,11 +10,65 @@ This directory contains configurations for Pattern 2 of the GenAI IDP Accelerato
 Pattern 2 implements an intelligent document processing workflow that uses Amazon Bedrock with Nova or Claude models for both page classification/grouping and information extraction.
 
 Key components of Pattern 2:
-- OCR processing using Amazon Textract
-- Document classification using Claude via Amazon Bedrock (with two available methods):
+- **OCR processing** with multiple backend options (Textract, Bedrock LLM, or image-only)
+- **Document classification** using Claude via Amazon Bedrock (with two available methods):
   - Page-level classification: Classifies individual pages and groups them
   - Holistic packet classification: Analyzes multi-document packets to identify document boundaries
-- Field extraction using Claude via Amazon Bedrock
+- **Field extraction** using Claude via Amazon Bedrock
+- **Assessment functionality** for confidence evaluation of extraction results
+
+## OCR Backend Selection for Pattern 2
+
+Pattern 2 supports multiple OCR backends, each with different implications for the assessment feature:
+
+### Textract Backend (Default - Recommended)
+- **Best for**: Production workflows, and whenever assessment is enabled
+- **Assessment Impact**: ✅ Full assessment capability with granular confidence scores
+- **Text Confidence Data**: Rich confidence information for each text block
+- **Cost**: Standard Textract pricing
+
+### Bedrock Backend (LLM-based OCR)
+- **Best for**: Challenging documents where traditional OCR fails
+- **Assessment Impact**: ❌ Assessment disabled - no confidence data available
+- **Text Confidence Data**: Empty (no confidence scores from LLM OCR)
+- **Cost**: Bedrock LLM inference costs
+
+### None Backend (Image-only)
+- **Best for**: Custom OCR integration, image-only workflows
+- **Assessment Impact**: ❌ Assessment disabled - no OCR text available
+- **Text Confidence Data**: Empty
+- **Cost**: No OCR costs
+
+> ⚠️ **Assessment Recommendation**: Use the Textract backend (the default) when assessment functionality is required. The Bedrock and None backends eliminate assessment capability because they produce no confidence data.
+
+## Text Confidence Data and Assessment Integration
+
+Pattern 2's assessment feature relies on text confidence data generated during the OCR phase to evaluate extraction quality and provide confidence scores for each extracted attribute.
+
+### How Text Confidence Data Enables Assessment
+
+1. **OCR Phase**: Textract generates confidence scores for each text block during document processing
+2. **Condensed Format**: The OCR service creates optimized `textConfidence.json` files with 80-90% token reduction
+3. **Assessment Phase**: The assessment LLM analyzes extraction results against the OCR confidence data to provide accurate confidence evaluation
+4. **UI Integration**: Assessment results appear in the web interface with color-coded confidence indicators
+
+### Assessment Workflow Impact by OCR Backend
+
+**With Textract Backend:**
+```
+Document → Textract OCR → Rich Confidence Data → Assessment LLM → Confidence Scores
+```
+- The assessment LLM receives detailed confidence information for each text region
+- It can accurately evaluate extraction confidence based on OCR quality
+- It provides meaningful confidence scores and explanations
+
+**With Bedrock/None Backend:**
+```
+Document → LLM/No OCR → Empty Confidence Data → Assessment Disabled
+```
+- No confidence data is available for assessment
+- The assessment feature cannot function without OCR confidence scores
+- Assessment is skipped or disabled
 
 ## Adding Configurations
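The 80-90% token reduction mentioned in step 2 above comes from stripping Textract output down to what the assessment step actually consumes. A hypothetical sketch (the real `textConfidence.json` schema is not shown in this commit; the condensed field names are assumptions):

```python
import json

def condense_text_confidence(textract_blocks: list) -> list:
    """Keep only what assessment needs: the text plus a rounded confidence."""
    condensed = []
    for block in textract_blocks:
        if block.get("BlockType") != "LINE":
            continue
        condensed.append({
            "text": block["Text"],
            "confidence": round(block["Confidence"], 1),
        })
    return condensed

# A Textract LINE block also carries geometry, ids, and relationships that the
# assessment LLM never reads; dropping them is what shrinks the token count.
blocks = [
    {"BlockType": "PAGE", "Id": "p1"},
    {
        "BlockType": "LINE",
        "Text": "Account Number: 1234",
        "Confidence": 99.123,
        "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.02, "Left": 0.1, "Top": 0.2}},
        "Id": "l1",
        "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}],
    },
]
condensed = condense_text_confidence(blocks)
assert len(json.dumps(condensed)) < len(json.dumps(blocks))
print(condensed)  # [{'text': 'Account Number: 1234', 'confidence': 99.1}]
```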

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,8 +3,15 @@
33

44
notes: Default settings
55
ocr:
6+
backend: "textract" # Default to Textract for backward compatibility
7+
model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
8+
system_prompt: "You are an expert OCR system. Extract all text from the provided image accurately, preserving layout where possible."
9+
task_prompt: "Extract all text from this document image. Preserve the layout, including paragraphs, tables, and formatting."
610
features:
711
- name: LAYOUT
12+
image:
13+
target_width: '951'
14+
target_height: '1268'
815
classes:
916
- name: Bank Statement
1017
description: Monthly bank account statement
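The `target_width`/`target_height` values in this config feed the aspect-ratio-preserving, downsize-only scaling described in the 0.3.4 changelog entry. A minimal sketch of that logic (the function name and signature here are illustrative, not the package's API):

```python
def resize_dims(width: int, height: int, target_width: int, target_height: int) -> tuple:
    """Aspect-ratio-preserving scaling that only ever downsizes.

    The scale factor is the tighter of the two axis ratios; when it is >= 1.0
    the image already fits the target box, so it is returned untouched
    (no upscaling, no distortion).
    """
    scale = min(target_width / width, target_height / height)
    if scale >= 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))

# A 2550x3300 page (US Letter at 300 DPI) shrunk to fit the 951x1268 default:
print(resize_dims(2550, 3300, 951, 1268))  # (951, 1231)
# A smaller image is left alone rather than upscaled:
print(resize_dims(800, 600, 951, 1268))    # (800, 600)
```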
