Commit a58204f

Author: Bob Strahan
Merged branch 'develop' (2 parents: 53a6544 + b9d02f5)

File tree

145 files changed (+25476 / -1725 lines)


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -15,4 +15,5 @@ build/
 __pycache__
 *.code-workspace
 .ruff_cache
+.kiro
 rvl_cdip_*

.gitlab-ci.yml

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,8 @@ developer_tests:
     - apt-get update -y
     - apt-get install make -y
    - pip install ruff
+    # Install test dependencies
+    - cd lib/idp_common_pkg && pip install -e ".[test]" && cd ../..

   script:
     - make lint-cicd

1751146101381_classification_state.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

CHANGELOG.md

Lines changed: 98 additions & 2 deletions
@@ -5,14 +5,110 @@ SPDX-License-Identifier: MIT-0
 
 ## [Unreleased]
 
+## [0.3.6]
+
+### Fixed
+- Update Athena/Glue table configuration to use Parquet format instead of JSON #20
+- CloudFormation Error when Changing Evaluation Bucket Name #19
+
+### Added
+- **Extended Document Format Support in OCR Service**
+  - Added support for processing additional document formats beyond PDF and images:
+    - Plain text (.txt) files with automatic pagination for large documents
+    - CSV (.csv) files with table visualization and structured output
+    - Excel workbooks (.xlsx, .xls) with multi-sheet support (each sheet as a page)
+    - Word documents (.docx, .doc) with text extraction and visual representation
+  - **Key Features**:
+    - Consistent processing model across all document formats
+    - Standard page image generation for all formats
+    - Structured text output in formats compatible with existing extraction pipelines
+    - Confidence metrics for all document types
+    - Automatic format detection from file content and extension
+  - **Implementation Details**:
+    - Format-specific processing strategies for optimal results
+    - Enhanced text rendering for plain text documents
+    - Table visualization for CSV and Excel data
+    - Word document paragraph extraction with formatting preservation
+    - S3 storage integration matching the existing PDF processing workflow
+
+## [0.3.5]
+
+### Added
+- **Human-in-the-Loop (HITL) Support - Pattern 1**
+  - Added comprehensive Human-in-the-Loop review capabilities using Amazon SageMaker Augmented AI (A2I)
+  - **Key Features**:
+    - Automatic triggering when extraction confidence falls below a configurable threshold
+    - Integration with the SageMaker A2I Review Portal for human validation and correction
+    - Configurable confidence threshold through the Web UI Portal Configuration tab (0.0-1.0 range)
+    - Seamless result integration, with human-verified data automatically updating source results
+  - **Workflow Integration**:
+    - HITL tasks created automatically when confidence thresholds are not met
+    - Reviewers can validate correct extractions or make necessary corrections through the Review Portal
+    - Document processing continues with human-verified data after review completion
+  - **Configuration Management**:
+    - `EnableHITL` parameter for feature toggle
+    - Confidence threshold configurable via the Web UI without stack redeployment
+    - Support for existing private workforce work teams via input parameter
+  - **CloudFormation Output**: Added `SageMakerA2IReviewPortalURL` for easy access to the review portal
+  - **Known Limitations**: The current A2I version cannot provide direct hyperlinks to specific document tasks; template updates require resource recreation
+- **Document Compression for Large Documents - all patterns**
+  - Added automatic compression support to handle large documents and avoid exceeding the Step Functions payload limit (256KB)
+  - **Key Features**:
+    - Automatic compression (the default trigger threshold of 0KB enables compression by default)
+    - Transparent handling of both compressed and uncompressed documents in Lambda functions
+    - Temporary S3 storage for compressed document state, with automatic cleanup via lifecycle policies
+  - **New Utility Methods**:
+    - `Document.load_document()`: Automatically detects and decompresses document input from Lambda events
+    - `Document.serialize_document()`: Automatically compresses large documents for Lambda responses
+    - `Document.compress()` and `Document.decompress()`: Compression/decompression methods
+  - **Lambda Function Integration**: All relevant Lambda functions updated to use the compression utilities
+  - **Resolves Step Functions Errors**: Eliminates "result with a size exceeding the maximum number of bytes service limit" errors for large multi-page documents
+- **Multi-Backend OCR Support - Patterns 2 and 3**
+  - Textract backend (default): existing AWS Textract functionality
+  - Bedrock backend: new LLM-based OCR using Claude/Nova models
+  - None backend: image-only processing without OCR
+- **Bedrock OCR Integration - Patterns 2 and 3**
+  - Customizable system and task prompts for OCR optimization
+  - Better handling of complex documents, tables, and forms
+  - Layout preservation capabilities
+- **Image Preprocessing - Pattern 2**
+  - Adaptive binarization: improves OCR accuracy on documents with:
+    - Uneven lighting or shadows
+    - Low-contrast text
+    - Background noise or gradients
+  - Optional feature that can be enabled or disabled via configuration
+- **YAML Parsing Support for LLM Responses - Patterns 2 and 3**
+  - Added comprehensive YAML parsing capabilities to complement the existing JSON parsing functionality
+  - New `extract_yaml_from_text()` function with robust multi-strategy YAML extraction:
+    - YAML in ```yaml and ```yml code blocks
+    - YAML with document markers (---)
+    - Pattern-based YAML detection using indentation and key indicators
+  - New `detect_format()` function for automatic format detection, returning 'json', 'yaml', or 'unknown'
+  - New unified `extract_structured_data_from_text()` wrapper function that automatically detects and parses both JSON and YAML formats
+  - **Token Efficiency**: YAML typically uses 10-30% fewer tokens than equivalent JSON due to its more compact syntax
+  - **Service Integration**: Updated the classification service to use the new unified parsing function with automatic fallback between formats
+  - **Comprehensive Testing**: Added 39 new unit tests covering all YAML extraction strategies, format detection, and edge cases
+  - **Backward Compatibility**: All existing JSON functionality preserved unchanged; the new functionality is purely additive
+  - **Intelligent Fallback**: Robust fallback mechanism handles cases where the preferred format fails (e.g., JSON requested as YAML falls back to JSON)
+  - **Production Ready**: Handles malformed content gracefully, with comprehensive error handling and logging
+  - **Example Notebook**: Added `notebooks/examples/step3_extraction_using_yaml.ipynb` demonstrating YAML-based extraction with automatic format detection and token-efficiency benefits
+
+### Fixed
+- **Enhanced JSON Extraction from LLM Responses (Issue #16)**
+  - Modularized the duplicate `_extract_json()` functions across the classification, extraction, summarization, and assessment services into a common `extract_json_from_text()` utility function
+  - Improved multi-line JSON handling for literal newlines in string values that previously caused parsing failures
+  - Added robust JSON validation and multiple fallback strategies for better extraction reliability
+  - Enhanced string parsing with proper escape-sequence handling for quotes and newlines
+  - Added comprehensive unit tests covering various JSON formats, including multi-line scenarios
+
 ## [0.3.4]
 
 ### Added
 - **Configurable Image Processing and Enhanced Resizing Logic**
   - **Improved Image Resizing Algorithm**: Enhanced aspect-ratio-preserving scaling that only downsizes when necessary (scale factor < 1.0) to prevent image distortion
   - **Configurable Image Dimensions**: All processing services (Assessment, Classification, Extraction, OCR) now support configurable image dimensions through configuration, with a default 951×1268 resolution
   - **Service-Specific Image Optimization**: Each service can use optimal image dimensions for performance and quality tuning
-  - **Enhanced OCR Service**: Added configurable DPI for PDF-to-image conversion (default: 300) and optional image resizing with dual image strategy (stores original high-DPI images while using resized images for processing)
+  - **Enhanced OCR Service**: Added configurable DPI for PDF-to-image conversion and optional image resizing with dual image strategy (stores original high-DPI images while using resized images for processing)
   - **Runtime Configuration**: No code changes needed to adjust image processing - all configurable through service configuration
   - **Backward Compatibility**: Default values maintain existing behavior with no immediate action required for existing deployments
 - **Enhanced Configuration Management**

@@ -308,7 +404,7 @@ The `idp_common_pkg` introduces a unified Document model approach for consistent
 - **Section**: Represents logical document sections with classification and extraction results
 
 #### Service Classes
-- **OcrService**: Processes documents with AWS Textract and updates the Document with OCR results
+- **OcrService**: Processes documents with AWS Textract or Amazon Bedrock and updates the Document with OCR results
 - **ClassificationService**: Classifies document pages/sections using Bedrock or SageMaker backends
 - **ExtractionService**: Extracts structured information from document sections using Bedrock
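The compression flow added in 0.3.5 can be sketched in a few lines. This is an illustrative stand-in, not the actual `idp_common` implementation: the function names mirror the changelog's `Document.load_document()` / `Document.serialize_document()`, but the dict envelope and threshold handling below are assumptions.

```python
import base64
import gzip
import json

# Step Functions caps state payloads at 256KB; per the changelog, the shipped
# default trigger threshold (0KB) compresses everything. The envelope shape
# here is illustrative, not the real idp_common API.
COMPRESSION_THRESHOLD_BYTES = 256 * 1024

def serialize_document(doc: dict) -> dict:
    """Gzip+base64 a document dict when it would exceed the payload limit."""
    raw = json.dumps(doc).encode("utf-8")
    if len(raw) <= COMPRESSION_THRESHOLD_BYTES:
        return {"compressed": False, "document": doc}
    return {"compressed": True,
            "payload": base64.b64encode(gzip.compress(raw)).decode("ascii")}

def load_document(event: dict) -> dict:
    """Transparently accept either a compressed or an uncompressed envelope."""
    if not event.get("compressed"):
        return event["document"]
    return json.loads(gzip.decompress(base64.b64decode(event["payload"])))

# Round trip: a large multi-page document survives compression unchanged.
big_doc = {"pages": [{"id": i, "text": "lorem ipsum " * 200} for i in range(200)]}
assert load_document(serialize_document(big_doc)) == big_doc
```

Per the changelog, the real utilities write compressed state to temporary S3 storage (cleaned up by lifecycle policies) rather than inlining it in the payload; the inline base64 envelope above is a simplification.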

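The automatic JSON/YAML format detection the changelog describes for `detect_format()` can also be illustrated with a much-simplified sketch; the heuristics below (code fences, document markers, key indicators) follow the changelog's wording, but the actual multi-strategy implementation is not shown in this commit.

```python
import re

def detect_format(text: str) -> str:
    """Return 'json', 'yaml', or 'unknown' for an LLM response.

    A much-simplified stand-in for the multi-strategy detection the
    changelog describes (code fences, document markers, key indicators).
    """
    fence = re.search(r"```(\w+)", text)
    if fence:
        lang = fence.group(1).lower()
        if lang == "json":
            return "json"
        if lang in ("yaml", "yml"):
            return "yaml"
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        return "json"
    if stripped.startswith("---") or re.search(r"^\w[\w ]*:\s", stripped, re.M):
        return "yaml"
    return "unknown"

print(detect_format('{"doc_type": "invoice"}'))          # json
print(detect_format("---\ndoc_type: invoice\n"))         # yaml
print(detect_format("no structured content here"))       # unknown
```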
Makefile

Lines changed: 1 addition & 1 deletion
@@ -46,4 +46,4 @@ commit: lint test
 	export COMMIT_MESSAGE="$(shell q chat --no-interactive --trust-all-tools "Understand pending local git change and changes to be committed, then infer a commit message. Return this commit message only" | tail -n 1 | sed 's/\x1b\[[0-9;]*m//g')" && \
 	git add . && \
 	git commit -am "$${COMMIT_MESSAGE}" && \
-	git push
+	git push

README.md

Lines changed: 4 additions & 2 deletions
@@ -74,7 +74,7 @@ After deployment, you can quickly process a document and view results:
 
 2. **Use Sample Documents**:
    - For Pattern 1 (BDA): Use [samples/lending_package.pdf](./samples/lending_package.pdf)
-   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
+   - For Patterns 2 and 3: Use [samples/rvl_cdip_package.pdf](./samples/rvl_cdip_package.pdf)
 
 3. **Monitor Processing**:
    - **Via Web UI**: Track document status on the dashboard

@@ -98,7 +98,9 @@ To update an existing GenAIIDP stack to a new version:
 2. Select your existing stack
 3. Click "Update"
 4. Select "Replace current template"
-5. Enter the template URL: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main.yaml`
+5. Enter the template URL:
+   - us-west-2: `https://s3.us-west-2.amazonaws.com/aws-ml-blog-us-west-2/artifacts/genai-idp/idp-main.yaml`
+   - us-east-1: `https://s3.us-east-1.amazonaws.com/aws-ml-blog-us-east-1/artifacts/genai-idp/idp-main.yaml`
 6. Follow the prompts to update your stack, reviewing any parameter changes
 7. For detailed instructions, see the [Deployment Guide](./docs/deployment.md#updating-an-existing-stack)

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.3.4
+0.3.6

config_library/pattern-1/default/config.yaml

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
 # SPDX-License-Identifier: MIT-0
 
 notes: Processing configuration in BDA project.
+assessment:
+  default_confidence_threshold: '0.8'
 summarization:
   top_p: '0.1'
   max_tokens: '4096'

config_library/pattern-2/README.md

Lines changed: 57 additions & 3 deletions
@@ -10,11 +10,65 @@ This directory contains configurations for Pattern 2 of the GenAI IDP Accelerato
 Pattern 2 implements an intelligent document processing workflow that uses Amazon Bedrock with Nova or Claude models for both page classification/grouping and information extraction.
 
 Key components of Pattern 2:
-- OCR processing using Amazon Textract
-- Document classification using Claude via Amazon Bedrock (with two available methods):
+- **OCR processing** with multiple backend options (Textract, Bedrock LLM, or image-only)
+- **Document classification** using Claude via Amazon Bedrock (with two available methods):
   - Page-level classification: Classifies individual pages and groups them
   - Holistic packet classification: Analyzes multi-document packets to identify document boundaries
-- Field extraction using Claude via Amazon Bedrock
+- **Field extraction** using Claude via Amazon Bedrock
+- **Assessment functionality** for confidence evaluation of extraction results
+
+## OCR Backend Selection for Pattern 2
+
+Pattern 2 supports multiple OCR backends, each with different implications for the assessment feature:
+
+### Textract Backend (Default - Recommended)
+- **Best for**: Production workflows, and whenever assessment is enabled
+- **Assessment Impact**: ✅ Full assessment capability with granular confidence scores
+- **Text Confidence Data**: Rich confidence information for each text block
+- **Cost**: Standard Textract pricing
+
+### Bedrock Backend (LLM-based OCR)
+- **Best for**: Challenging documents where traditional OCR fails
+- **Assessment Impact**: ❌ Assessment disabled - no confidence data available
+- **Text Confidence Data**: Empty (no confidence scores from LLM OCR)
+- **Cost**: Bedrock LLM inference costs
+
+### None Backend (Image-only)
+- **Best for**: Custom OCR integration, image-only workflows
+- **Assessment Impact**: ❌ Assessment disabled - no OCR text available
+- **Text Confidence Data**: Empty
+- **Cost**: No OCR costs
+
+> ⚠️ **Assessment Recommendation**: Use the Textract backend (the default) when assessment functionality is required. The Bedrock and None backends eliminate assessment capability because they produce no confidence data.
+
+## Text Confidence Data and Assessment Integration
+
+Pattern 2's assessment feature relies on text confidence data generated during the OCR phase to evaluate extraction quality and provide confidence scores for each extracted attribute.
+
+### How Text Confidence Data Enables Assessment
+
+1. **OCR Phase**: Textract generates confidence scores for each text block during document processing
+2. **Condensed Format**: The OCR service creates optimized `textConfidence.json` files with 80-90% token reduction
+3. **Assessment Phase**: The assessment LLM analyzes extraction results against the OCR confidence data to provide accurate confidence evaluation
+4. **UI Integration**: Assessment results appear in the web interface with color-coded confidence indicators
+
+### Assessment Workflow Impact by OCR Backend
+
+**With Textract Backend:**
+```
+Document → Textract OCR → Rich Confidence Data → Assessment LLM → Confidence Scores
+```
+- The assessment LLM receives detailed confidence information for each text region
+- It can accurately evaluate extraction confidence based on OCR quality
+- It provides meaningful confidence scores and explanations
+
+**With Bedrock/None Backend:**
+```
+Document → LLM/No OCR → Empty Confidence Data → Assessment Disabled
+```
+- No confidence data is available for assessment
+- The assessment feature cannot function without OCR confidence scores
+- Assessment is skipped or disabled
 
 ## Adding Configurations
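The 80-90% token reduction mentioned in step 2 above comes from stripping Textract output down to what the assessment step actually consumes. A hypothetical sketch (the real `textConfidence.json` schema is not shown in this commit; the condensed field names are assumptions):

```python
import json

def condense_text_confidence(textract_blocks: list) -> list:
    """Keep only what assessment needs: the text plus a rounded confidence."""
    condensed = []
    for block in textract_blocks:
        if block.get("BlockType") != "LINE":
            continue
        condensed.append({
            "text": block["Text"],
            "confidence": round(block["Confidence"], 1),
        })
    return condensed

# A Textract LINE block also carries geometry, ids, and relationships that the
# assessment LLM never reads; dropping them is what shrinks the token count.
blocks = [
    {"BlockType": "PAGE", "Id": "p1"},
    {
        "BlockType": "LINE",
        "Text": "Account Number: 1234",
        "Confidence": 99.123,
        "Geometry": {"BoundingBox": {"Width": 0.4, "Height": 0.02, "Left": 0.1, "Top": 0.2}},
        "Id": "l1",
        "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}],
    },
]
condensed = condense_text_confidence(blocks)
assert len(json.dumps(condensed)) < len(json.dumps(blocks))
print(condensed)  # [{'text': 'Account Number: 1234', 'confidence': 99.1}]
```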

config_library/pattern-2/bank-statement-sample/config.yaml

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,8 +3,15 @@
33

44
notes: Default settings
55
ocr:
6+
backend: "textract" # Default to Textract for backward compatibility
7+
model_id: "us.anthropic.claude-3-7-sonnet-20250219-v1:0"
8+
system_prompt: "You are an expert OCR system. Extract all text from the provided image accurately, preserving layout where possible."
9+
task_prompt: "Extract all text from this document image. Preserve the layout, including paragraphs, tables, and formatting."
610
features:
711
- name: LAYOUT
12+
image:
13+
target_width: '951'
14+
target_height: '1268'
815
classes:
916
- name: Bank Statement
1017
description: Monthly bank account statement
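The `target_width`/`target_height` values in this config feed the aspect-ratio-preserving, downsize-only scaling described in the 0.3.4 changelog entry. A minimal sketch of that logic (the function name and signature here are illustrative, not the package's API):

```python
def resize_dims(width: int, height: int, target_width: int, target_height: int) -> tuple:
    """Aspect-ratio-preserving scaling that only ever downsizes.

    The scale factor is the tighter of the two axis ratios; when it is >= 1.0
    the image already fits the target box, so it is returned untouched
    (no upscaling, no distortion).
    """
    scale = min(target_width / width, target_height / height)
    if scale >= 1.0:
        return width, height
    return max(1, round(width * scale)), max(1, round(height * scale))

# A 2550x3300 page (US Letter at 300 DPI) shrunk to fit the 951x1268 default:
print(resize_dims(2550, 3300, 951, 1268))  # (951, 1231)
# A smaller image is left alone rather than upscaled:
print(resize_dims(800, 600, 951, 1268))    # (800, 600)
```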
