aws-solutions-library-samples
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
Lines changed: 2 additions & 1 deletion
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 21 additions & 1 deletion
diff --git a/docs/discovery.md b/docs/discovery.md
Lines changed: 32 additions & 170 deletions
diff --git a/lib/idp_common_pkg/setup.py b/lib/idp_common_pkg/setup.py
Lines changed: 11 additions & 0 deletions
@@ -56,10 +56,11 @@ integration_tests:
   #   AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION}
   #   IDP_ACCOUNT_ID: ${IDP_ACCOUNT_ID}
 
+ # Add rules to only run on develop branch
  # Add rules to only run on develop branch
   rules:
     - if: $CI_COMMIT_BRANCH == "develop"
-      when: always # manual # When idp-accelerator CICD is reconfigured
+      when: always # always # When idp-accelerator CICD is reconfigured
     - if: $CI_COMMIT_BRANCH =~ /^feature\/.*/
       when: always
     - if: $CI_COMMIT_BRANCH =~ /^fix\/.*/
 
@@ -6,6 +6,18 @@ SPDX-License-Identifier: MIT-0
 ## [Unreleased]
 
 ### Added
+
+- **Intelligent Document Discovery Module for Automated Configuration Generation**
+  - Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
+  - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
+  - **Dual Discovery Methods**: Discovery without ground truth (exploratory analysis) and with ground truth (optimization using labeled data)
+  - **Automated Blueprint Creation**: Pattern 1 includes zero-touch BDA blueprint generation with intelligent change detection and version management
+  - **Web UI Integration**: Real-time discovery job monitoring, interactive results review, and seamless configuration integration
+  - **Advanced Features**: Multi-model support (Nova, Claude), customizable prompts, configurable parameters, ground truth processing, schema conversion, and lifecycle management
+  - **Key Benefits**: Rapid new document type onboarding, reduced time-to-production, configuration optimization, and automated workflow bootstrapping
+  - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
+  - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
+
 - **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
   - Added support for optional regex patterns in document class definitions for performance optimization
   - **Document Name Regex**: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
@@ -16,21 +28,29 @@ SPDX-License-Identifier: MIT-0
   - **CloudFormation Integration**: Updated Pattern-2 schema to support regex configuration through the Web UI
   - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
   - **Documentation**: Enhanced classification module README and main documentation with regex usage examples and best practices
-
+  
 - **Windows WSL Development Environment Setup Guide**
   - Added WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
   - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
   - **Integrated Workflow**: Development setup combining Windows tools (VS Code, browsers) with native Linux environment
   - **Target Use Cases**: Windows developers needing Linux compatibility without Docker Desktop or VM overhead
 
 ### Fixed
+- **Security Vulnerability Mitigation - Package Updates**
+
 - **GovCloud Compatibility - Hardcoded Service Domain References**
   - Fixed hardcoded `amazonaws.com` references in CloudFormation templates that prevented GovCloud deployment
   - Updated all service principals and endpoints to use dynamic `${AWS::URLSuffix}` expressions for automatic region-based resolution
   - **Templates Updated**: `template.yaml` (main template), `patterns/pattern-3/sagemaker_classifier_endpoint.yaml`
   - **Services Fixed**: EventBridge, Cognito, SageMaker, ECR, CloudFront, CodeBuild, AppSync, Lambda, DynamoDB, CloudWatch Logs, Glue
   - Resolves GitHub Issue #50 - templates now deploy correctly in both standard AWS and GovCloud regions
 
+- **Bug Fixes and Code Improvements**
+  - Fixed HITL processing errors in both Pattern-1 (DynamoDB validation with empty strings) and Pattern-2 (string indices error in A2I output processing)
+  - Fixed Step Function UI issues including auto-refresh button auto-disable and fetch failures for failed executions with datetime serialization errors
+  - Cleaned up unused Step Function subscription infrastructure and removed duplicate code in Pattern-2 HITL function
+  - Expanded UI Visual Editor bounding box size with padding for better visibility and user interaction
+
 
 ## [0.3.14]
 
 
@@ -57,7 +57,9 @@ This dual approach ensures discovery insights can be leveraged across different
 - [Troubleshooting](#troubleshooting)
   - [Common Issues](#common-issues)
   - [Error Handling](#error-handling)
-  - [Performance Optimization](#performance-optimization)
+- [Limitations](#limitations)
+  - [Known Limitations](#known-limitations)
+  
 
 ## Overview
 
@@ -133,6 +135,10 @@ This analysis produces structured configuration templates that can be used to co
 - **Metadata Storage**: Pattern-neutral job information and progress tracking
 - **Event Coordination**: Enables real-time updates and pattern-specific notifications
 
+**Configuration Table:**
+- **Metadata Storage**: Discovered classes are stored in the configuration table as "custom" configuration classes
+
+
 ### Pattern-Specific Implementations
 
 #### Pattern 1: BDA Blueprint Automation
@@ -176,8 +182,6 @@ The discovery system is designed to support additional patterns through:
 # Configuration event structure (pattern-agnostic)
 {
     "eventType": "CONFIGURATION_UPDATE",
-    "pattern": "pattern-X",
-    "discoveryJobId": "job-12345",
     "documentClasses": [...],
     "metadata": {...}
 }
@@ -209,28 +213,24 @@ graph TD
     G --> H
     H --> I[Structure Extraction]
     I --> J[Pattern-Neutral Configuration]
-    J --> K{Target Pattern?}
-    K -->|Pattern 1| L[BDA Blueprint Creation]
-    K -->|Pattern 2/3| M[Direct Config Update]
-    K -->|New Pattern| N[Custom Handler]
-    L --> O[Job Completion]
-    M --> O
-    N --> O
-    O --> P[UI Notification]
+    J --> K[Configuration Table Update]
+    K --> P[Job Completion]
+    P --> O[UI Notification]
 ```
 
 #### Pattern 1: BDA Blueprint Automation Flow
 ```mermaid
 graph TD
-    A[Configuration Update Event] --> B[BDA Discovery Function]
-    B --> C[BDA Blueprint Service]
-    C --> D{Blueprint Exists?}
-    D -->|Yes| E[Check for Changes]
-    D -->|No| F[Create New Blueprint]
-    E -->|Changes Found| G[Update Blueprint]
-    E -->|No Changes| H[Skip Update]
-    F --> I[Schema Converter]
-    G --> I
+    A[View/Edit Configuration UI] --> B[Save Changes]
+    B --> C[Configuration Update Event]
+    C --> D[BDA Discovery Lambda - Blueprint Service]
+    D --> E{Blueprint Exists?}
+    E -->|Yes| F[Check for Changes]
+    E -->|No| G[Create New Blueprint]
+    F -->|Changes Found| H[Update Blueprint]
+    F -->|No Changes| N[Skip Update]
+    G --> I[Schema Converter]
+    H --> I[Schema Converter]
     I --> J[Generate BDA Schema]
     J --> K[Create/Update in BDA]
     K --> L[Create Blueprint Version]
@@ -492,7 +492,7 @@ discovery:
 
 **Group Types:**
 - `normal` - Standard field groupings
-- `Table` - Repeating tabular data structures
+- `List` - Repeating tabular data structures
 
 ## Using the Discovery Module
 
@@ -609,25 +609,7 @@ Configuration events are triggered when discovery jobs complete and contain patt
 {
   "eventType": "CONFIGURATION_UPDATE",
   "source": "discovery-processor",
-  "pattern": "pattern-1",
-  "discoveryJobId": "discovery-job-12345",
-  "timestamp": "2024-01-15T10:30:00Z",
-  "documentClasses": [
-    {
-      "name": "W4Form",
-      "description": "Employee withholding certificate",
-      "groups": [...],
-      "metadata": {
-        "confidence": 0.95,
-        "model_used": "us.amazon.nova-pro-v1:0"
-      }
-    }
-  ],
-  "processingMetadata": {
-    "groundTruthUsed": true,
-    "processingTime": "45.2s",
-    "documentCount": 1
-  }
+  "timestamp": "2024-01-15T10:30:00Z"
 }
 ```
 
@@ -695,10 +677,10 @@ def pattern_x_configuration_handler(event, context):
     """
     try:
         # Extract discovery results from event
-        document_classes = event.get('documentClasses', [])
+        custom_classes = load_custom_classes()  # hypothetical helper: retrieve custom classes from the configuration table
         
         # Transform to pattern-specific format
-        pattern_config = transform_to_pattern_x_format(document_classes)
+        pattern_config = transform_to_pattern_x_format(custom_classes)
         
         # Update pattern-specific configuration
         update_pattern_x_configuration(pattern_config)
@@ -731,51 +713,6 @@ ConfigurationEventSource:
     FunctionName: !Ref PatternXConfigurationHandler
 ```
 
-#### Step 3: Implement Schema Transformation
-```python
-class PatternXSchemaConverter:
-    """
-    Converts pattern-neutral discovery results to Pattern X format.
-    """
-    
-    def convert(self, discovery_result):
-        """
-        Transform discovery document class to Pattern X configuration.
-        """
-        pattern_x_config = {
-            "documentType": discovery_result["name"],
-            "description": discovery_result["description"],
-            "extractionRules": []
-        }
-        
-        # Transform groups and fields
-        for group in discovery_result.get("groups", []):
-            extraction_rule = self._convert_group_to_rule(group)
-            pattern_x_config["extractionRules"].append(extraction_rule)
-        
-        return pattern_x_config
-```
-
-#### Step 4: Integration Points
-```python
-# Configuration update integration
-def update_pattern_x_configuration(config):
-    """
-    Update Pattern X configuration with discovery results.
-    """
-    # Store in configuration database
-    config_table.put_item(
-        Item={
-            "ConfigurationType": "PatternX",
-            "DocumentClasses": config,
-            "UpdatedAt": datetime.utcnow().isoformat()
-        }
-    )
-    
-    # Trigger any pattern-specific post-processing
-    notify_pattern_x_services(config)
-```
-
 ### Benefits of the Generic Event System
 
 **🔄 Loose Coupling**: Patterns can implement discovery integration independently
@@ -1278,90 +1215,15 @@ def discovery_with_fallback(discovery_service, document_key, ground_truth_key=No
         )
 ```
 
-### Performance Optimization
+## Limitations
 
-**Document Preprocessing:**
-```python
-def optimize_document_for_discovery(document_path):
-    """Optimize document for better discovery performance."""
-    # Resize images to optimal dimensions
-    if document_path.lower().endswith(('.jpg', '.jpeg', '.png')):
-        optimize_image_resolution(document_path, target_dpi=150)
-    
-    # Split large PDFs into manageable sections
-    elif document_path.lower().endswith('.pdf'):
-        page_count = get_pdf_page_count(document_path)
-        if page_count > 10:
-            return split_pdf_into_sections(document_path, max_pages=10)
-    
-    return [document_path]
-```
-
-**Batch Processing:**
-```python
-def batch_discovery_processing(document_list, batch_size=5):
-    """Process multiple documents efficiently."""
-    results = []
-    
-    for i in range(0, len(document_list), batch_size):
-        batch = document_list[i:i + batch_size]
-        
-        # Process batch concurrently
-        with ThreadPoolExecutor(max_workers=batch_size) as executor:
-            futures = [
-                executor.submit(process_single_document, doc)
-                for doc in batch
-            ]
-            
-            batch_results = [
-                future.result() for future in as_completed(futures)
-            ]
-            
-        results.extend(batch_results)
-        
-        # Rate limiting between batches
-        time.sleep(1)
-    
-    return results
-```
-
-**Caching and Reuse:**
-```python
-def cached_discovery_analysis(document_hash, config_hash):
-    """Cache discovery results for reuse."""
-    cache_key = f"discovery:{document_hash}:{config_hash}"
-    
-    # Check cache first
-    cached_result = get_from_cache(cache_key)
-    if cached_result:
-        return cached_result
-    
-    # Perform discovery if not cached
-    result = perform_discovery_analysis()
-    
-    # Cache result for future use
-    set_cache(cache_key, result, ttl=3600)  # 1 hour TTL
-    
-    return result
-```
+### Known Limitations
+**Configuration Table**
+- The Discovery feature stores all custom classes as a single array in the Configuration table under the "custom" key.
+- DynamoDB has a hard limit of 400 KB per item, so a large set of classes can exceed it. Storing the classes across multiple DynamoDB items would require a refactor.
+
+**Discovery Output Format**
+- Discovery output is handled as configuration through View/Edit Configuration, so its JSON must follow the custom classes format. Output in any other format will result in failure.
 
-**Monitoring and Metrics:**
-```python
-def track_discovery_metrics(job_id, start_time, result):
-    """Track discovery performance metrics."""
-    processing_time = time.time() - start_time
-    
-    metrics = {
-        'job_id': job_id,
-        'processing_time_seconds': processing_time,
-        'fields_discovered': count_discovered_fields(result),
-        'groups_identified': count_groups(result),
-        'model_used': result.get('metadata', {}).get('model_id'),
-        'success': result.get('status') == 'SUCCESS'
-    }
-    
-    # Send to CloudWatch or monitoring system
-    publish_metrics(metrics)
-```
 
 The Discovery module provides a powerful foundation for understanding and processing new document types. By following these guidelines and best practices, you can effectively leverage the module to bootstrap document processing workflows and continuously improve their accuracy and coverage.
@@ -79,6 +79,13 @@
         "ipykernel>=6.29.5,<7.0.0",
         "jupyter>=1.1.1,<2.0.0",
     ],
+    # Agents module dependencies
+    "agents": [
+        "strands-agents>=1.0.0",
+        "strands-agents-tools>=0.2.2",
+        "bedrock-agentcore>=0.1.1",  # Specifically for the code interpreter tool
+        "regex>=2024.0.0,<2026.0.0",  # Pin regex version to avoid conflicts
+    ],
     # Full package with all dependencies
     "all": [
         "Pillow==11.2.1",
@@ -91,6 +98,10 @@
         "pyarrow==20.0.0",
         "openpyxl==3.1.5",
         "python-docx==1.2.0",
+        "strands-agents>=1.0.0",
+        "strands-agents-tools>=0.2.2",
+        "bedrock-agentcore>=0.1.1",
+        "regex>=2024.0.0,<2026.0.0",
     ],
 }
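
Review note on the Known Limitations added to `docs/discovery.md`: the single-item "custom" array is bounded by DynamoDB's 400 KB item limit. The sketch below shows one way the suggested refactor could split classes across several items. It is a sketch under stated assumptions, not the project's implementation: `approx_size_bytes`, `chunk_custom_classes`, `to_items`, the `Configuration` attribute name, the `custom#N` key scheme, and the 16 KB safety margin are all hypothetical, and the JSON byte length only approximates how DynamoDB measures item size.

```python
import json

# DynamoDB's documented maximum item size is 400 KB. The safety margin is an
# assumption that leaves room for key attributes and encoding differences.
DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024
SAFETY_MARGIN_BYTES = 16 * 1024
DEFAULT_BUDGET = DYNAMODB_ITEM_LIMIT_BYTES - SAFETY_MARGIN_BYTES


def approx_size_bytes(value):
    """Approximate stored size as the UTF-8 length of compact JSON."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def chunk_custom_classes(classes, budget=DEFAULT_BUDGET):
    """Greedily group classes so each group's estimated size fits the budget."""
    chunks, current, current_size = [], [], 0
    for cls in classes:
        size = approx_size_bytes(cls)
        if size > budget:
            # A single class that cannot fit needs a different storage strategy.
            raise ValueError(f"class {cls.get('name')!r} alone exceeds the budget")
        if current and current_size + size > budget:
            chunks.append(current)
            current, current_size = [], 0
        current.append(cls)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


def to_items(chunks, base_key="custom"):
    """Build one item per chunk. The key naming scheme is hypothetical."""
    return [
        {"Configuration": f"{base_key}#{index}", "classes": chunk}
        for index, chunk in enumerate(chunks)
    ]
```

Readers of the configuration would then need to query every `custom#N` item and concatenate the `classes` arrays, which is the part of the refactor this sketch leaves out.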