aws-solutions-library-samples
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 12 additions & 0 deletions
diff --git a/docs/classification.md b/docs/classification.md
Lines changed: 141 additions & 0 deletions
diff --git a/lib/idp_common_pkg/idp_common/classification/README.md b/lib/idp_common_pkg/idp_common/classification/README.md
Lines changed: 163 additions & 0 deletions
@@ -6,6 +6,7 @@ SPDX-License-Identifier: MIT-0
 ## [Unreleased]
 
 ### Added
+
 - **Intelligent Document Discovery Module for Automated Configuration Generation**
   - Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
   - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
@@ -17,6 +18,17 @@ SPDX-License-Identifier: MIT-0
   - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
   - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
 
+- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
+  - Added support for optional regex patterns in document class definitions for performance optimization
+  - **Document Name Regex**: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
+  - **Document Page Content Regex**: Match against page text content during multi-modal page-level classification for fast page classification
+  - **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents; deterministic classification results for known document patterns; seamless fallback to existing LLM classification when regex patterns don't match
+  - **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions with automatic regex compilation and validation
+  - **Logging**: Comprehensive info-level logging when regex patterns match for observability and debugging
+  - **CloudFormation Integration**: Updated Pattern-2 schema to support regex configuration through the Web UI
+  - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
+  - **Documentation**: Enhanced classification module README and main documentation with regex usage examples and best practices
+  
 - **Windows WSL Development Environment Setup Guide**
   - Added WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
   - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
 
@@ -577,6 +577,143 @@ The classification service uses the new `extract_structured_data_from_text()` fu
 - Handles malformed content gracefully
 - Returns both parsed data and detected format for logging
 
+## Regex-Based Classification for Performance Optimization
+
+Pattern 2 now supports optional regex-based classification that can provide significant performance improvements and cost savings by bypassing LLM calls when document patterns are recognized.
+
+### Document Name Regex (All Pages Same Class)
+
+When you want all pages of a document to be classified as the same class, you can use document name regex to instantly classify entire documents based on their filename or ID:
+
+```yaml
+classes:
+  - name: Payslip
+    description: "Employee wage statement showing earnings and deductions"
+    document_name_regex: "(?i).*(payslip|paystub|salary|wage).*"
+    attributes:
+      - name: EmployeeName
+        description: "Name of the employee"
+        attributeType: simple
+```
+
+**Benefits:**
+- **Instant Classification**: Entire document classified without any LLM calls
+- **Massive Performance Gains**: ~100-1000x faster than LLM classification
+- **Zero Token Usage**: Complete elimination of API costs for matched documents
+- **Deterministic Results**: Consistent classification for known patterns
+
+**When the document ID matches the pattern:**
+- All pages are immediately classified as the matching class
+- A single section is created containing all pages
+- No backend service calls are made
+- Info-level logging confirms the regex match
+
+### Page Content Regex (Multi-Modal Page-Level Classification)
+
+For multi-class configurations using page-level classification, you can use page content regex to classify individual pages based on text patterns:
+
+```yaml
+classification:
+  classificationMethod: multimodalPageLevelClassification
+
+classes:
+  - name: Invoice
+    description: "Business invoice document"
+    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
+    attributes:
+      - name: InvoiceNumber
+        description: "Invoice number"
+        attributeType: simple
+  - name: Payslip
+    description: "Employee wage statement"
+    document_page_content_regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
+    attributes:
+      - name: EmployeeName
+        description: "Employee name"
+        attributeType: simple
+  - name: Other
+    description: "Documents that don't match specific patterns"
+    # No regex - will always use LLM
+    attributes: []
+```
+
+**Benefits:**
+- **Selective Performance Gains**: Pages matching patterns are classified instantly
+- **Mixed Processing**: Some pages use regex, others fall back to LLM
+- **Cost Optimization**: Reduced token usage proportional to regex matches
+- **Maintained Accuracy**: LLM fallback ensures all pages are properly classified
+
+**How it works:**
+- Each page's text content is checked against all class regex patterns
+- First matching pattern wins and classifies the page instantly
+- Pages with no matches use standard LLM classification
+- Results are seamlessly integrated into document sections
+
+### Regex Pattern Best Practices
+
+1. **Case-Insensitive Matching**: Always use `(?i)` flag
+   ```regex
+   (?i).*(invoice|bill).*  # Matches any case variation
+   ```
+
+2. **Flexible Whitespace**: Use `\s+` for varying spaces/tabs (written as `\\s+` inside double-quoted YAML strings)
+   ```regex
+   (?i)(gross\s+pay|net\s+pay)  # Handles "gross pay", "gross  pay"
+   ```
+
+3. **Multiple Alternatives**: Use `|` for different terms
+   ```regex
+   (?i).*(payslip|paystub|salary|wage).*  # Any of these terms
+   ```
+
+4. **Balanced Specificity**: Specific enough to avoid false matches
+   ```regex
+   # Good: Specific to W2 forms
+   (?i)(form\s+w-?2|wage\s+and\s+tax|employer\s+identification)
+   
+   # Too broad: Could match many documents
+   (?i)(form|wage|tax)
+   ```
+
+### Performance Analysis
+
+Use `notebooks/examples/step2_classification_with_regex.ipynb` to:
+- Test regex patterns against your documents
+- Compare processing speeds (regex vs LLM)
+- Analyze cost savings through token usage reduction
+- Validate classification accuracy
+- Debug pattern matching behavior
+
+### Error Handling
+
+The regex system includes robust error handling:
+- **Invalid Patterns**: Compilation errors are logged and the system falls back to LLM classification
+- **Runtime Failures**: Pattern matching errors default to LLM classification
+- **Graceful Degradation**: The service keeps working even when a regex is invalid
+- **Comprehensive Logging**: Detailed logs for debugging pattern issues
+
+### Configuration Examples
+
+**Common Document Types:**
+```yaml
+classes:
+  # W2 Tax Forms
+  - name: W2
+    document_page_content_regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax|social\\s+security)"
+    
+  # Bank Statements  
+  - name: Bank-Statement
+    document_page_content_regex: "(?i)(account\\s+number|statement\\s+period|beginning\\s+balance)"
+    
+  # Driver Licenses
+  - name: US-drivers-licenses
+    document_page_content_regex: "(?i)(driver\\s+license|state\\s+id|date\\s+of\\s+birth)"
+    
+  # Invoices
+  - name: Invoice
+    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|remit\\s+payment)"
+```
+
 ## Best Practices for Classification
 
 1. **Provide Clear Class Descriptions**: Include distinctive features and common elements
@@ -595,3 +732,7 @@ The classification service uses the new `extract_structured_data_from_text()` fu
 14. **Test Segmentation Logic**: Verify that documents are correctly separated by reviewing section boundaries in the results
 15. **Consider Document Flow**: Ensure your document classes account for typical document structures (headers, body, footers)
 16. **Leverage BIO-like Tagging**: Take advantage of the automatic boundary detection to eliminate manual document splitting
+17. **Use Regex for Known Patterns**: Add regex patterns for document types with predictable content or naming conventions
+18. **Test Regex Thoroughly**: Validate regex patterns against diverse document samples before production use
+19. **Balance Regex Specificity**: Make patterns specific enough to avoid false matches but flexible enough to catch variations
+20. **Monitor Regex Performance**: Track how often regex patterns match vs fall back to LLM classification
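The page-level flow documented in this file (first matching pattern wins, LLM fallback otherwise) and the match-rate tracking suggested in item 20 can be sketched as follows. The patterns come from the examples in this diff. Everything else is an assumption for illustration: `classify_page` and `fake_llm` are hypothetical names, and the use of `search` rather than `match` is a guess about how the service applies page content patterns.

```python
import re
from collections import Counter

# Patterns copied from the examples in this diff; order defines priority.
PAGE_PATTERNS = [
    ("Invoice", re.compile(r"(?i)(invoice\s+number|bill\s+to|amount\s+due)")),
    ("Payslip", re.compile(r"(?i)(gross\s+pay|net\s+pay|employee\s+id)")),
]


def classify_page(text, llm_fallback):
    """Return (class_name, method); the first matching pattern wins."""
    for name, pattern in PAGE_PATTERNS:
        if pattern.search(text):
            return name, "regex"
    # No pattern matched: defer to the (much slower) LLM backend.
    return llm_fallback(text), "llm"


def fake_llm(text):
    # Hypothetical stand-in for the real LLM classification call.
    return "Other"


pages = [
    "INVOICE NUMBER: 42\nBill To: ACME Corp",
    "Gross  Pay 1,000.00",
    "Dear customer, thank you for your letter.",
]
results = [classify_page(page, fake_llm) for page in pages]

# Track how often regex classified a page versus falling back to the LLM.
stats = Counter(method for _, method in results)
print(f"regex: {stats['regex']}, llm: {stats['llm']}")
```

A low regex share over time suggests the patterns are too narrow for the incoming documents, while a high share combined with misclassifications suggests they are too broad.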
@@ -10,6 +10,9 @@ This module provides document classification capabilities for the IDP Accelerato
 - Classification of documents using multiple backend options:
   - Amazon Bedrock LLMs
   - SageMaker UDOP models
+- **Optional regex-based classification for enhanced performance**
+  - Document name regex matching when all pages should be classified as the same class
+  - Page content regex matching for multi-modal page-level classification
 - Direct integration with the Document data model
 - Support for both text and image content
 - Concurrent processing of multiple pages
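As a rough illustration of the document name regex feature listed above, the sketch below assigns one class to every page when the document ID matches. It assumes class definitions are plain dictionaries; `classify_by_document_name` is a hypothetical helper, not a function exported by this module.

```python
import re


def classify_by_document_name(document_id, classes):
    """Return the first class whose document_name_regex matches, else None."""
    for cls in classes:
        pattern = cls.get("document_name_regex")
        if pattern and re.search(pattern, document_id):
            return cls["name"]
    return None


classes = [
    {"name": "Payslip", "document_name_regex": r"(?i).*(payslip|paystub|salary|wage).*"},
    {"name": "W2-Form", "document_name_regex": r"(?i).*w-?2.*"},
    {"name": "Other"},  # no regex: only reachable through LLM classification
]

page_ids = ["1", "2", "3"]
doc_class = classify_by_document_name("2024-03_PaySlip_jdoe.pdf", classes)

# On a match, every page gets the same class and no LLM call is needed.
page_classes = {pid: doc_class for pid in page_ids} if doc_class else {}
print(page_classes)
```

When the helper returns `None`, the caller would continue with the regular page-level or holistic classification path.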
@@ -53,6 +56,166 @@ Page 6: type="invoice", boundary="continue"   → Section 3 (Invoice #2)
 
 The system automatically creates three sections, properly separating the two invoices despite them having the same document type.
 
+## Regex-Based Classification for Enhanced Performance
+
+The classification service now supports optional regex-based pattern matching to provide significant performance improvements and deterministic classification for known document patterns. This feature enables instant classification without LLM API calls when regex patterns match.
+
+### Document Name Regex Classification
+
+When you want all pages of a document to be classified the same way, document name regex patterns can instantly classify entire documents based on their filename or ID:
+
+```yaml
+classes:
+  - name: Payslip
+    description: "Employee wage statement showing earnings and deductions"
+    document_name_regex: "(?i).*(payslip|paystub|salary|wage).*"
+    attributes:
+      - name: EmployeeName
+        description: "Name of the employee"
+        attributeType: simple
+```
+
+**How it works:**
+- Works with any number of document classes defined in configuration
+- When the document ID matches the regex pattern, all pages are classified as that class
+- Skips all LLM processing for massive performance gains
+- Provides info-level logging when matches occur
+
+### Page Content Regex Classification
+
+For multi-modal page-level classification, page content regex patterns can classify individual pages based on text content:
+
+```yaml
+classes:
+  - name: Invoice
+    description: "Business invoice document"
+    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
+    attributes:
+      - name: InvoiceNumber
+        description: "Invoice number"
+        attributeType: simple
+  - name: Payslip
+    description: "Employee wage statement"  
+    document_page_content_regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
+    attributes:
+      - name: EmployeeName
+        description: "Employee name"
+        attributeType: simple
+```
+
+**How it works:**
+- Only applies to multi-modal page-level classification method
+- Each page's text content is checked against all class regex patterns
+- First matching pattern wins and classifies the page instantly
+- Falls back to LLM classification when no patterns match
+- Provides info-level logging when matches occur
+
+### Configuration Options
+
+Both regex types are optional and can be used together:
+
+```yaml
+classes:
+  - name: W2-Form
+    description: "W2 tax form with wage and tax information"
+    # Both regex types can be specified
+    document_name_regex: "(?i).*w-?2.*"  # For single-class scenarios
+    document_page_content_regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax)"  # For page-level
+    attributes:
+      - name: EmployerEIN
+        description: "Employer identification number"
+        attributeType: simple
+```
+
+### Performance Benefits
+
+**Speed Improvements:**
+- Regex matching is nearly instantaneous compared to LLM calls
+- Document name regex: ~100-1000x faster (entire document classified instantly)
+- Page content regex: ~10-50x faster per matched page
+
+**Cost Savings:**
+- Zero token usage for regex-matched classifications
+- No Bedrock/SageMaker API calls for matched patterns
+- Significant cost reduction for documents with recognizable patterns
+
+**Deterministic Results:**
+- Consistent classification results for pattern-matched documents
+- Eliminates LLM variability for known document types
+- Reliable classification for high-volume processing scenarios
+
+### Best Practices for Regex Patterns
+
+1. **Case-Insensitive Matching**: Use `(?i)` flag for robust matching
+   ```regex
+   (?i).*(invoice|bill).*  # Matches "Invoice", "INVOICE", "bill", "BILL"
+   ```
+
+2. **Flexible Whitespace**: Use `\s+` for varying whitespace (written as `\\s+` inside double-quoted YAML strings)
+   ```regex
+   (?i)(gross\s+pay|net\s+pay)  # Matches "gross pay", "gross  pay", "GROSS PAY"
+   ```
+
+3. **Multiple Alternatives**: Use `|` for different possible terms
+   ```regex
+   (?i).*(payslip|paystub|salary|wage).*  # Matches any of these terms
+   ```
+
+4. **Specific Enough**: Balance specificity to avoid false matches
+   ```regex
+   # Good: Specific to payslips
+   (?i)(gross\s+pay|employee\s+id|pay\s+period)
+   
+   # Too broad: Could match many document types
+   (?i)(pay|id|period)
+   ```
+
+### Error Handling
+
+The regex system includes comprehensive error handling:
+
+- **Compilation Errors**: Invalid regex patterns are logged and ignored, and classification falls back to the LLM
+- **Runtime Errors**: Regex matching failures fall back to standard classification
+- **Graceful Degradation**: System continues to work normally even with invalid patterns
+- **Detailed Logging**: Debug and error logs help with pattern troubleshooting
+
+### Integration Example
+
+```python
+from idp_common import classification, get_config
+from idp_common.models import Document
+
+# Load configuration with regex patterns
+config = get_config()
+
+# Initialize service - regex patterns are automatically used
+service = classification.ClassificationService(
+    region="us-east-1",
+    config=config,
+    backend="bedrock"
+)
+
+# `document` is assumed to be an existing Document (for example, one produced by the OCR step)
+# Classification automatically uses regex when patterns match
+document = service.classify_document(document)
+
+# Check if regex was used
+for page_id, page in document.pages.items():
+    metadata = getattr(page, 'metadata', {})
+    if metadata.get('regex_matched', False):
+        print(f"Page {page_id} was classified using regex patterns")
+    else:
+        print(f"Page {page_id} was classified using LLM")
+```
+
+### Demonstration Notebook
+
+See `notebooks/examples/step2_classification_with_regex.ipynb` for interactive demonstrations of:
+- Document name regex classification
+- Page content regex classification  
+- Performance comparisons between regex and LLM methods
+- Configuration examples and best practices
+- Error handling scenarios
+
 ## Usage Example
 
 ### Using with Bedrock LLMs (Default)