Commit 09071e5

Author: Bob Strahan
Merge branch 'develop' of ssh.gitlab.aws.dev:genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator into develop
2 parents 34ba7ae + cedc959 commit 09071e5

8 files changed

+945
-2
lines changed

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
@@ -6,6 +6,7 @@ SPDX-License-Identifier: MIT-0
66
## [Unreleased]
77

88
### Added
9+
910
- **Intelligent Document Discovery Module for Automated Configuration Generation**
1011
- Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
1112
- **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
@@ -17,6 +18,17 @@ SPDX-License-Identifier: MIT-0
1718
- **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
1819
- **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
1920

21+
- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
22+
- Added support for optional regex patterns in document class definitions for performance optimization
23+
- **Document Name Regex**: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
24+
- **Document Page Content Regex**: Match against page text content during multi-modal page-level classification for fast page classification
25+
- **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents, deterministic classification results for known document patterns, seamless fallback to existing LLM classification when regex patterns don't match
26+
- **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions with automatic regex compilation and validation
27+
- **Logging**: Comprehensive info-level logging when regex patterns match for observability and debugging
28+
- **CloudFormation Integration**: Updated Pattern-2 schema to support regex configuration through the Web UI
29+
- **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
30+
- **Documentation**: Enhanced classification module README and main documentation with regex usage examples and best practices
31+
2032
- **Windows WSL Development Environment Setup Guide**
2133
- Added WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
2234
- **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI

docs/classification.md

Lines changed: 141 additions & 0 deletions
@@ -577,6 +577,143 @@ The classification service uses the new `extract_structured_data_from_text()` fu
577577
- Handles malformed content gracefully
578578
- Returns both parsed data and detected format for logging
579579

580+
## Regex-Based Classification for Performance Optimization
581+
582+
Pattern 2 now supports optional regex-based classification that can provide significant performance improvements and cost savings by bypassing LLM calls when document patterns are recognized.
583+
584+
### Document Name Regex (All Pages Same Class)
585+
586+
When you want all pages of a document to be classified as the same class, you can use document name regex to instantly classify entire documents based on their filename or ID:
587+
588+
```yaml
classes:
  - name: Payslip
    description: "Employee wage statement showing earnings and deductions"
    document_name_regex: "(?i).*(payslip|paystub|salary|wage).*"
    attributes:
      - name: EmployeeName
        description: "Name of the employee"
        attributeType: simple
```
598+
599+
**Benefits:**
600+
- **Instant Classification**: Entire document classified without any LLM calls
601+
- **Massive Performance Gains**: ~100-1000x faster than LLM classification
602+
- **Zero Token Usage**: Complete elimination of API costs for matched documents
603+
- **Deterministic Results**: Consistent classification for known patterns
604+
605+
**When document ID matches the pattern:**
606+
- All pages are immediately classified as the matching class
607+
- Single section is created containing all pages
608+
- No backend service calls are made
609+
- Info logging confirms regex match
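
The behaviour listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the accelerator's implementation: the function name `classify_by_document_name` and the plain-dict class definitions are hypothetical, and the sketch assumes patterns are applied with `re.search`.

```python
import re
from typing import Dict, List, Optional


def classify_by_document_name(document_id: str, classes: List[Dict]) -> Optional[str]:
    """Return the first class whose document_name_regex matches the ID, else None.

    Hypothetical helper: a None result means the caller should fall back
    to the regular LLM classification path.
    """
    for cls in classes:
        pattern = cls.get("document_name_regex")
        if pattern and re.search(pattern, document_id):
            return cls["name"]
    return None


# Class definitions mirroring the YAML example (Python raw strings use a single backslash)
classes = [
    {"name": "Payslip", "document_name_regex": r"(?i).*(payslip|paystub|salary|wage).*"},
    {"name": "Other"},  # no regex configured
]

print(classify_by_document_name("2024-03_PaySlip_jdoe.pdf", classes))
print(classify_by_document_name("lease_agreement.pdf", classes))
```

When the helper returns a class name, every page of the document would receive that class and form a single section; a `None` result leaves the document on the existing LLM path.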
610+
611+
### Page Content Regex (Multi-Modal Page-Level Classification)
612+
613+
For multi-class configurations using page-level classification, you can use page content regex to classify individual pages based on text patterns:
614+
615+
```yaml
classification:
  classificationMethod: multimodalPageLevelClassification

classes:
  - name: Invoice
    description: "Business invoice document"
    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
    attributes:
      - name: InvoiceNumber
        description: "Invoice number"
        attributeType: simple
  - name: Payslip
    description: "Employee wage statement"
    document_page_content_regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
    attributes:
      - name: EmployeeName
        description: "Employee name"
        attributeType: simple
  - name: Other
    description: "Documents that don't match specific patterns"
    # No regex - will always use LLM
    attributes: []
```
639+
640+
**Benefits:**
641+
- **Selective Performance Gains**: Pages matching patterns are classified instantly
642+
- **Mixed Processing**: Some pages use regex, others fall back to LLM
643+
- **Cost Optimization**: Reduced token usage proportional to regex matches
644+
- **Maintained Accuracy**: LLM fallback ensures all pages are properly classified
645+
646+
**How it works:**
647+
- Each page's text content is checked against all class regex patterns
648+
- First matching pattern wins and classifies the page instantly
649+
- Pages with no matches use standard LLM classification
650+
- Results are seamlessly integrated into document sections
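
The first-match-wins flow with LLM fallback can be sketched as follows. The function `classify_pages`, the result dictionary shape, and the `llm_classify` callback are assumptions made for illustration; the actual service may structure this differently.

```python
import re
from typing import Callable, Dict, List


def classify_pages(
    page_texts: Dict[str, str],
    classes: List[Dict],
    llm_classify: Callable[[str], str],
) -> Dict[str, Dict]:
    """Classify each page by regex first, falling back to the LLM callback."""
    results = {}
    for page_id, text in page_texts.items():
        matched = None
        for cls in classes:  # classes are checked in configuration order
            pattern = cls.get("document_page_content_regex")
            if pattern and re.search(pattern, text):
                matched = cls["name"]
                break  # first matching pattern wins
        if matched is not None:
            results[page_id] = {"class": matched, "regex_matched": True}
        else:
            results[page_id] = {"class": llm_classify(text), "regex_matched": False}
    return results


classes = [
    {"name": "Invoice", "document_page_content_regex": r"(?i)(invoice\s+number|bill\s+to|amount\s+due)"},
    {"name": "Payslip", "document_page_content_regex": r"(?i)(gross\s+pay|net\s+pay|employee\s+id)"},
    {"name": "Other"},  # no regex, always handled by the LLM
]
pages = {
    "1": "Invoice Number: 1234\nBill To: ACME",
    "2": "Gross   Pay: 5,000.00",
    "3": "Dear customer, thank you for your letter.",
}
# Stub standing in for a real LLM call
results = classify_pages(pages, classes, llm_classify=lambda text: "Other")
```

Because classes are checked in order, overlapping patterns resolve in favour of the class listed first, so the ordering of class definitions matters under this assumption.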
651+
652+
### Regex Pattern Best Practices
653+
654+
1. **Case-Insensitive Matching**: Always use `(?i)` flag
655+
```regex
(?i).*(invoice|bill).* # Matches any case variation
```
658+
659+
2. **Flexible Whitespace**: Use `\s+` (written as `\\s+` inside a double-quoted YAML string) to match runs of spaces or tabs
660+
```regex
(?i)(gross\\s+pay|net\\s+pay) # Handles "gross pay" and "gross   pay" (any run of whitespace)
```
663+
664+
3. **Multiple Alternatives**: Use `|` for different terms
665+
```regex
(?i).*(payslip|paystub|salary|wage).* # Any of these terms
```
668+
669+
4. **Balanced Specificity**: Specific enough to avoid false matches
670+
```regex
# Good: Specific to W2 forms
(?i)(form\\s+w-?2|wage\\s+and\\s+tax|employer\\s+identification)

# Too broad: Could match many documents
(?i)(form|wage|tax)
```
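
A quick way to check specificity is to run candidate patterns against both a target page and an unrelated page. The sample texts below are invented for illustration. Note that Python raw strings use a single backslash (`\s+`), whereas the double-quoted YAML configuration needs `\\s+`.

```python
import re

# Patterns from best practice 4, written as Python raw strings
specific = re.compile(r"(?i)(form\s+w-?2|wage\s+and\s+tax|employer\s+identification)")
broad = re.compile(r"(?i)(form|wage|tax)")

# Invented sample page texts
w2_page = "Form W-2  Wage and Tax Statement 2023"
rental_page = "This rental application form must include sales tax where applicable."

for label, pattern in (("specific", specific), ("broad", broad)):
    print(label, bool(pattern.search(w2_page)), bool(pattern.search(rental_page)))
```

The specific pattern matches only the W-2 page, while the broad pattern also matches the rental application, which is the kind of false match the guideline warns about.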
677+
678+
### Performance Analysis
679+
680+
Use `notebooks/examples/step2_classification_with_regex.ipynb` to:
681+
- Test regex patterns against your documents
682+
- Compare processing speeds (regex vs LLM)
683+
- Analyze cost savings through token usage reduction
684+
- Validate classification accuracy
685+
- Debug pattern matching behavior
686+
687+
### Error Handling
688+
689+
The regex system includes robust error handling:
690+
- **Invalid Patterns**: Compilation errors are logged and the system falls back to LLM classification
691+
- **Runtime Failures**: Pattern matching errors default to LLM classification
692+
- **Graceful Degradation**: The service keeps working even when a configured regex is invalid
693+
- **Comprehensive Logging**: Detailed logs for debugging pattern issues
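
The fallback behaviour for invalid patterns can be sketched like this. The helper `compile_optional` and its log message are hypothetical; the sketch only shows the general technique of catching `re.error` at compile time so that a bad pattern disables regex matching for that class instead of failing the service.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def compile_optional(pattern: Optional[str], class_name: str) -> Optional[re.Pattern]:
    """Compile a configured pattern, returning None when it is missing or invalid."""
    if not pattern:
        return None  # no regex configured for this class
    try:
        return re.compile(pattern)
    except re.error as exc:
        # Log and continue: the class simply has no regex, so the LLM path is used
        logger.error("Invalid regex for class %s: %s", class_name, exc)
        return None


valid = compile_optional(r"(?i)invoice\s+number", "Invoice")
invalid = compile_optional("(unclosed", "Broken")
missing = compile_optional(None, "Other")
```

A caller would treat a `None` result the same way as a class with no regex configured.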
694+
695+
### Configuration Examples
696+
697+
**Common Document Types:**
698+
```yaml
classes:
  # W2 Tax Forms
  - name: W2
    document_page_content_regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax|social\\s+security)"

  # Bank Statements
  - name: Bank-Statement
    document_page_content_regex: "(?i)(account\\s+number|statement\\s+period|beginning\\s+balance)"

  # Driver Licenses
  - name: US-drivers-licenses
    document_page_content_regex: "(?i)(driver\\s+license|state\\s+id|date\\s+of\\s+birth)"

  # Invoices
  - name: Invoice
    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|remit\\s+payment)"
```
716+
580717
## Best Practices for Classification
581718

582719
1. **Provide Clear Class Descriptions**: Include distinctive features and common elements
@@ -595,3 +732,7 @@ The classification service uses the new `extract_structured_data_from_text()` fu
595732
14. **Test Segmentation Logic**: Verify that documents are correctly separated by reviewing section boundaries in the results
596733
15. **Consider Document Flow**: Ensure your document classes account for typical document structures (headers, body, footers)
597734
16. **Leverage BIO-like Tagging**: Take advantage of the automatic boundary detection to eliminate manual document splitting
735+
17. **Use Regex for Known Patterns**: Add regex patterns for document types with predictable content or naming conventions
736+
18. **Test Regex Thoroughly**: Validate regex patterns against diverse document samples before production use
737+
19. **Balance Regex Specificity**: Make patterns specific enough to avoid false matches but flexible enough to catch variations
738+
20. **Monitor Regex Performance**: Track how often regex patterns match versus how often classification falls back to the LLM

lib/idp_common_pkg/idp_common/classification/README.md

Lines changed: 163 additions & 0 deletions
@@ -10,6 +10,9 @@ This module provides document classification capabilities for the IDP Accelerato
1010
- Classification of documents using multiple backend options:
1111
- Amazon Bedrock LLMs
1212
- SageMaker UDOP models
13+
- **Optional regex-based classification for enhanced performance**
14+
- Document name regex matching when all pages should be classified as the same class
15+
- Page content regex matching for multi-modal page-level classification
1316
- Direct integration with the Document data model
1417
- Support for both text and image content
1518
- Concurrent processing of multiple pages
@@ -53,6 +56,166 @@ Page 6: type="invoice", boundary="continue" → Section 3 (Invoice #2)
5356

5457
The system automatically creates three sections, properly separating the two invoices despite them having the same document type.
5558

59+
## Regex-Based Classification for Enhanced Performance
60+
61+
The classification service now supports optional regex-based pattern matching to provide significant performance improvements and deterministic classification for known document patterns. This feature enables instant classification without LLM API calls when regex patterns match.
62+
63+
### Document Name Regex Classification
64+
65+
When you want all pages of a document to be classified the same way, document name regex patterns can instantly classify entire documents based on their filename or ID:
66+
67+
```yaml
classes:
  - name: Payslip
    description: "Employee wage statement showing earnings and deductions"
    document_name_regex: "(?i).*(payslip|paystub|salary|wage).*"
    attributes:
      - name: EmployeeName
        description: "Name of the employee"
        attributeType: simple
```
77+
78+
**How it works:**
79+
- Works with any number of document classes defined in configuration
80+
- When document ID matches the regex pattern, all pages are classified as that class
81+
- Skips all LLM processing for massive performance gains
82+
- Provides info-level logging when matches occur
83+
84+
### Page Content Regex Classification
85+
86+
For multi-modal page-level classification, page content regex patterns can classify individual pages based on text content:
87+
88+
```yaml
classes:
  - name: Invoice
    description: "Business invoice document"
    document_page_content_regex: "(?i)(invoice\\s+number|bill\\s+to|amount\\s+due)"
    attributes:
      - name: InvoiceNumber
        description: "Invoice number"
        attributeType: simple
  - name: Payslip
    description: "Employee wage statement"
    document_page_content_regex: "(?i)(gross\\s+pay|net\\s+pay|employee\\s+id)"
    attributes:
      - name: EmployeeName
        description: "Employee name"
        attributeType: simple
```
105+
106+
**How it works:**
107+
- Applies only to the multi-modal page-level classification method
108+
- Each page's text content is checked against all class regex patterns
109+
- First matching pattern wins and classifies the page instantly
110+
- Falls back to LLM classification when no patterns match
111+
- Provides info-level logging when matches occur
112+
113+
### Configuration Options
114+
115+
Both regex types are optional and can be used together:
116+
117+
```yaml
classes:
  - name: W2-Form
    description: "W2 tax form with wage and tax information"
    # Both regex types can be specified
    document_name_regex: "(?i).*w-?2.*"  # For single-class scenarios
    document_page_content_regex: "(?i)(form\\s+w-?2|wage\\s+and\\s+tax)"  # For page-level
    attributes:
      - name: EmployerEIN
        description: "Employer identification number"
        attributeType: simple
```
129+
130+
### Performance Benefits
131+
132+
**Speed Improvements:**
133+
- Regex matching is nearly instantaneous compared to LLM calls
134+
- Document name regex: ~100-1000x faster (entire document classified instantly)
135+
- Page content regex: ~10-50x faster per matched page
136+
137+
**Cost Savings:**
138+
- Zero token usage for regex-matched classifications
139+
- No Bedrock/SageMaker API calls for matched patterns
140+
- Significant cost reduction for documents with recognizable patterns
141+
142+
**Deterministic Results:**
143+
- Consistent classification results for pattern-matched documents
144+
- Eliminates LLM variability for known document types
145+
- Reliable classification for high-volume processing scenarios
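
The speed-up ratios above depend on model latency and page size, so it is worth measuring on your own documents. The sketch below times only the regex side; the helper name `mean_match_seconds` and the sample page text are hypothetical, and LLM latency would need to be measured separately for a comparison.

```python
import re
import time


def mean_match_seconds(pattern: str, text: str, repeats: int = 10_000) -> float:
    """Average wall-clock seconds for one regex search over `text`."""
    compiled = re.compile(pattern)
    start = time.perf_counter()
    for _ in range(repeats):
        compiled.search(text)
    return (time.perf_counter() - start) / repeats


# Invented page text, repeated to approximate a full page of OCR output
page_text = "ACME Corp\nInvoice Number: 12345\nBill To: Example Ltd\n" * 20
regex_seconds = mean_match_seconds(
    r"(?i)(invoice\s+number|bill\s+to|amount\s+due)", page_text
)
print(f"regex search: {regex_seconds * 1e6:.2f} microseconds per page")
```

Dividing a measured per-page LLM latency by this figure gives a ratio specific to your workload.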
146+
147+
### Best Practices for Regex Patterns
148+
149+
1. **Case-Insensitive Matching**: Use `(?i)` flag for robust matching
150+
```regex
(?i).*(invoice|bill).* # Matches "Invoice", "INVOICE", "bill", "BILL"
```
153+
154+
2. **Flexible Whitespace**: Use `\s+` (written as `\\s+` inside a double-quoted YAML string) for varying whitespace
155+
```regex
(?i)(gross\\s+pay|net\\s+pay) # Matches "gross pay", "gross   pay", "GROSS PAY"
```
158+
159+
3. **Multiple Alternatives**: Use `|` for different possible terms
160+
```regex
(?i).*(payslip|paystub|salary|wage).* # Matches any of these terms
```
163+
164+
4. **Specific Enough**: Balance specificity to avoid false matches
165+
```regex
# Good: Specific to payslips
(?i)(gross\\s+pay|employee\\s+id|pay\\s+period)

# Too broad: Could match many document types
(?i)(pay|id|period)
```
172+
173+
### Error Handling
174+
175+
The regex system includes comprehensive error handling:
176+
177+
- **Compilation Errors**: Invalid regex patterns are logged and ignored; classification falls back to the LLM
178+
- **Runtime Errors**: Regex matching failures fall back to standard classification
179+
- **Graceful Degradation**: System continues to work normally even with invalid patterns
180+
- **Detailed Logging**: Debug and error logs help with pattern troubleshooting
181+
182+
### Integration Example
183+
184+
```python
from idp_common import classification, get_config
from idp_common.models import Document

# Load configuration with regex patterns
config = get_config()

# Initialize service - regex patterns are automatically used
service = classification.ClassificationService(
    region="us-east-1",
    config=config,
    backend="bedrock"
)

# `document` is assumed to be an existing Document whose pages already
# contain text; classification automatically uses regex when patterns match
document = service.classify_document(document)

# Check if regex was used
for page_id, page in document.pages.items():
    metadata = getattr(page, 'metadata', {})
    if metadata.get('regex_matched', False):
        print(f"Page {page_id} was classified using regex patterns")
    else:
        print(f"Page {page_id} was classified using LLM")
```
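
To track how often regex short-circuits the LLM, the per-page check above can be folded into a small summary. This sketch relies on the `regex_matched` metadata key used in the example; the helper `summarize_regex_usage` is hypothetical, and `SimpleNamespace` objects stand in for real pages so the snippet runs without the `idp_common` package.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Dict


def summarize_regex_usage(pages: Dict[str, object]) -> Dict[str, float]:
    """Count pages classified by regex versus LLM and compute the regex hit rate."""
    counts = Counter()
    for page in pages.values():
        metadata = getattr(page, "metadata", None) or {}
        counts["regex" if metadata.get("regex_matched", False) else "llm"] += 1
    total = counts["regex"] + counts["llm"]
    return {
        "regex": counts["regex"],
        "llm": counts["llm"],
        "regex_rate": counts["regex"] / total if total else 0.0,
    }


# Stand-in pages; with the real service, pass document.pages instead
pages = {
    "1": SimpleNamespace(metadata={"regex_matched": True}),
    "2": SimpleNamespace(metadata={}),
    "3": SimpleNamespace(),  # no metadata attribute at all
}
summary = summarize_regex_usage(pages)
print(summary)
```

A low `regex_rate` on documents you expected to match suggests the patterns are too narrow for the text actually extracted from the pages.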
209+
210+
### Demonstration Notebook
211+
212+
See `notebooks/examples/step2_classification_with_regex.ipynb` for interactive demonstrations of:
213+
- Document name regex classification
214+
- Page content regex classification
215+
- Performance comparisons between regex and LLM methods
216+
- Configuration examples and best practices
217+
- Error handling scenarios
218+
56219
## Usage Example
57220

58221
### Using with Bedrock LLMs (Default)
