Merge branch 'feature/classification-regex' into 'develop'
Add regex-based classification for enhanced performance and cost optimization
See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!300
CHANGELOG.md (12 additions & 0 deletions)
@@ -6,6 +6,7 @@ SPDX-License-Identifier: MIT-0
 ## [Unreleased]
 
 ### Added
+
 - **Intelligent Document Discovery Module for Automated Configuration Generation**
   - Added Discovery module that automatically analyzes document samples to identify structure, field types, and organizational patterns
   - **Pattern-Neutral Design**: Works across all processing patterns (1, 2, 3) with unified discovery process and pattern-specific implementations
@@ -17,6 +18,17 @@ SPDX-License-Identifier: MIT-0
   - **Use Cases**: New document exploration, configuration improvement, rapid prototyping, and document understanding
   - **Documentation**: Guide in `docs/discovery.md` with architecture details, best practices, and troubleshooting
 
+- **Optional Pattern-2 Regex-Based Classification for Enhanced Performance**
+  - Added support for optional regex patterns in document class definitions for performance optimization
+  - **Document Name Regex**: Match against document ID/name to classify all pages without LLM processing when all pages should be the same class
+  - **Document Page Content Regex**: Match against page text content during multi-modal page-level classification for fast page classification
+  - **Key Benefits**: Significant performance improvements and cost savings by bypassing LLM calls for pattern-matched documents, deterministic classification results for known document patterns, seamless fallback to existing LLM classification when regex patterns don't match
+  - **Configuration**: Optional `document_name_regex` and `document_page_content_regex` fields in class definitions with automatic regex compilation and validation
+  - **Logging**: Comprehensive info-level logging when regex patterns match for observability and debugging
+  - **CloudFormation Integration**: Updated Pattern-2 schema to support regex configuration through the Web UI
+  - **Demonstration**: New `step2_classification_with_regex.ipynb` notebook showcasing regex configuration and performance comparisons
+  - **Documentation**: Enhanced classification module README and main documentation with regex usage examples and best practices
+
 - **Windows WSL Development Environment Setup Guide**
   - Added WSL-based development environment setup guide for Windows developers in `docs/setup-development-env-WSL.md`
   - **Key Features**: Automated setup script (`wsl_setup.sh`) for quick installation of Git, Python, Node.js, AWS CLI, and SAM CLI
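
Note on the regex classification entry above: the changelog describes a three-step order (document name regex first, then page content regex, then fallback to the existing LLM classification). The following is a minimal Python sketch of that order, not the accelerator's actual code. Only the field names `document_name_regex` and `document_page_content_regex` come from this commit; `DocClass`, `compile_or_none`, `classify_document`, and `llm_classify_page` are hypothetical names, and the use of `re.search` and the choice to ignore invalid patterns are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DocClass:
    """Hypothetical stand-in for one entry under `classes:` in the configuration."""
    name: str
    document_name_regex: Optional[str] = None
    document_page_content_regex: Optional[str] = None


def compile_or_none(pattern: Optional[str]) -> Optional[re.Pattern]:
    """Compile a pattern; a missing or invalid one counts as 'not configured' (assumption)."""
    if not pattern:
        return None
    try:
        return re.compile(pattern)
    except re.error:
        return None


def classify_document(
    doc_name: str,
    pages: List[str],
    classes: List[DocClass],
    llm_classify_page: Callable[[str], str],
) -> List[str]:
    """Return one class name per page, calling the LLM only when no regex matches."""
    compiled = [
        (
            cls.name,
            compile_or_none(cls.document_name_regex),
            compile_or_none(cls.document_page_content_regex),
        )
        for cls in classes
    ]

    # Step 1: a document name match classifies every page at once, with no LLM call.
    for name, name_rx, _ in compiled:
        if name_rx and name_rx.search(doc_name):
            return [name] * len(pages)

    # Step 2: check each page's text; Step 3: fall back to the LLM for unmatched pages.
    results = []
    for text in pages:
        matched = next(
            (name for name, _, page_rx in compiled if page_rx and page_rx.search(text)),
            None,
        )
        results.append(matched if matched is not None else llm_classify_page(text))
    return results


# Usage with a stub in place of the real LLM call.
classes = [
    DocClass("Payslip", document_name_regex=r"(?i)payslip"),
    DocClass("Invoice", document_page_content_regex=r"(?i)invoice\s+number"),
]
print(classify_document("Payslip_May.pdf", ["p1", "p2"], classes, lambda t: "Other"))
# → ['Payslip', 'Payslip']
print(classify_document("scan_001.pdf", ["Invoice Number: 42", "misc"], classes, lambda t: "Other"))
# → ['Invoice', 'Other']
```

Compiling every pattern once up front mirrors the "automatic regex compilation and validation" wording in the entry; whether the real service rejects, logs, or skips an invalid pattern is not stated in this diff.
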
docs/classification.md (141 additions & 0 deletions)
@@ -577,6 +577,143 @@ The classification service uses the new `extract_structured_data_from_text()` fu
 - Handles malformed content gracefully
 - Returns both parsed data and detected format for logging
 
+## Regex-Based Classification for Performance Optimization
+
+Pattern 2 now supports optional regex-based classification that can provide significant performance improvements and cost savings by bypassing LLM calls when document patterns are recognized.
+
+### Document Name Regex (All Pages Same Class)
+
+When you want all pages of a document to be classified as the same class, you can use document name regex to instantly classify entire documents based on their filename or ID:
+
+```yaml
+classes:
+  - name: Payslip
+    description: "Employee wage statement showing earnings and deductions"

 The system automatically creates three sections, properly separating the two invoices despite them having the same document type.
 
+## Regex-Based Classification for Enhanced Performance
+
+The classification service now supports optional regex-based pattern matching to provide significant performance improvements and deterministic classification for known document patterns. This feature enables instant classification without LLM API calls when regex patterns match.
+
+### Document Name Regex Classification
+
+When you want all pages of a document to be classified the same way, document name regex patterns can instantly classify entire documents based on their filename or ID:
+
+```yaml
+classes:
+  - name: Payslip
+    description: "Employee wage statement showing earnings and deductions"