Commit 6d13626

author
Bob Strahan
committed
v0.2.17
1 parent 9979b20 commit 6d13626

89 files changed

+12530
-4560
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 build.toml
 model.tar.gz
 .checksum
+.checksums/
 .vscode/
 .DS_Store
 dist/

CHANGELOG.md

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# Changelog


## [0.2.17]

### Enhanced Textract OCR Features
- Added support for Textract advanced features (TABLES, FORMS, SIGNATURES, LAYOUT)
- OCR results now output in rich markdown format for better visualization
- Configurable OCR feature selection through schema configuration
- Improved metering and tracking for different Textract feature combinations

## [0.2.16]

### Add additional model choice
- Claude, Nova, Meta, and DeepSeek model selection now available

### New Document-Based Architecture

The `idp_common_pkg` introduces a unified Document model approach for consistent document processing:

#### Core Classes
- **Document**: Central data model that tracks document state through the entire processing pipeline
- **Page**: Represents individual document pages with OCR results and classification
- **Section**: Represents logical document sections with classification and extraction results

#### Service Classes
- **OcrService**: Processes documents with AWS Textract and updates the Document with OCR results
- **ClassificationService**: Classifies document pages/sections using Bedrock or SageMaker backends
- **ExtractionService**: Extracts structured information from document sections using Bedrock

### Pattern Implementation Updates
- Lambda functions refactored, and significantly simplified, to use Document and Section objects, and new Service classes

### Key Benefits

1. **Simplified Integration**: Consistent interfaces make service integration straightforward
2. **Improved Maintainability**: Unified data model reduces code duplication and complexity
3. **Better Error Handling**: Standardized approach to error capture and reporting
4. **Enhanced Traceability**: Complete document history throughout the processing pipeline
5. **Flexible Backend Support**: Easy switching between Bedrock and SageMaker backends
6. **Optimized Resource Usage**: Focused document processing for better performance
7. **Granular Package Installation**: Install only required components with extras syntax

### Example Notebook

A new comprehensive Jupyter notebook demonstrates the Document-based workflow:
- Shows complete end-to-end processing (OCR → Classification → Extraction)
- Uses AWS services (S3, Textract, Bedrock)
- Demonstrates Document object creation and manipulation
- Showcases how to access and utilize extraction results
- Provides a template for custom implementations
- Includes granular package installation examples (`pip install "idp_common_pkg[ocr,classification,extraction]"`)

This refactoring sets the foundation for more maintainable, extensible document processing workflows with clearer data flow and easier troubleshooting.

### Refactored publish.sh script
- improved modularity with functions
- improved checksum logic to determine when to rebuild components
README.md

Lines changed: 7 additions & 5 deletions
@@ -242,15 +242,17 @@ Navigate into the project root directory and, in a bash shell, run:
 
 **This completes the preparation stage of the installation process. The process now proceeds to the Cloudformation stack installation stage.**
 
-When completed, it displays the CloudFormation templates S3 URLs, 1-click URLs for launching the stack creation in CloudFormation console, and a command to deploy from the CLI:
+When completed, it displays the CloudFormation template's S3 URL and a 1-click URL for launching the stack creation in the CloudFormation console:
 ```
 OUTPUTS
-Template URL: https://s3.us-east-1.amazonaws.com/bobs-artifacts-us-east-1/transflo-idp/packaged.yaml
-CF Launch URL: https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?templateURL=https://s3.us-east-1.amazonaws.com/bobs-artifacts-us-east-1/transflo-idp/packaged.yaml&stackName=IDP
-CLI Deploy: aws cloudformation deploy --region us-east-1 --template-file /tmp/1132557/packaged.yaml --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND --stack-name <your_stack_name>>
+Template URL: https://s3.<region>.amazonaws.com/<cfn_bucket_basename>-<region>/<cfn_prefix>/packaged.yaml
+1-Click Launch URL: https://<region>.console.aws.amazon.com/cloudformation/home?region=<region>#/stacks/create/review?templateURL=https://s3.<region>.amazonaws.com/<cfn_bucket_basename>-<region>/<cfn_prefix>/packaged.yaml&stackName=IDP
 Done
 ```
 
+**Recommended: Deploy using the AWS CloudFormation console.**
+For your first-time deployment, log in to your AWS account and then use the `1-Click Launch URL` to create a new stack with CloudFormation. It's easier to inspect the available parameter options using the console initially. The CLI option below is better suited for scripted / automated deployments, and requires that you already know the right parameter values to use.
+
 ```bash
 # To install from the CLI the `CLI Deploy` command will be similar to the following:
 aws cloudformation deploy \
@@ -267,7 +269,7 @@ aws cloudformation deploy \
 * `<the-pattern-name-here>` should be one of the valid pattern names encased in quotes. (Each pattern may have their own required parameter overrides, see README documentation for details.)
 * `Pattern3 - Packet processing with Textract, SageMaker(UDOP), and Bedrock`
 * `Pattern2 - Packet processing with Textract and Bedrock`
-* (This is a great pattern to start with to try out the solution because it has not further dependencies.)
+* (This is a great pattern to start with to try out the solution because it has no further dependencies.)
 * `Pattern1 - Packet or Media processing with Bedrock Data Automation (BDA)`
 
 After you have deployed the stack, check the Outputs tab to inspect names and links to the dashboards, buckets, workflows and other solution resources.

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.2.14
+0.2.17

lib/idp_common_pkg/README.md

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# IDP Common Package

This package contains common utilities and services for the GenAI IDP Accelerator patterns.

## Components

### Core Data Model

- **Document Model**: Central data structure for the entire IDP pipeline ([models.py](idp_common/models.py))

### Core Services

- **OCR**: Document OCR processing with AWS Textract ([README](idp_common/ocr/README.md))
- **Classification**: Document classification using LLMs and SageMaker/UDOP ([README](idp_common/classification/README.md))
- **Extraction**: Field extraction from documents using LLMs ([README](idp_common/extraction/README.md))

### AWS Service Clients

- Bedrock client with retry logic
- S3 client operations
- CloudWatch metrics

### Configuration

- DynamoDB-based configuration management
- Support for default and custom configuration merging

### Image Processing

- Image resizing and preparation
- Support for multimodal inference with Bedrock

### Utils

- Retry/backoff algorithm
- S3 URI parsing
- Metering data aggregation

## Unified Document-based Architecture

All core services (OCR, Classification, and Extraction) have been refactored to use a unified Document model approach:

```python
from idp_common import get_config
from idp_common.models import Document
from idp_common import ocr, classification, extraction

# Initialize document
document = Document(
    id="doc-123",
    input_bucket="my-input-bucket",
    input_key="documents/sample.pdf",
    output_bucket="my-output-bucket"
)

# Get configuration
config = get_config()

# Process with OCR
ocr_service = ocr.OcrService(config=config)
document = ocr_service.process_document(document)

# Perform classification (supports both Bedrock and SageMaker/UDOP backends)
classification_service = classification.ClassificationService(
    config=config,
    backend="bedrock"  # or "sagemaker" for SageMaker UDOP model
)
document = classification_service.classify_document(document)

# Extract information from a section
extraction_service = extraction.ExtractionService(config=config)
document = extraction_service.process_document_section(
    document=document,
    section_id=document.sections[0].section_id
)

# Access the extraction results URI
result_uri = document.sections[0].extraction_result_uri
```

## Service Modules

### Document Model (`models.py`)

The central data model for the IDP processing pipeline:
- Represents the state of a document as it moves through processing
- Tracks pages, sections, processing status, and results
- Common data structure shared between all services

### OCR Service (`ocr`)

Provides OCR processing of documents using AWS Textract:
- Document-based OCR processing with the `process_document()` method
- Multi-page document processing with thread concurrency
- Image extraction and optimization
- Support for enhanced Textract features (TABLES, FORMS, SIGNATURES, LAYOUT) with granular control
- Rich markdown output for tables and forms preservation
- Well-structured results for downstream processing

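The multi-page thread concurrency listed for the OCR service can be pictured with the sketch below. It is not the package's implementation: `ocr_page` is a hypothetical stand-in for a per-page Textract call, and the default worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_number: int, image_bytes: bytes) -> dict:
    """Hypothetical per-page OCR step; a real one would call Textract."""
    return {"page": page_number, "text": f"{len(image_bytes)} bytes processed"}


def ocr_all_pages(pages: dict[int, bytes], max_workers: int = 4) -> dict[int, dict]:
    """Run ocr_page for every page in a thread pool, keyed by page number."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {num: pool.submit(ocr_page, num, data) for num, data in pages.items()}
        # result() blocks until the page is done and re-raises worker exceptions.
        return {num: fut.result() for num, fut in futures.items()}


results = ocr_all_pages({1: b"abc", 2: b"defgh"})
```

Threads suit this kind of work because each page spends most of its time waiting on a network call, so the calls can overlap.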
### Classification Service (`classification`)

Document classification using multimodal LLMs:
- Document-based classification with the `classify_document()` method
- Support for both Bedrock and SageMaker backends
- Page-level and document-level classification
- Section detection for multi-class documents
- Configurable document types and descriptions
- Multimodal classification with both text and images

### Extraction Service (`extraction`)

Field extraction from documents using multimodal LLMs:
- Document-based extraction with the `process_document_section()` method
- Extraction of structured data from document sections
- Support for document class-specific attribute definitions
- Multimodal extraction using both text and images
- Flexible prompt templates configurable via the configuration system
- Results stored in S3 with URIs tracked in the Document model

## Basic Usage

```python
from idp_common import (
    bedrock,         # Bedrock client and operations
    s3,              # S3 operations
    metrics,         # CloudWatch metrics
    image,           # Image processing
    utils,           # General utilities
    config,          # Configuration module
    get_config,      # Direct access to the configuration function
    ocr,             # OCR service and models
    classification,  # Classification service and models
    extraction       # Extraction service and models
)
from idp_common.models import Document, Status

# Get configuration (merged from Default and Custom records in the DynamoDB Configuration Table)
cfg = get_config()

# Create a document object
document = Document(
    input_bucket="my-bucket",
    input_key="my-document.pdf",
    output_bucket="output-bucket"
)

# OCR Processing
ocr_service = ocr.OcrService()  # Basic text detection
# ocr_service = ocr.OcrService(enhanced_features=["TABLES", "FORMS"])  # Enhanced features
document = ocr_service.process_document(document)

# Document Classification (choose your backend)
classification_service = classification.ClassificationService(
    config=cfg,
    backend="bedrock"  # or "sagemaker" for UDOP model
)
document = classification_service.classify_document(document)

# Field Extraction for a section
extraction_service = extraction.ExtractionService(config=cfg)
document = extraction_service.process_document_section(document, section_id="section-1")

# Publish a metric
metrics.put_metric("MetricName", 1)

# Invoke Bedrock
response = bedrock.invoke_model(...)

# Read from S3
content = s3.get_text_content("s3://bucket/key.json")

# Process an image for model input
image_bytes = image.prepare_image("s3://bucket/image.jpg")

# Parse S3 URI
bucket, key = utils.parse_s3_uri("s3://bucket/key")
```

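For readers curious what S3 URI parsing involves, here is a minimal standalone sketch. It is an illustration only and may differ from the package's own `utils.parse_s3_uri`; the error handling in particular is an assumption.

```python
def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Sketch only, not the package's code."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the object key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key


bucket, key = parse_s3_uri("s3://my-bucket/documents/sample.pdf")
```

Plain string splitting is used instead of a URL parser so that keys containing characters such as `?` or `#` are kept intact.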
## Configuration

The configuration module provides a way to retrieve and merge configuration from DynamoDB. It expects:

1. A DynamoDB table with a primary key named 'Configuration'
2. Two configuration items with keys 'Default' and 'Custom'

The `get_config()` function retrieves both configurations and merges them, with custom values taking precedence over default ones.

```python
from idp_common import get_config

# Get configuration with default table name from CONFIGURATION_TABLE_NAME environment variable
config = get_config()

# Or specify a table name explicitly
config = get_config(table_name="my-config-table")
```

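The "custom over default" precedence can be illustrated with a small standalone merge function. This is a sketch under assumptions: the text does not say whether `get_config()` merges nested values recursively, and the configuration keys below are invented for the example.

```python
import copy


def merge_config(default: dict, custom: dict) -> dict:
    """Return default overlaid with custom; nested dicts are merged recursively (assumed)."""
    merged = copy.deepcopy(default)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            # Custom scalars and lists replace the default value outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical configuration records, standing in for the 'Default' and 'Custom' items.
default_cfg = {"ocr": {"features": ["TABLES"]}, "classification": {"model": "model-a", "temperature": 0}}
custom_cfg = {"classification": {"model": "model-b"}}
merged_cfg = merge_config(default_cfg, custom_cfg)
```

Here `merged_cfg` keeps the default `ocr` settings and the default `temperature`, while `model` takes the custom value.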
## Installation with Granular Dependencies

To minimize Lambda package size, you can install only the specific components you need:

```bash
# Install core functionality only (minimal dependencies)
pip install "idp_common[core]"

# Install with OCR support
pip install "idp_common[ocr]"

# Install with classification support
pip install "idp_common[classification]"

# Install with extraction support
pip install "idp_common[extraction]"

# Install with image processing support
pip install "idp_common[image]"

# Install everything
pip install "idp_common[all]"

# Install multiple components
pip install "idp_common[ocr,classification]"
```

For Lambda functions, specify only the required components in requirements.txt:

```
../../lib/idp_common_pkg[extraction]
```

This ensures that only the necessary dependencies are included in your Lambda deployment package.

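The extras syntax works because a package declares named groups of optional dependencies. The sketch below shows the general shape of such a declaration for a setuptools-based package. The dependency lists are hypothetical placeholders, not this package's real requirements.

```python
# Hypothetical extras table for illustration only. The real dependency lists
# are defined in the package's own build configuration, not here.
EXTRAS = {
    "core": [],
    "image": ["Pillow"],
    "ocr": ["boto3", "Pillow"],
    "classification": ["boto3", "Pillow"],
    "extraction": ["boto3", "Pillow"],
}

# "all" is the de-duplicated union of every other extra.
EXTRAS["all"] = sorted({dep for deps in EXTRAS.values() for dep in deps})

# In a setuptools-based setup.py this mapping would be passed as:
#     setup(..., extras_require=EXTRAS)
```

With a table like this, `pip install "pkg[ocr]"` installs the base package plus only the `ocr` group.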
## Development Notes

This package has been refactored to use a unified Document-based approach across all services:

1. All services now accept and return Document objects
2. Each service updates the Document with its results
3. Results are properly encapsulated in the Document model
4. Large results (like extraction attributes) are stored in S3 with only URIs in the Document

Key benefits:
- Consistency across all services
- Simplified data flow in serverless functions
- Better resource usage with the focused document pattern
- Improved maintainability with standardized interfaces
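
Point 4 in the development notes, keeping large results out of the Document and tracking only their URIs, can be sketched as follows. The `Document` and `Section` classes here are simplified stand-ins, not the real models. The object key layout is hypothetical, and a plain dict plays the role of S3.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Section:
    """Simplified stand-in for the package's Section model."""
    section_id: str
    extraction_result_uri: Optional[str] = None


@dataclass
class Document:
    """Simplified stand-in for the package's Document model."""
    input_key: str
    output_bucket: str
    sections: list[Section] = field(default_factory=list)


def store_extraction_result(document: Document, section: Section,
                            attributes: dict, store: dict) -> Document:
    """Write attributes to `store` (an in-memory stand-in for S3) and record only the URI."""
    # Hypothetical key layout, chosen for the example.
    key = f"{document.input_key}/sections/{section.section_id}/result.json"
    uri = f"s3://{document.output_bucket}/{key}"
    store[uri] = json.dumps(attributes)
    section.extraction_result_uri = uri
    return document


fake_s3: dict[str, str] = {}
doc = Document(input_key="documents/sample.pdf", output_bucket="my-output-bucket",
               sections=[Section("section-1")])
doc = store_extraction_result(doc, doc.sections[0], {"invoice_number": "123"}, fake_s3)
```

The Document stays small enough to pass between serverless functions, and a consumer fetches the full result from the URI only when it needs it.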
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Use true lazy loading for all submodules
__version__ = "0.1.0"

# Cache for lazy-loaded submodules
_submodules = {}

def __getattr__(name):
    """Lazy load submodules only when accessed"""
    if name in ['bedrock', 's3', 'metrics', 'image', 'utils', 'config', 'ocr', 'classification', 'extraction', 'models']:
        if name not in _submodules:
            _submodules[name] = __import__(f"idp_common.{name}", fromlist=['*'])
        return _submodules[name]

    # Special handling for directly exposed functions
    if name == 'get_config':
        config = __getattr__('config')
        return config.get_config

    # Special handling for directly exposed classes
    if name in ['Document', 'Page', 'Section', 'Status']:
        models = __getattr__('models')
        return getattr(models, name)

    raise AttributeError(f"module 'idp_common' has no attribute '{name}'")

__all__ = [
    'bedrock', 's3', 'metrics', 'image', 'utils', 'config', 'ocr', 'classification', 'extraction', 'models',
    'get_config', 'Document', 'Page', 'Section', 'Status'
]
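
The file above relies on a module-level `__getattr__`, which Python calls only when normal attribute lookup on the module fails. The standalone sketch below reproduces the same idea outside a package, so it can be run anywhere. The `make_lazy_module` helper and the use of the standard-library `json` and `csv` modules as lazily loaded targets are assumptions made for the demonstration.

```python
import importlib
import types


def make_lazy_module(name: str, lazy_names: set[str]) -> types.ModuleType:
    """Build a module object whose listed attributes are imported on first access."""
    module = types.ModuleType(name)
    cache: dict[str, types.ModuleType] = {}

    def __getattr__(attr: str):
        # Only reached when normal attribute lookup on the module fails.
        if attr in lazy_names:
            if attr not in cache:
                cache[attr] = importlib.import_module(attr)
            return cache[attr]
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # Storing the hook in the module's namespace enables the fallback lookup.
    module.__getattr__ = __getattr__
    return module


demo = make_lazy_module("demo", {"json", "csv"})
encoded = demo.json.dumps({"ok": True})  # json is imported here, on first access
```

Deferring imports this way means a Lambda function that only touches one submodule does not pay the import cost, or need the dependencies, of the others.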
