| 1 | +# IDP Common Package |
| 2 | + |
| 3 | +This package contains common utilities and services for the GenAI IDP Accelerator patterns. |
| 4 | + |
| 5 | +## Components |
| 6 | + |
| 7 | +### Core Data Model |
| 8 | + |
| 9 | +- **Document Model**: Central data structure for the entire IDP pipeline ([models.py](idp_common/models.py)) |
| 10 | + |
| 11 | +### Core Services |
| 12 | + |
| 13 | +- **OCR**: Document OCR processing with AWS Textract ([README](idp_common/ocr/README.md)) |
| 14 | +- **Classification**: Document classification using LLMs and SageMaker/UDOP ([README](idp_common/classification/README.md)) |
| 15 | +- **Extraction**: Field extraction from documents using LLMs ([README](idp_common/extraction/README.md)) |
| 16 | + |
| 17 | +### AWS Service Clients |
| 18 | + |
| 19 | +- Bedrock client with retry logic |
| 20 | +- S3 client operations |
| 21 | +- CloudWatch metrics |
| 22 | + |
| 23 | +### Configuration |
| 24 | + |
| 25 | +- DynamoDB-based configuration management |
| 26 | +- Support for default and custom configuration merging |
| 27 | + |
| 28 | +### Image Processing |
| 29 | + |
| 30 | +- Image resizing and preparation |
| 31 | +- Support for multimodal inference with Bedrock |
| 32 | + |
| 33 | +### Utils |
| 34 | + |
| 35 | +- Retry/backoff algorithm |
| 36 | +- S3 URI parsing |
| 37 | +- Metering data aggregation |
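The retry/backoff utility listed above is not shown in this README. As a rough illustration of the general technique, here is a minimal sketch of exponential backoff with full jitter. The function name `retry_with_backoff` and all of its parameters are hypothetical and are not the package's actual API.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponential backoff and full jitter.

    Hypothetical helper for illustration only; not part of idp_common.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Double the ceiling on each attempt, cap it, then wait a random
            # amount below the ceiling so concurrent callers spread out.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real delays.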
| 38 | + |
| 39 | +## Unified Document-based Architecture |
| 40 | + |
| 41 | +All core services (OCR, Classification, and Extraction) have been refactored to use a unified Document model approach: |
| 42 | + |
| 43 | +```python |
| 44 | +from idp_common import get_config |
| 45 | +from idp_common.models import Document |
| 46 | +from idp_common import ocr, classification, extraction |
| 47 | + |
| 48 | +# Initialize document |
| 49 | +document = Document( |
| 50 | + id="doc-123", |
| 51 | + input_bucket="my-input-bucket", |
| 52 | + input_key="documents/sample.pdf", |
| 53 | + output_bucket="my-output-bucket" |
| 54 | +) |
| 55 | + |
| 56 | +# Get configuration |
| 57 | +config = get_config() |
| 58 | + |
| 59 | +# Process with OCR |
| 60 | +ocr_service = ocr.OcrService(config=config) |
| 61 | +document = ocr_service.process_document(document) |
| 62 | + |
| 63 | +# Perform classification (supports both Bedrock and SageMaker/UDOP backends) |
| 64 | +classification_service = classification.ClassificationService( |
| 65 | + config=config, |
| 66 | + backend="bedrock" # or "sagemaker" for SageMaker UDOP model |
| 67 | +) |
| 68 | +document = classification_service.classify_document(document) |
| 69 | + |
| 70 | +# Extract information from a section |
| 71 | +extraction_service = extraction.ExtractionService(config=config) |
| 72 | +document = extraction_service.process_document_section( |
| 73 | + document=document, |
| 74 | + section_id=document.sections[0].section_id |
| 75 | +) |
| 76 | + |
| 77 | +# Access the extraction results URI |
| 78 | +result_uri = document.sections[0].extraction_result_uri |
| 79 | +``` |
| 80 | + |
| 81 | +## Service Modules |
| 82 | + |
| 83 | +### Document Model (`models.py`) |
| 84 | + |
| 85 | +The central data model for the IDP processing pipeline: |
| 86 | +- Represents the state of a document as it moves through processing |
| 87 | +- Tracks pages, sections, processing status, and results |
| 88 | +- Common data structure shared between all services |
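To make the idea concrete, the following is a simplified sketch of what such a shared model could look like. It is not the contents of `models.py`: the class names `DocumentSketch`, `SectionSketch`, and `StatusSketch`, the status values, and the S3 key layout are all assumptions made for illustration. Only the field names `id`, `input_bucket`, `input_key`, `output_bucket`, `sections`, `section_id`, and `extraction_result_uri` come from the usage examples in this README.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class StatusSketch(Enum):
    # Hypothetical values; the real Status enum may define different members.
    QUEUED = "QUEUED"
    EXTRACTED = "EXTRACTED"


@dataclass
class SectionSketch:
    section_id: str
    extraction_result_uri: Optional[str] = None  # set once results are in S3


@dataclass
class DocumentSketch:
    input_bucket: str
    input_key: str
    output_bucket: str
    id: Optional[str] = None
    status: StatusSketch = StatusSketch.QUEUED
    sections: List[SectionSketch] = field(default_factory=list)


def fake_extraction_stage(document: DocumentSketch, section_id: str) -> DocumentSketch:
    """Stand-in for a service call: record where results live, not the results."""
    for section in document.sections:
        if section.section_id == section_id:
            # Hypothetical key layout, chosen only for this example.
            section.extraction_result_uri = (
                f"s3://{document.output_bucket}/{document.input_key}"
                f"/sections/{section_id}/result.json"
            )
    document.status = StatusSketch.EXTRACTED
    return document
```

Each stage takes the document, updates it, and returns it, so the same object can be passed from one service to the next.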
| 89 | + |
| 90 | +### OCR Service (`ocr`) |
| 91 | + |
| 92 | +Provides OCR processing of documents using AWS Textract: |
| 93 | +- Document-based OCR processing with the `process_document()` method |
| 94 | +- Multi-page document processing with thread concurrency |
| 95 | +- Image extraction and optimization |
| 96 | +- Support for enhanced Textract features (TABLES, FORMS, SIGNATURES, LAYOUT) with granular control |
| 97 | +- Rich markdown output that preserves tables and forms |
| 98 | +- Well-structured results for downstream processing |
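The multi-page concurrency mentioned above can be pictured with a thread pool that fans pages out to workers. This is a generic sketch, not the service's implementation: `process_pages` and the `ocr_page` callable are hypothetical, and a real worker would call Textract rather than the stand-in used in the usage note.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pages(page_ids, ocr_page, max_workers=4):
    """Run ocr_page(page_id) for every page concurrently.

    Returns a dict mapping each page id to its result. Hypothetical helper
    for illustration; not part of idp_common.
    """
    page_ids = list(page_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        results = list(pool.map(ocr_page, page_ids))
    return dict(zip(page_ids, results))
```

For example, `process_pages([1, 2, 3], lambda p: f"text for page {p}")` returns one entry per page. Threads suit this workload because each page spends most of its time waiting on network I/O.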
| 99 | + |
| 100 | +### Classification Service (`classification`) |
| 101 | + |
| 102 | +Document classification using multimodal LLMs: |
| 103 | +- Document-based classification with the `classify_document()` method |
| 104 | +- Support for both Bedrock and SageMaker backends |
| 105 | +- Page-level and document-level classification |
| 106 | +- Section detection for multi-class documents |
| 107 | +- Configurable document types and descriptions |
| 108 | +- Multimodal classification with both text and images |
| 109 | + |
| 110 | +### Extraction Service (`extraction`) |
| 111 | + |
| 112 | +Field extraction from documents using multimodal LLMs: |
| 113 | +- Document-based extraction with the `process_document_section()` method |
| 114 | +- Extraction of structured data from document sections |
| 115 | +- Support for document class-specific attribute definitions |
| 116 | +- Multimodal extraction using both text and images |
| 117 | +- Flexible prompt templates configurable via the configuration system |
| 118 | +- Results stored in S3 with URIs tracked in the Document model |
| 119 | + |
| 120 | +## Basic Usage |
| 121 | + |
| 122 | +```python |
| 123 | +from idp_common import ( |
| 124 | + bedrock, # Bedrock client and operations |
| 125 | + s3, # S3 operations |
| 126 | + metrics, # CloudWatch metrics |
| 127 | + image, # Image processing |
| 128 | + utils, # General utilities |
| 129 | + config, # Configuration module |
| 130 | + get_config, # Direct access to the configuration function |
| 131 | + ocr, # OCR service and models |
| 132 | + classification, # Classification service and models |
| 133 | + extraction # Extraction service and models |
| 134 | +) |
| 135 | +from idp_common.models import Document, Status |
| 136 | + |
| 137 | +# Get configuration (merged from the Default and Custom records in the DynamoDB configuration table) |
| 138 | +cfg = get_config() |
| 139 | + |
| 140 | +# Create a document object |
| 141 | +document = Document( |
| 142 | + input_bucket="my-bucket", |
| 143 | + input_key="my-document.pdf", |
| 144 | + output_bucket="output-bucket" |
| 145 | +) |
| 146 | + |
| 147 | +# OCR Processing |
| 148 | +ocr_service = ocr.OcrService() # Basic text detection |
| 149 | +# ocr_service = ocr.OcrService(enhanced_features=["TABLES", "FORMS"]) # Enhanced features |
| 150 | +document = ocr_service.process_document(document) |
| 151 | + |
| 152 | +# Document Classification (choose your backend) |
| 153 | +classification_service = classification.ClassificationService( |
| 154 | + config=cfg, |
| 155 | + backend="bedrock" # or "sagemaker" for UDOP model |
| 156 | +) |
| 157 | +document = classification_service.classify_document(document) |
| 158 | + |
| 159 | +# Field Extraction for a section |
| 160 | +extraction_service = extraction.ExtractionService(config=cfg) |
| 161 | +document = extraction_service.process_document_section(document, section_id=document.sections[0].section_id) |
| 162 | + |
| 163 | +# Publish a metric |
| 164 | +metrics.put_metric("MetricName", 1) |
| 165 | + |
| 166 | +# Invoke Bedrock |
| 167 | +response = bedrock.invoke_model(...) |
| 168 | + |
| 169 | +# Read from S3 |
| 170 | +content = s3.get_text_content("s3://bucket/key.json") |
| 171 | + |
| 172 | +# Process an image for model input |
| 173 | +image_bytes = image.prepare_image("s3://bucket/image.jpg") |
| 174 | + |
| 175 | +# Parse S3 URI |
| 176 | +bucket, key = utils.parse_s3_uri("s3://bucket/key") |
| 177 | +``` |
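The last call above splits an S3 URI into a bucket and a key. The snippet below sketches how such a split can be done with plain string operations. `parse_s3_uri_sketch` is a hypothetical stand-in written for this example; the behaviour of the real `utils.parse_s3_uri` (for instance, how it reports invalid input) may differ.

```python
def parse_s3_uri_sketch(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical stand-in for illustration; not the idp_common implementation.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # Everything up to the first '/' is the bucket; the rest is the key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"S3 URI has no bucket: {uri!r}")
    return bucket, key
```

Plain string splitting is used here instead of `urllib.parse.urlparse` because S3 keys may contain characters such as `?` and `#`, which a URL parser would treat as query or fragment separators.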
| 178 | + |
| 179 | +## Configuration |
| 180 | + |
| 181 | +The configuration module provides a way to retrieve and merge configuration from DynamoDB. It expects: |
| 182 | + |
| 183 | +1. A DynamoDB table with a primary key named 'Configuration' |
| 184 | +2. Two configuration items with keys 'Default' and 'Custom' |
| 185 | + |
| 186 | +The `get_config()` function retrieves both configurations and merges them, with custom values taking precedence over default ones. |
| 187 | + |
| 188 | +```python |
| 189 | +# Get configuration with default table name from CONFIGURATION_TABLE_NAME environment variable |
| 190 | +config = get_config() |
| 191 | + |
| 192 | +# Or specify a table name explicitly |
| 193 | +config = get_config(table_name="my-config-table") |
| 194 | +``` |
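The precedence rule ("custom values win over default ones") can be illustrated with a small merge function. This README does not say whether `get_config()` merges nested settings recursively or replaces top-level keys wholesale, so the sketch below assumes a recursive merge. The function name `merge_config` and the configuration keys in the usage note are hypothetical.

```python
from copy import deepcopy


def merge_config(default, custom):
    """Return a new dict where values from custom override those in default.

    Nested dicts are merged key by key (an assumption for this sketch);
    any other value in custom replaces the default value outright.
    """
    merged = deepcopy(default)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged
```

With `default = {"ocr": {"features": ["TABLES"], "workers": 4}}` and `custom = {"ocr": {"workers": 8}}`, the merged result keeps `features` from the default record and takes `workers` from the custom one. Neither input is modified.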
| 195 | + |
| 196 | +## Installation with Granular Dependencies |
| 197 | + |
| 198 | +To minimize Lambda package size, you can install only the specific components you need: |
| 199 | + |
| 200 | +```bash |
| 201 | +# Install core functionality only (minimal dependencies) |
| 202 | +pip install "idp_common[core]" |
| 203 | + |
| 204 | +# Install with OCR support |
| 205 | +pip install "idp_common[ocr]" |
| 206 | + |
| 207 | +# Install with classification support |
| 208 | +pip install "idp_common[classification]" |
| 209 | + |
| 210 | +# Install with extraction support |
| 211 | +pip install "idp_common[extraction]" |
| 212 | + |
| 213 | +# Install with image processing support |
| 214 | +pip install "idp_common[image]" |
| 215 | + |
| 216 | +# Install everything |
| 217 | +pip install "idp_common[all]" |
| 218 | + |
| 219 | +# Install multiple components |
| 220 | +pip install "idp_common[ocr,classification]" |
| 221 | +``` |
| 222 | + |
| 223 | +For Lambda functions, specify only the required components in requirements.txt: |
| 224 | + |
| 225 | +``` |
| 226 | +../../lib/idp_common_pkg[extraction] |
| 227 | +``` |
| 228 | + |
| 229 | +This ensures that only the necessary dependencies are included in your Lambda deployment package. |
| 230 | + |
| 231 | +## Development Notes |
| 232 | + |
| 233 | +This package has been refactored to use a unified Document-based approach across all services: |
| 234 | + |
| 235 | +1. All services now accept and return Document objects |
| 236 | +2. Each service updates the Document with its results |
| 237 | +3. Results are properly encapsulated in the Document model |
| 238 | +4. Large results (like extraction attributes) are stored in S3 with only URIs in the Document |
| 239 | + |
| 240 | +Key benefits: |
| 241 | +- Consistency across all services |
| 242 | +- Simplified data flow in serverless functions |
| 243 | +- Lower memory use and smaller payloads, since large results stay in S3 and the Document carries only their URIs |
| 244 | +- Improved maintainability with standardized interfaces |