Commit 2e920b4

Merge branch 'fix/oom-errors' into 'develop'
OOM errors for large documents - Issue #35

See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!249
2 parents 3daffe1 + 172f57b commit 2e920b4

4 files changed

+257
-85
lines changed

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -45,6 +45,9 @@ SPDX-License-Identifier: MIT-0
4545
- **Fixed CloudWatch Log Group Missing Retention regression**
4646
- **Security: Cross-Site Scripting (XSS) Vulnerability in FileViewer Component** - Fixed high-risk XSS vulnerability in `src/ui/src/components/document-viewer/FileViewer.jsx` where `innerHTML` was used with user-controlled data
4747
- **Add permissions boundary support to new Lambda function roles introduced in previous releases**
48+
- **Fixed OutOfMemory Errors in Pattern-2 OCR Lambda for Large High-Resolution Documents**
49+
- **Root Cause**: Processing large PDFs with high-resolution images (7469×9623 pixels) caused memory spikes when 20 concurrent workers each held ~101MB images simultaneously, exceeding the 4GB Lambda memory limit
50+
- **Optimal Solution**: Refactored image extraction to render directly at target dimensions using PyMuPDF matrix transformations, completely eliminating oversized image creation
4851

4952
## [0.3.11]
5053
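As a rough sanity check on the figures in the changelog entry above, the arithmetic can be sketched in Python. The worker count and the ~101MB per-image size are taken from the entry, and the 951×1268 target is the example quoted in the README change in this commit; the helper name `pixel_count` is hypothetical and not part of the codebase.

```python
# Back-of-the-envelope check of the figures quoted in this commit.
FULL_SIZE = (7469, 9623)    # full-resolution page render, from the changelog entry
TARGET_SIZE = (951, 1268)   # resize target used as the README example
WORKERS = 20                # concurrent workers, from the changelog entry
MB_PER_FULL_IMAGE = 101     # approximate per-image size, from the changelog entry


def pixel_count(size):
    """Number of pixels in a (width, height) image. Hypothetical helper."""
    width, height = size
    return width * height


# How many times fewer pixels the target-size render has to hold.
reduction = pixel_count(FULL_SIZE) / pixel_count(TARGET_SIZE)

# Memory taken by full-size images alone when every worker holds one.
held_by_workers_mb = WORKERS * MB_PER_FULL_IMAGE

print(f"pixel reduction: about {reduction:.0f}x")
print(f"full-size images held at once: about {held_by_workers_mb} MB")
```

The second figure counts only the extracted page images. Intermediate copies made while resizing, plus the rest of the Lambda's working set, would presumably come on top of it.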

lib/idp_common_pkg/idp_common/ocr/README.md

Lines changed: 23 additions & 4 deletions
@@ -122,19 +122,38 @@ ocr:
122122
task_prompt: "Extract all text from this image..."
123123
```
124124
125+
### Memory-Optimized Image Extraction
126+
127+
The OCR service uses advanced memory optimization to prevent OutOfMemory errors when processing large high-resolution documents:
128+
129+
**Direct Size Extraction**: When resize configuration is provided (`target_width` and `target_height`), images are extracted directly at the target dimensions using PyMuPDF matrix transformations. This completely eliminates memory spikes from creating oversized images.
130+
131+
**Example for Large Document:**
132+
- **Original approach**: Extract 7469×9623 (101MB) → Resize to 951×1268 (5MB) → Memory spike
133+
- **Optimized approach**: Extract directly at 951×1268 (5MB) → No memory spike
134+
135+
**Preserved Logic**: The optimization maintains all existing resize behavior:
136+
- ✅ Never upscales images (only applies scaling when scale_factor < 1.0)
137+
- ✅ Preserves aspect ratio using `min(width_ratio, height_ratio)`
138+
- ✅ Handles edge cases (no config, images already smaller than targets)
139+
- ✅ Full backward compatibility
140+
125141
### DPI Configuration
126142

127-
The DPI (dots per inch) setting controls the resolution when extracting images from PDF pages:
143+
The DPI (dots per inch) setting controls the base resolution when extracting images from PDF pages:
128144
- **Default**: 150 DPI (good balance of quality and file size)
129-
- **Range**: 72-300 DPI
145+
- **Range**: 72-300 DPI
130146
- **Location**: `ocr.image.dpi` in the configuration
131147
- **Behavior**:
132148
- Only applies to PDF files (image files maintain their original resolution)
133-
- Higher DPI = better quality but larger file sizes
149+
- Combined with resize configuration for optimal memory usage
150+
- Higher DPI = better quality but larger file sizes (use with resize config for large documents)
134151
- 150 DPI is recommended for most OCR use cases
135-
- 300 DPI for documents with small text or fine details
152+
- 300 DPI for documents with small text or fine details (ensure resize config is set)
136153
- 100 DPI for simple documents to reduce processing time
137154

155+
**Memory Considerations**: For large documents with high DPI settings, always configure `target_width` and `target_height` to prevent memory issues. The service will intelligently extract at the optimal size.
156+
138157

139158
## Migration Guide
140159
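The direct-size extraction and "preserved logic" rules described in the README change above can be sketched as follows. This is a minimal illustration, not the code from this commit: the function names `extraction_scale` and `rendered_size` are hypothetical, and the actual implementation in `idp_common` may differ. The PyMuPDF calls appear only in a comment so the sketch runs without the library installed.

```python
def extraction_scale(page_width_pt, page_height_pt, dpi=150,
                     target_width=None, target_height=None):
    """Zoom factor for rendering a PDF page directly at its final size.

    Hypothetical helper. Page dimensions are in PDF points (1/72 inch).
    """
    base_zoom = dpi / 72.0  # zoom that renders the page at the requested DPI

    # No resize configuration: fall back to plain DPI-based extraction.
    if not target_width or not target_height:
        return base_zoom

    # Size the page would have if rendered at the full DPI.
    full_width = page_width_pt * base_zoom
    full_height = page_height_pt * base_zoom

    # Preserve aspect ratio by taking the smaller of the two ratios.
    scale_factor = min(target_width / full_width, target_height / full_height)

    # Never upscale: only shrink when the page exceeds the target box.
    if scale_factor >= 1.0:
        return base_zoom
    return base_zoom * scale_factor


def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixel size of the rendered page (hypothetical helper)."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


# With PyMuPDF, the zoom would be applied along these lines (not executed here):
#   pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))

# US Letter page (612 x 792 points) at 300 DPI with a 951 x 1268 target box.
zoom = extraction_scale(612, 792, dpi=300, target_width=951, target_height=1268)
width, height = rendered_size(612, 792, zoom)
print(width, height)
```

Because the scale is folded into the render matrix, the full 300 DPI bitmap is never allocated; only the target-size image exists in memory. Exact pixel dimensions produced by PyMuPDF may differ by a pixel from `rendered_size` because of its own rounding.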
