CHANGELOG.md
3 additions & 0 deletions
@@ -45,6 +45,9 @@ SPDX-License-Identifier: MIT-0
 - **Fixed CloudWatch Log Group Missing Retention regression**
 - **Security: Cross-Site Scripting (XSS) Vulnerability in FileViewer Component** - Fixed high-risk XSS vulnerability in `src/ui/src/components/document-viewer/FileViewer.jsx` where `innerHTML` was used with user-controlled data
 - **Add permissions boundary support to new Lambda function roles introduced in previous releases**
+- **Fixed OutOfMemory Errors in Pattern-2 OCR Lambda for Large High-Resolution Documents**
+  - **Root Cause**: Processing large PDFs with high-resolution images (7469×9623 pixels) caused memory spikes when 20 concurrent workers each held ~101MB images simultaneously, exceeding the 4GB Lambda memory limit
+  - **Optimal Solution**: Refactored image extraction to render directly at target dimensions using PyMuPDF matrix transformations, completely eliminating oversized image creation
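As a rough back-of-the-envelope check on the root cause above, the figures quoted in this diff (20 concurrent workers, ~101MB per full-size image, ~5MB per image at 951×1268) can be plugged into a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and it counts one image buffer per worker, not the temporary copies made while resizing or other Lambda runtime overhead.

```python
def peak_image_memory_mb(workers: int, mb_per_image: float) -> float:
    """Memory held if every worker keeps exactly one image in memory at once."""
    return workers * mb_per_image


# Figures taken from the changelog entry and the README change in this diff.
full_size = peak_image_memory_mb(20, 101)  # pages extracted at full resolution
direct = peak_image_memory_mb(20, 5)       # pages rendered at the target size

print(f"full-size extraction:   ~{full_size:.0f} MB of image buffers")
print(f"direct-size extraction: ~{direct:.0f} MB of image buffers")
```

The full-size buffers alone come to roughly 2GB, which leaves little headroom under a 4GB limit once resize copies are added; rendering at the target size keeps the same estimate near 100MB.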
lib/idp_common_pkg/idp_common/ocr/README.md
23 additions & 4 deletions
@@ -122,19 +122,38 @@ ocr:
   task_prompt: "Extract all text from this image..."
 ```
 
+### Memory-Optimized Image Extraction
+
+The OCR service uses advanced memory optimization to prevent OutOfMemory errors when processing large high-resolution documents:
+
+**Direct Size Extraction**: When resize configuration is provided (`target_width` and `target_height`), images are extracted directly at the target dimensions using PyMuPDF matrix transformations. This completely eliminates memory spikes from creating oversized images.
+- **Optimized approach**: Extract directly at 951×1268 (5MB) → No memory spike
+
+**Preserved Logic**: The optimization maintains all existing resize behavior:
+- ✅ Never upscales images (only applies scaling when scale_factor < 1.0)
+- ✅ Preserves aspect ratio using `min(width_ratio, height_ratio)`
+- ✅ Handles edge cases (no config, images already smaller than targets)
+- ✅ Full backward compatibility
+
 ### DPI Configuration
 
-The DPI (dots per inch) setting controls the resolution when extracting images from PDF pages:
+The DPI (dots per inch) setting controls the base resolution when extracting images from PDF pages:
 - **Default**: 150 DPI (good balance of quality and file size)
-- **Range**: 72-300 DPI
+- **Range**: 72-300 DPI
 - **Location**: `ocr.image.dpi` in the configuration
 - **Behavior**:
   - Only applies to PDF files (image files maintain their original resolution)
-  - Higher DPI = better quality but larger file sizes
+  - Combined with resize configuration for optimal memory usage
+  - Higher DPI = better quality but larger file sizes (use with resize config for large documents)
   - 150 DPI is recommended for most OCR use cases
-  - 300 DPI for documents with small text or fine details
+  - 300 DPI for documents with small text or fine details (ensure resize config is set)
   - 100 DPI for simple documents to reduce processing time
 
+**Memory Considerations**: For large documents with high DPI settings, always configure `target_width` and `target_height` to prevent memory issues. The service will intelligently extract at the optimal size.
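The direct-size extraction and the preserved resize rules described in the README change above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `compute_scale` and `render_page_png` are hypothetical names, and the only library calls assumed are PyMuPDF's `fitz.open`, `fitz.Matrix`, `Page.get_pixmap`, and `Pixmap.tobytes`.

```python
from typing import Optional


def compute_scale(page_w: float, page_h: float,
                  target_w: Optional[int], target_h: Optional[int]) -> float:
    """Return a scale factor <= 1.0 that fits the page inside the target box."""
    if not target_w or not target_h:
        return 1.0  # no resize config: keep the DPI-based size
    scale = min(target_w / page_w, target_h / page_h)  # preserve aspect ratio
    return min(scale, 1.0)  # never upscale


def render_page_png(pdf_path: str, page_index: int, dpi: int = 150,
                    target_w: Optional[int] = None,
                    target_h: Optional[int] = None) -> bytes:
    """Render one PDF page directly at its final size (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported lazily so compute_scale works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        zoom = dpi / 72.0  # PDF user space is 72 units per inch
        # Size the page would have if rendered at the configured DPI.
        full_w, full_h = page.rect.width * zoom, page.rect.height * zoom
        scale = compute_scale(full_w, full_h, target_w, target_h)
        # Fold the DPI zoom and the downscale into one matrix, so the
        # oversized intermediate image is never allocated.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom * scale, zoom * scale))
        return pix.tobytes("png")
```

For a 7469×9623 page and a 951×1268 target, `compute_scale` picks the width ratio (about 0.127), so the page is rendered once at the reduced size. A page already smaller than the target gets a factor of 1.0 and is left at its DPI-based size.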