Description
Tesseract.js version
tesseract.js v7.0.0 (latest npm release at time of writing)
Describe the bug
When using tesseract.js v7 in a long-running Node.js backend, calling worker.recognize() on image buffers causes continuous native memory growth (RSS / external / arrayBuffers), eventually exhausting system memory and swap.
This happens even though:
- A single worker is created once and reused
- OCR is not continuous: it is used only until a timestamp is successfully extracted, then not used for ~15 minutes, and invoked again only to refresh the offset
- The JavaScript heap remains stable
- The worker is not recreated
- OCR is skipped entirely most of the time once a valid timestamp offset is cached
Disabling tesseract.js completely eliminates the memory growth, which strongly suggests a native / WASM memory leak or unreleased buffers inside tesseract.js or its dependencies.
To Reproduce
- Create a Node.js application
- Create a single Tesseract worker once at startup
- Repeatedly call worker.recognize() on image buffers for a short period
- Stop calling OCR for several minutes
- Observe that memory does not return to baseline
- Resume OCR later and observe further memory growth
- Monitor process RSS / external memory over time
Simplified reproduction pattern:
import sharp from 'sharp';
import { createWorker } from 'tesseract.js';

// Create a single worker once and reuse it for every call
const worker = await createWorker('eng');
await worker.setParameters({
  tessedit_char_whitelist: '0123456789:/- ',
});

// Crop the timestamp overlay, binarize it, and run OCR on the result
async function runOcr(frame: Buffer) {
  const buffer = await sharp(frame)
    .extract({ left: 0, top: 0, width: 300, height: 80 })
    .grayscale()
    .normalize()
    .threshold(180)
    .toBuffer();
  await worker.recognize(buffer);
}
// OCR is called until timestamp is extracted,
// then skipped for ~15 minutes, then called again
Memory monitoring used to confirm the issue:
// Log memory usage once per minute
setInterval(() => {
  const usage = process.memoryUsage();
  console.log(
    `Memory Usage: RSS=${(usage.rss / 1024 / 1024).toFixed(2)}MB, ` +
    `HeapUsed=${(usage.heapUsed / 1024 / 1024).toFixed(2)}MB, ` +
    `arrayBuffersUsed=${(usage.arrayBuffers / 1024 / 1024).toFixed(2)}MB, ` +
    `externalUsed=${(usage.external / 1024 / 1024).toFixed(2)}MB`,
  );
}, 60000);
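To compare runs more easily, each snapshot can be diffed against a baseline taken right after the worker starts. The following is a minimal sketch, assuming hypothetical helper names `toMb` and `memoryDelta`; neither is part of tesseract.js or Node.js. The `MemorySnapshot` interface only mirrors the fields already logged above.

```typescript
// Hypothetical helpers for illustration; not part of tesseract.js or Node.js.
// The field names match those returned by process.memoryUsage().
interface MemorySnapshot {
  rss: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

const BYTES_PER_MB = 1024 * 1024;

// Convert a byte count to megabytes, rounded to two decimals.
function toMb(bytes: number): number {
  return Math.round((bytes / BYTES_PER_MB) * 100) / 100;
}

// Per-field growth between a baseline snapshot and a later one.
// Inputs are in bytes; the returned values are in MB.
function memoryDelta(baseline: MemorySnapshot, current: MemorySnapshot): MemorySnapshot {
  return {
    rss: toMb(current.rss - baseline.rss),
    heapUsed: toMb(current.heapUsed - baseline.heapUsed),
    external: toMb(current.external - baseline.external),
    arrayBuffers: toMb(current.arrayBuffers - baseline.arrayBuffers),
  };
}
```

The object returned by process.memoryUsage() has these fields, so it can be passed directly as either argument. A flat heapUsed delta alongside growing rss, external, and arrayBuffers deltas would match the pattern reported here.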
Observed behavior:
- heapUsed stays relatively flat
- RSS, external, and arrayBuffers grow steadily
- Memory is not reclaimed during long periods where OCR is not used
- No specific image is required; the issue reproduces with small cropped grayscale buffers from camera frames.
Expected behavior
- Native memory usage should stabilize after repeated recognize() calls
- Memory should not grow when the worker is idle for long periods
- Reusing a single worker intermittently should not cause unbounded RSS growth
- Memory should be reused internally or released back to the OS
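One way to turn these expectations into a check is to sample RSS during an idle period and test whether the series has levelled off. This is a minimal sketch only: `hasPlateaued`, the window size, and the tolerance are assumptions chosen for illustration, not thresholds defined by tesseract.js.

```typescript
// Hypothetical check for illustration: decides whether a series of RSS samples
// (in MB, oldest first) has stopped growing. The default window and tolerance
// are assumed values and should be tuned to the sampling interval.
function hasPlateaued(samplesMb: number[], window = 5, toleranceMb = 5): boolean {
  if (samplesMb.length < window) {
    return false; // not enough data to judge
  }
  // Look only at the most recent samples and measure how far they spread.
  const tail = samplesMb.slice(-window);
  const spread = Math.max(...tail) - Math.min(...tail);
  return spread <= toleranceMb;
}
```

With one sample per minute, the default window covers the last five minutes of an idle period. Under the expected behavior the check would eventually return true; under the behavior reported here it would keep returning false while OCR is in use.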
Device Version
OS: Debian 12 (Bookworm)
Node.js: Node v22.20.0
Additional context
- This runs in a long-lived backend service (NestJS)
- OCR is used only to extract a timestamp overlay from camera frames
- Once the timestamp offset is successfully extracted, OCR is disabled and the cached offset is reused
- Despite long idle periods, memory is not released
- Over time this leads to system swap exhaustion
- The worker is created once and only terminated on application shutdown
Concurrency / usage pattern note
In the real application, OCR is not awaited in the video frame pipeline in order to avoid blocking frame processing.
Instead, OCR runs asynchronously to update a shared timestamp offset stored in application state. The frame pipeline always reads the latest known offset (initially set to the local machine time) and continues processing frames without waiting for OCR to complete.
Even with this non-blocking usage pattern and very low OCR frequency, native memory continues to grow after worker.recognize() calls and is not released during long idle periods.
This makes it difficult to safely use tesseract.js in continuous or long-running server environments.
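For reference, the non-blocking pattern described in the concurrency note can be sketched roughly as follows. This is not the application's actual code: `OffsetState`, `maybeRefreshOffset`, and the injected `ocr` function are hypothetical names, and the in-flight guard is an assumption added so that un-awaited calls cannot pile up on the single worker. Frames are typed as `Uint8Array`, which a Node.js `Buffer` satisfies.

```typescript
// Hypothetical sketch for illustration; names are not from the original application.
// The OCR function resolves to a timestamp offset in ms, or null if none was extracted.
type OcrFn = (frame: Uint8Array) => Promise<number | null>;

class OffsetState {
  // Latest known offset; the frame pipeline reads this without waiting for OCR.
  offsetMs: number;
  private inFlight = false;
  private ocr: OcrFn;

  constructor(ocr: OcrFn, initialOffsetMs: number) {
    this.ocr = ocr;
    this.offsetMs = initialOffsetMs;
  }

  // Fire-and-forget refresh: starts OCR and returns without awaiting the result.
  // Returns true if a new OCR call was started, false if one is already running.
  maybeRefreshOffset(frame: Uint8Array): boolean {
    if (this.inFlight) {
      return false;
    }
    this.inFlight = true;
    this.ocr(frame)
      .then((offset) => {
        if (offset !== null) {
          this.offsetMs = offset;
        }
      })
      .catch(() => {
        // Keep the previous offset if OCR fails.
      })
      .finally(() => {
        this.inFlight = false;
      });
    return true;
  }
}
```

In this sketch the `ocr` argument would wrap the `runOcr` pattern shown earlier and parse the recognized text into an offset. The guard only limits how many recognize() calls are outstanding at once; it is not claimed to prevent the memory growth reported here.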