Reviewed: 2026-03-07
Release target: 0.1.9
This review covers the current TejOCR runtime and packaging state after the OCR hardening work:
- OCR engine execution path
- PDF helper/runtime detection
- setup and diagnostics command generation
- user-configurable executable paths
- logging and local data exposure
- package metadata consistency
TejOCR remains a local, offline LibreOffice extension with no network transport, no remote code download, and no document-macro execution path added by the OCR runtime. That keeps the external attack surface relatively small.
There is no obvious remote-code-execution issue in the hardened OCR pipeline as currently implemented. The main remaining risks are:
- Trusted executable path risk: a user- or environment-provided Tesseract path is still treated as trusted and may point to any local executable.
- Local privacy exposure risk: OCR output and source metadata may be shown in dialogs, copied to clipboard, or exposed to a local user via the log file and UI workflow.
- Packaging consistency risk: release metadata and manifest references must stay aligned with packaged files to avoid stale or broken install-time metadata behavior.
The runtime previously risked launching unsafe LibreOffice helper wrappers on macOS while building install guidance. The hardened runtime now:
- rejects Contents/Resources/python
- rejects LibreOfficePython
- rejects helper scripts like python3-config / python3.11-config
- avoids executing candidate Python interpreters just to print pip commands
This materially reduced a local stability and trust issue in setup/diagnostics and PDF-helper guidance paths.
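As a rough illustration of non-executing detection, the sketch below classifies a candidate interpreter purely by its path. The function name and the regular expression are assumptions made for this example and are not taken from the TejOCR source; only the rejected names come from the list above.

```python
import re
from pathlib import PurePosixPath

# Hypothetical deny rule for helper scripts such as python3-config
# or python3.11-config.
_CONFIG_SCRIPT = re.compile(r"^python[\d.]*-config$")


def is_unsafe_python_candidate(path: str) -> bool:
    """Classify a candidate by its path only; the candidate is never run."""
    p = PurePosixPath(path)
    if p.name == "LibreOfficePython":
        return True
    if _CONFIG_SCRIPT.match(p.name):
        return True
    # The macOS LibreOffice launcher lives at .../Contents/Resources/python.
    if p.parts[-3:] == ("Contents", "Resources", "python"):
        return True
    return False
```

Because the check is a pure string comparison, install guidance can be built from the surviving candidates without spawning any process.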
Preset behavior is now bounded and explicit:
- Fast: 1 exact attempt
- Balanced: 1 exact attempt + at most 1 recovery
- Accuracy: 1 exact attempt + at most 1 enhanced recovery
- Custom: strict user-selected values
This reduces the old "unexpected fallback chain" behavior that made OCR output harder to reason about.
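One way to picture the bounded behavior is a fixed attempt budget per preset. The sketch below is illustrative only: AttemptBudget, PRESET_BUDGETS, and run_with_budget are hypothetical names, and Custom is left out because its values come directly from the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttemptBudget:
    exact: int                       # attempts with the exact requested settings
    recovery: int                    # upper bound on recovery attempts
    enhanced_recovery: bool = False  # whether the recovery uses enhanced mode


# Hypothetical table mirroring the preset list above.
PRESET_BUDGETS = {
    "Fast": AttemptBudget(exact=1, recovery=0),
    "Balanced": AttemptBudget(exact=1, recovery=1),
    "Accuracy": AttemptBudget(exact=1, recovery=1, enhanced_recovery=True),
}


def max_attempts(preset: str) -> int:
    budget = PRESET_BUDGETS[preset]
    return budget.exact + budget.recovery


def run_with_budget(preset, attempt, is_acceptable):
    """Run one exact attempt, then at most one recovery if the preset allows it."""
    budget = PRESET_BUDGETS[preset]
    result = attempt("exact")
    tries = 1
    if budget.recovery and not is_acceptable(result):
        mode = "enhanced" if budget.enhanced_recovery else "recovery"
        result = attempt(mode)
        tries += 1
    return result, tries
```

With a table like this, the worst-case number of engine invocations per image is known before OCR starts, which is what makes the output easier to reason about.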
PDF OCR now:
- streams pages instead of rasterizing the full document up front
- uses preset-driven base DPI
- rerenders weak pages when necessary
This reduces resource spikes and makes PDF behavior more predictable for both single-page and multi-page documents.
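The streaming behavior can be sketched as a generator that holds one page image at a time. Everything here is an assumption for illustration: render and recognize stand in for whatever rasterizer and OCR call the runtime uses, and the confidence threshold and DPI values are placeholders supplied by the caller.

```python
from typing import Callable, Iterator, Tuple


def ocr_pdf_pages(
    page_count: int,
    render: Callable[[int, int], object],            # (page index, dpi) -> image
    recognize: Callable[[object], Tuple[str, float]],  # image -> (text, confidence)
    base_dpi: int,
    retry_dpi: int,
    min_confidence: float,
) -> Iterator[Tuple[int, str]]:
    """Yield (page index, text) one page at a time instead of rasterizing everything."""
    for index in range(page_count):
        text, confidence = recognize(render(index, base_dpi))
        if confidence < min_confidence:
            # Weak page: rerender once at a higher DPI and keep the better result.
            retry_text, retry_confidence = recognize(render(index, retry_dpi))
            if retry_confidence > confidence:
                text = retry_text
        yield index, text
```

Peak memory in this shape is bounded by a single rendered page, and each page costs at most two render calls.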
Status: still present by design
Severity: Medium
Applies to: local machine trust boundary only
TejOCR allows users or environment variables to provide a custom Tesseract executable path. If a malicious or incorrect executable is configured, TejOCR will treat it as the OCR engine and execute it.
Why this is not currently a remote vulnerability:
- the path must already be controlled locally,
- TejOCR does not fetch executables over the network,
- this is equivalent to trusting a local OCR binary configured by the user or administrator.
Recommended follow-up:
- warn explicitly in UI that custom Tesseract path must point to a trusted local binary,
- optionally verify basename/metadata more strictly before save/run,
- add a "reset to auto-detect" fast path in settings.
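A stricter pre-save check along the lines of the second recommendation might look like the sketch below. The function name and the allowed basenames are assumptions for this example. A check like this catches misconfiguration only; it does not make an attacker-controlled path safe, since a malicious binary can simply be named tesseract.

```python
import os
from pathlib import Path
from typing import List

# Hypothetical allow-list of basenames; adjust per platform as needed.
ALLOWED_BASENAMES = {"tesseract", "tesseract.exe"}


def check_tesseract_path(path: str) -> List[str]:
    """Return a list of problems with a configured path; empty means it looks plausible."""
    problems = []
    p = Path(path)
    if p.name.lower() not in ALLOWED_BASENAMES:
        problems.append("basename is not tesseract")
    if not p.is_file():
        problems.append("not an existing regular file")
    elif not os.access(p, os.X_OK):
        problems.append("file is not executable")
    return problems
```

A settings dialog could show the returned problems next to the path field and offer the reset to auto-detect when the list is not empty.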
Status: still present
Severity: Medium
Applies to: local privacy, not code execution
OCR output may contain sensitive personal or document data. That data can currently be:
- shown in review dialogs,
- shown in completion dialogs,
- copied to the system clipboard,
- visible in document insertion targets,
- partially reflected in source labels and summaries.
Current mitigating factors:
- runtime diagnostics do not intentionally log full OCR text,
- preview fallback truncates oversized dialog content,
- logging is configured to favor ERROR level in production.
Recommended follow-up:
- keep full OCR text out of logs,
- reduce how much OCR content is shown in completion dialogs by default,
- favor compact summaries plus explicit technical-details sections,
- document clipboard/privacy implications in user help.
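To keep full OCR text out of logs, a log call can describe the result by size rather than by content. The helper below is a hypothetical sketch, not the existing TejOCR logging code, and the logger name is invented for the example.

```python
import logging

logger = logging.getLogger("tejocr.sketch")  # hypothetical logger name


def ocr_log_summary(text: str) -> str:
    """Describe OCR output by size only; the text itself is never included."""
    return "ocr_result chars=%d lines=%d" % (len(text), len(text.splitlines()))


def log_ocr_success(text: str) -> None:
    # Logged at DEBUG, so it stays silent when production favors ERROR level.
    logger.debug(ocr_log_summary(text))
```

Routing every success-path log line through a helper like this makes it easier to verify that extracted document text never reaches the log file.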
Status: partially mitigated in 0.1.9
Severity: Medium
Applies to: installation integrity and release correctness
description.xml and META-INF/manifest.xml reference release-note and description files. If the build script omits those files, LibreOffice metadata behavior can become stale or inconsistent.
Recommended follow-up:
- keep build_tejocr.py, description.xml, and META-INF/manifest.xml in sync,
- validate packaged files against manifest entries as part of release build checks,
- keep release notes and description text files included in the final .oxt.
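A release-build check for the manifest half of this could be sketched as follows. The function name is hypothetical, and the sketch assumes manifest entries carry a full-path attribute on file-entry elements. It matches on local names so it does not depend on a specific XML namespace URI. Files referenced from description.xml would need a separate, similar pass.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List


def missing_manifest_entries(oxt_path: str) -> List[str]:
    """List manifest full-path values that have no matching file in the .oxt."""
    with zipfile.ZipFile(oxt_path) as oxt:
        names = set(oxt.namelist())
        root = ET.fromstring(oxt.read("META-INF/manifest.xml"))

    missing = []
    for element in root.iter():
        if not element.tag.endswith("file-entry"):
            continue
        for key, value in element.attrib.items():
            if not key.endswith("full-path"):
                continue
            if value.endswith("/"):
                # Directory entry: accept if any packaged file sits under it.
                present = any(name.startswith(value) for name in names)
            else:
                present = value in names
            if not present:
                missing.append(value)
    return sorted(missing)
```

The build script could fail the release when this returns a non-empty list, so a stale manifest reference is caught before the package ships.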
- Hidden executor rollback (HiddenOcrExecutor) is a maintainer rollback switch, not a user-facing security control.
- OCR preview and completion density are usability issues first; they do not currently amount to a security defect by themselves.
- The extension remains offline and local-only; no cloud OCR or remote API path is involved.
- Add a warning label near custom Tesseract path that this field must point to a trusted local executable.
- Add a release-build check that compares META-INF/manifest.xml entries with actual packaged files.
- Redesign review/completion/setup dialogs so sensitive content is shown more deliberately and technical details are not dumped by default.
- Keep runtime helper detection purely non-executing for any future install-guidance features.
python3 -m unittest discover -s tests
- package build includes release-note and descriptor text files referenced by metadata
- settings/setup no longer spawn unsafe LibreOffice Python helper launchers
- OCR logs do not contain full extracted document text during normal success paths