Home
Use this guide to set up TejOCR and Tesseract OCR for each supported operating system.
- Prerequisites
- Download and Install TejOCR
- Install Tesseract OCR
- Install LibreOffice Python Dependencies
- Verify Installation
- Troubleshooting
- References
- LibreOffice installed
- Internet access for package downloads
- Permissions to install software/packages on your machine
- Open LibreOffice.
- Go to Tools → Extension Manager → Add.
- Select the latest `TejOCR-0.1.7.oxt` file.
- Restart LibreOffice after install.
- Open Writer and confirm menu entry: Tools → TejOCR.
Install core OCR engine first. This is required by TejOCR.
macOS (Homebrew):

    brew install tesseract

Check:

    which tesseract
    tesseract --version

Ubuntu/Debian:

    sudo apt update
    sudo apt install -y tesseract-ocr

Check:

    which tesseract
    tesseract --version

Fedora:

    sudo dnf install -y tesseract

Check:

    which tesseract
    tesseract --version

Windows:

Use a Windows Tesseract installer build from: https://github.com/UB-Mannheim/tesseract/wiki

After installation, verify:

    where tesseract
    tesseract --version

Install extra language packages if needed (examples):

    # Ubuntu/Debian
    sudo apt install -y tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu
    # macOS (formula specific languages may vary by package source)
    brew install tesseract-lang

TejOCR runs in LibreOffice’s embedded Python runtime, so dependencies must be installed there.
macOS: open a terminal and run:

    "/Applications/LibreOffice.app/Contents/Frameworks/LibreOfficePython.framework/Versions/Current/bin/python3" --version

If this path exists, install:

    "/Applications/LibreOffice.app/Contents/Frameworks/LibreOfficePython.framework/Versions/Current/bin/python3" -m pip install numpy pytesseract pillow

Windows:

    "C:\Program Files\LibreOffice\program\python.exe" --version
    "C:\Program Files\LibreOffice\program\python.exe" -m pip install numpy pytesseract pillow

Linux: common paths vary by distribution:

    /opt/libreoffice*
    /usr/bin/libreoffice
    /usr/lib/libreoffice/program

Find the exact binary, then:

    /path/to/libreoffice/python -m pip install numpy pytesseract pillow

From the TejOCR folder:

    python3 install_dependencies.py

If the script does not work in your environment, use the manual commands above.
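The manual steps above can be sketched as a small helper that finds the LibreOffice interpreter and prints the matching pip command. This is only an illustration: `find_lo_python` and `pip_install_command` are hypothetical names, not part of TejOCR, and the Linux candidate path is an assumption built from one of the directories listed above.

```python
import os
import sys

# Candidate interpreter locations taken from this guide. The Linux entry is an
# assumption: it appends "python" to a directory listed above and may not
# exist on your distribution.
CANDIDATES = {
    "darwin": [
        "/Applications/LibreOffice.app/Contents/Frameworks/"
        "LibreOfficePython.framework/Versions/Current/bin/python3",
    ],
    "win32": [r"C:\Program Files\LibreOffice\program\python.exe"],
    "linux": ["/usr/lib/libreoffice/program/python"],
}

PACKAGES = ["numpy", "pytesseract", "pillow"]


def find_lo_python(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def pip_install_command(python_path, packages=PACKAGES):
    """Build the argv list for installing packages with a given interpreter."""
    return [python_path, "-m", "pip", "install", *packages]


if __name__ == "__main__":
    found = find_lo_python(CANDIDATES.get(sys.platform, []))
    if found is None:
        print("LibreOffice Python not found; locate it manually.")
    else:
        print(" ".join(pip_install_command(found)))
```

The script only prints the command, so you can review the interpreter path before running pip yourself.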
Use TejOCR UI:
- Open LibreOffice Writer.
- Go to Tools → TejOCR → Settings.
- Confirm:
- Tesseract status shows installed version
- Python dependency status shows NumPy, Pytesseract, Pillow as available
Use CLI quick check:

    tesseract --version
    python3 install_dependencies.py

Inside LibreOffice:
- open Tools → TejOCR → Settings
- test Tesseract path and dependencies directly from UI
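The same checks can be scripted. The sketch below is not TejOCR's own checker; the function names are hypothetical. It reports whether `tesseract` is on `PATH` and whether the three Python packages are importable by the interpreter that runs it, so it is only meaningful when run with LibreOffice's Python.

```python
import importlib.util
import shutil
import subprocess

# Import names for the packages this guide installs; the "pillow" package is
# imported as "PIL".
REQUIRED_MODULES = ["numpy", "pytesseract", "PIL"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_version_line():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds write version info to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print("tesseract:", tesseract_version_line() or "not found on PATH")
    print("missing modules:", missing_modules() or "none")
```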
Most often this is caused by invalid extension metadata or missing references.
Check:
- `description.xml` is valid XML
- License path is present and correct (referenced file exists in extension package)
- No malformed XML entities in metadata files
- Icon paths in `description.xml` are valid and point to existing files
Then rebuild and reinstall the .oxt.
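The checklist above can be partly automated. An `.oxt` file is a zip archive, so the sketch below opens it, parses `description.xml`, and confirms that every local `xlink:href` target (such as a license or icon file) exists in the package. `check_oxt` is a hypothetical helper written for this guide, not a TejOCR or LibreOffice tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# ElementTree exposes the xlink:href attribute under its full namespace.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def check_oxt(oxt_file):
    """Return a list of problems found in an .oxt archive's description.xml."""
    problems = []
    with zipfile.ZipFile(oxt_file) as archive:
        names = set(archive.namelist())
        if "description.xml" not in names:
            return ["description.xml is missing from the package"]
        try:
            root = ET.fromstring(archive.read("description.xml"))
        except ET.ParseError as exc:
            # Covers malformed markup and undefined XML entities.
            return [f"description.xml is not valid XML: {exc}"]
        # Local references should point at files inside the package;
        # absolute URLs (for example update links) are skipped.
        for element in root.iter():
            href = element.get(XLINK_HREF)
            if href and "://" not in href and href not in names:
                problems.append(f"referenced file not found: {href}")
    return problems
```

For example, `check_oxt("TejOCR-0.1.7.oxt")` returns an empty list when no problem is detected.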
- Ensure the same tesseract binary used in terminal is also reachable from LibreOffice runtime.
- Reinstall LO Python packages using the exact LO Python path.
- Restart LibreOffice and reopen the Settings page to force refreshed checks.
- If using image replacement mode, confirm selected object is a supported image/shape.
- For cursor insertion, keep cursor in a text area and avoid selection of unsupported elements.
TejOCR has two places where OCR option values are configured.
- Settings (`Tools → TejOCR → Settings`) stores defaults that persist across sessions.
  - `DefaultQualityPreset` (`fast`, `balanced`, `accurate`, `custom`)
  - `DefaultPSM`
  - `DefaultOEM`
  - `DefaultScaleFactor`
  - grayscale / binarize / invert / improve image flags
  - `ShowPreviewBeforeOutput`
- OCR Options dialog for each run (`OCR Selected Image` or `OCR Image from File`) can override the defaults with the same fields before execution.

This gives users stable defaults in Settings while still letting them experiment per image in the options dialog.
A preset is a profile that applies an initial set of values to the advanced controls.
- `fast`: `psm=11`, `oem=3`, scale `1.0`, grayscale off, binarize off
- `balanced` (default): `psm=3`, `oem=3`, scale `1.0`, grayscale on
- `accurate`: `psm=6`, `oem=3`, scale `1.5`, grayscale on, binarize on, improve image on
- `custom`: uses the manual `psm`, `oem`, scale, and preprocessing values directly

When `custom` is chosen, the engine uses the values currently set in the UI.
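A minimal sketch of how the presets above could resolve to concrete values. `OcrOptions` and `resolve_options` are hypothetical names chosen for illustration; the actual resolution logic lives in TejOCR's own modules and may differ. Flags that the preset list does not mention are assumed to be off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrOptions:
    psm: int = 3
    oem: int = 3
    scale: float = 1.0
    grayscale: bool = False
    binarize: bool = False
    improve_image: bool = False


# Profile values copied from the preset list above; unmentioned flags are
# assumed to be off.
PRESETS = {
    "fast": OcrOptions(psm=11, oem=3, scale=1.0, grayscale=False, binarize=False),
    "balanced": OcrOptions(psm=3, oem=3, scale=1.0, grayscale=True),
    "accurate": OcrOptions(psm=6, oem=3, scale=1.5, grayscale=True,
                           binarize=True, improve_image=True),
}


def resolve_options(preset, manual):
    """Return the preset profile, or the manual values when preset is 'custom'."""
    if preset == "custom":
        return manual
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return PRESETS[preset]
```

With this shape, `resolve_options("custom", manual)` passes the manual values through untouched, while any named preset ignores them.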
PSM (page segmentation mode) controls how Tesseract analyzes page layout before recognition.

| PSM | Meaning |
|---|---|
| 0 | Orientation and script detection only |
| 1 | Automatic page segmentation with OSD |
| 2 | Automatic page segmentation, no OSD |
| 3 | Fully automatic, no OSD (default) |
| 4 | Single column of text with variable sizes |
| 5 | Single uniform block of vertical text |
| 6 | Single uniform block of text |
| 7 | Single text line |
| 8 | Single word |
| 9 | Single word in a circle |
| 10 | Single character |
| 11 | Sparse text |
| 12 | Sparse text with OSD |
| 13 | Raw line |
OEM (OCR engine mode) selects which recognition engine Tesseract uses.

| OEM | Meaning |
|---|---|
| 0 | Legacy engine only |
| 1 | Neural nets LSTM only |
| 2 | Legacy + LSTM |
| 3 | Auto selection (default) |
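Tesseract accepts these modes through its `--psm` and `--oem` command-line options. The sketch below builds that option string with range checks taken from the two tables; `tesseract_config` is a hypothetical helper, and passing the string through pytesseract's `config` argument is an assumption about usage, not a description of TejOCR internals.

```python
def tesseract_config(psm=3, oem=3):
    """Build a Tesseract option string from a PSM and an OEM value."""
    if psm not in range(0, 14):
        raise ValueError(f"psm must be 0-13, got {psm}")
    if oem not in range(0, 4):
        raise ValueError(f"oem must be 0-3, got {oem}")
    return f"--psm {psm} --oem {oem}"


# Example use with pytesseract (not executed here; "scan.png" is a placeholder):
#   import pytesseract
#   from PIL import Image
#   text = pytesseract.image_to_string(Image.open("scan.png"),
#                                      config=tesseract_config(6, 3))
```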
- `ShowPreviewBeforeOutput` controls whether OCR text is shown in a preview window before insertion.
- If the session does not support LibreOffice multiline dialog controls, TejOCR uses a compatibility preview summary and proceeds with insertion when allowed.
- If preview is disabled, text is inserted immediately in the selected output mode.

Preview can be toggled in Settings and, for each run, in the OCR options dialog.
flowchart TD
classDef start fill:#0f62fe,color:#ffffff,stroke:#003cb3,stroke-width:1.5px
classDef process fill:#1f6feb,color:#ffffff,stroke:#1347a0,stroke-width:1px
classDef decision fill:#f7b731,color:#1f2937,stroke:#b5880a,stroke-width:1.5px
classDef success fill:#22c55e,color:#ffffff,stroke:#15803d,stroke-width:1px
classDef fallback fill:#ef4444,color:#ffffff,stroke:#991b1b,stroke-width:1px
classDef preview fill:#fb7185,color:#ffffff,stroke:#be123c,stroke-width:1px
A["User starts OCR action"]:::start --> B["Load default OCR options from settings"]:::process
B --> C["Read current OCR options dialog values"]:::process
C --> D{"Preset = custom?"}:::decision
D -- No --> E["Apply preset profile (psm, oem, scale, preprocessing)"]:::process
D -- Yes --> F["Use manual option values from dialog"]:::process
E --> G["Final options object"]:::process
F --> G
G --> H["perform_ocr()"]:::process
H --> I["Run OCR attempts: fallback OEM list and fallback PSM list"]:::fallback
I --> J{"Text found?"}:::decision
J -- yes --> K["Optional preview then insert in selected output mode"]:::success
J -- no --> L["Auto-enhanced preprocessing fallback"]:::fallback
L --> I
+------------------------------+
| Start OCR action             |
+--------------+---------------+
               |
               v
+------------------------------+
| _build_default_ocr_options() |
+--------------+---------------+
               |
               v
+------------------------------+
| _normalize_dialog_result()   |
| - preset/psm/oem/scale flags |
+--------------+---------------+
               |
  +------------+-----------+
  | Preset is custom?      |
  | no -> profile overrides|
  | yes -> manual values   |
  +------------+-----------+
               |
               v
    +----------------------+
    | _perform_ocr_with... |
    +----------+-----------+
               |
               v
    +----------------------+
    | _fallback_oem_values |
    | _fallback_psm_values |
    +----------+-----------+
               |
               v
    +----------------------+
    | Preview (if enabled) |
    | then output router   |
    +----------------------+
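The attempt-and-fallback stage in the diagrams can be sketched as nested loops over OEM and PSM candidates, followed by one retry with enhanced preprocessing. Everything here is illustrative: `ocr_with_fallbacks`, the `run_ocr` callback, the default fallback lists, and the single enhanced pass are assumptions, not the behavior of TejOCR's engine.

```python
def ocr_with_fallbacks(run_ocr, psm, oem,
                       fallback_psms=(3, 6, 11), fallback_oems=(3, 1)):
    """Try the requested (oem, psm) first, then fallbacks, until text is found.

    run_ocr is a caller-supplied function (psm, oem, enhanced) -> str.
    Returns (text, psm, oem, enhanced), or ("", None, None, None) if all fail.
    """
    # Requested values go first; duplicates are dropped while keeping order.
    psms = list(dict.fromkeys([psm, *fallback_psms]))
    oems = list(dict.fromkeys([oem, *fallback_oems]))
    # The first pass uses the image as configured; the second pass stands in
    # for the auto-enhanced preprocessing fallback.
    for enhanced in (False, True):
        for try_oem in oems:
            for try_psm in psms:
                text = run_ocr(try_psm, try_oem, enhanced).strip()
                if text:
                    return text, try_psm, try_oem, enhanced
    return "", None, None, None
```

Bounding the retry to one enhanced pass keeps the loop finite even when no attempt produces text.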
- Start with `balanced + psm=3 + oem=3`.
- For sparse text, try `psm=11` and `Preset=custom`.
- For noisy low-contrast scans, use `Preset=accurate`, `scale=1.5`, and keep grayscale/binarize on.
For a deeper method-level reference, see:
- `reference/ocr-options-and-engine-tuning.md`
- `python/tejocr/constants.py` (preset/mode constants)
- `python/tejocr/tejocr_service.py` (option resolution)
- `python/tejocr/tejocr_engine.py` (attempt and fallback loops)
The commands below are the practical defaults used by TejOCR users.
macOS:

| Task | Command |
|---|---|
| Install OCR engine | `brew install tesseract` |
| Install LO Python dependencies | `/Applications/LibreOffice.app/Contents/Frameworks/LibreOfficePython.framework/Versions/Current/bin/python3 -m pip install numpy pytesseract pillow` |
| Check OCR path | `which tesseract` |

Windows:

| Task | Command |
|---|---|
| Install OCR engine | Download and install from UB-Mannheim release page |
| Install LO Python dependencies | `"C:\Program Files\LibreOffice\program\python.exe" -m pip install numpy pytesseract pillow` |
| Check OCR path | `where tesseract` |

Ubuntu/Debian:

| Task | Command |
|---|---|
| Install OCR engine | `sudo apt update && sudo apt install -y tesseract-ocr` |
| Install LO Python dependencies | `sudo apt install -y python3-pip`, then use the LibreOffice Python path with pip |
| Check OCR path | `which tesseract` |

Fedora:

| Task | Command |
|---|---|
| Install OCR engine | `sudo dnf install -y tesseract` |
| Install LO Python dependencies | Use your LibreOffice Python interpreter + pip |
| Check OCR path | `which tesseract` |

Arch Linux:

| Task | Command |
|---|---|
| Install OCR engine | `sudo pacman -S tesseract` |
| Install LO Python dependencies | Use your distro package path for LibreOffice python |
| Check OCR path | `which tesseract` |

    which tesseract
    python3 -c "import sys; print(sys.executable)"

Then run pip via that exact interpreter for numpy, pytesseract, and pillow.
For exact OCR command references and project links:
- Tesseract OCR source repository: https://github.com/tesseract-ocr/tesseract
- Tesseract docs and usage guides: https://tesseract-ocr.github.io/
- TejOCR repository: https://github.com/varshneydevansh/TejOCR