Fix missing or wrong PDF /ToUnicode CMap entries so text extraction and copy-paste match what you see on the page. The primary use case is Tibetan stacked syllables (Monlam, Himalaya, Jomolhari): producers often embed incomplete or incorrect Unicode mappings for ligature glyphs. The same mechanism applies to any Type0 / Identity-H font present in the bundled GSUB-derived database.
GitHub: OpenPecha/pdf-cmap-fix
Documentation: docs/README.md · Glossary & JSON formats · Approach · Font inventory (962 keys)
- Installation
- Quick start (CLI)
- Python API reference
- Bundled reverse database (font sources)
- Updating reverse_db.json in the future
- Migration from tibetan-pdf-fix
- Supported fonts & limits
- How it works
- Example results
- Project structure
- License
Requires Python 3.8+, PyMuPDF (fitz), and fontTools (declared in pyproject.toml).
Install the package directly from this repository:
pip install "pdf-cmap-fix @ git+https://github.com/OpenPecha/pdf-cmap-fix.git"
Equivalent shorthand:
pip install git+https://github.com/OpenPecha/pdf-cmap-fix.git
Editable checkout for development:
git clone https://github.com/OpenPecha/pdf-cmap-fix.git
cd pdf-cmap-fix
pip install -e ".[dev]"
Run tests:
pytest
After install, the CLI pdf-cmap-fix is on your PATH. The bundled database ships inside the wheel/sdist as package data (pdf_cmap_fix/data/reverse_db.json).
pdf-cmap-fix document.pdf
# writes: document.raw.txt document.patched.txt document.diff.txt
pdf-cmap-fix doc1.pdf doc2.pdf doc3.pdf
Patched PDF only (same ToUnicode logic; does not overwrite the input):
pdf-cmap-fix --patch-pdf document.pdf
# writes: document.patched.pdf
pdf-cmap-fix -p doc1.pdf doc2.pdf # short form
Dump merged ToUnicode data as JSON (does not modify the PDF):
pdf-cmap-fix --dump-cmap cmap.json document.pdf
# multiple PDFs → cmap_<stem>.json per file
Large PDFs with many Type0 font objects can make --dump-cmap slow and the JSON huge; prefer build_tounicode_dict in Python if you need to filter by font name or xref.
On Windows, the default CLI prints Tibetan previews using the console encoding; if you see UnicodeEncodeError, switch the terminal to UTF-8 (for example chcp 65001) or call extract_pdf_text(..., verbose=False) from Python so nothing is printed to the console.
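If changing the console code page is not practical, a calling Python script can try to switch its own stdout to UTF-8 before asking for verbose output. The helper below is a hypothetical sketch that uses only the standard library; it is not part of pdf-cmap-fix and only affects the process it runs in.

```python
import sys


def force_utf8_stdout() -> bool:
    """Try to switch sys.stdout to UTF-8; return True if it worked."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        # stdout was replaced by an object without reconfigure() (e.g. StringIO)
        return False
    try:
        # errors="replace" substitutes unencodable characters instead of raising
        reconfigure(encoding="utf-8", errors="replace")
    except (ValueError, OSError):
        return False
    return True
```

Call force_utf8_stdout() once at the start of the script, before any call that prints Tibetan text.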
Import the public API from pdf_cmap_fix:
from pdf_cmap_fix import (
    extract_pdf_text,
    patch_pdf,
    build_tounicode_dict,
    collect_font_merges,
    patch_doc,
    extract_all,
)
Optional: load a custom JSON database with json.loads(...) and pass it as rev_db= where supported.
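A minimal sketch of loading a custom database from disk. The helper name load_rev_db and the top-level type check are assumptions made for illustration; the only thing taken from this README is that the database is JSON shaped as font key → { GID string → Unicode string }.

```python
import json


def load_rev_db(path):
    """Load a reverse database: {font key: {GID string: Unicode string}}."""
    with open(path, encoding="utf-8") as fh:
        db = json.load(fh)
    # Cheap sanity check only; field-level validation is left to the caller.
    if not isinstance(db, dict):
        raise ValueError("reverse database must be a JSON object at top level")
    return db
```

The returned dict can then be passed as rev_db= to the functions that accept it, for example extract_pdf_text("document.pdf", rev_db=load_rev_db("custom_db.json")), where custom_db.json is a placeholder file name.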
| Function | Purpose |
|---|---|
| extract_pdf_text | Opens the PDF twice: extract raw text, then patch ToUnicode in memory and extract again. Can write .raw.txt, .patched.txt, .diff.txt. |
| patch_pdf | Applies merged ToUnicode streams and returns bytes (and optionally writes *.patched.pdf). |
| build_tounicode_dict | No PDF mutation: returns per-font existing / merged / overrides plus stats. |
| collect_font_merges | Lower-level: scan the document and compute merge records without writing streams. |
| patch_doc | Apply merges to an already-open fitz.Document using collect_font_merges + stream updates. |
| extract_all | Extract plain text from every page (with whitespace/ligature flags); used inside extract_pdf_text. |
extract_pdf_text(
    pdf_path,
    output_dir=None,
    write_files=True,
    rev_db=None,
    *,
    verbose=False,
) -> dict

| Return key | Type | Description |
|---|---|---|
| raw | str | Text extracted before patching. |
| patched | str | Text extracted after ToUnicode merge. |
| stats | dict | fonts_seen, patched, upgrades, no_change, no_match. |
| diff_lines | list | Line indices and raw/patched pairs where lines differ. |
| char_delta | int | len(patched) - len(raw). |
If write_files is true (default), writes {stem}.raw.txt, {stem}.patched.txt, {stem}.diff.txt next to the PDF (or under output_dir).
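To get a quick overview of what a run changed, the returned dict can be condensed into a few numbers. The helper summarize_result below is hypothetical; it relies only on the return keys documented above and makes no assumption about the shape of the individual diff_lines entries.

```python
def summarize_result(result):
    """Condense an extract_pdf_text-style result dict into a small summary."""
    raw, patched = result["raw"], result["patched"]
    return {
        "raw_chars": len(raw),
        "patched_chars": len(patched),
        # Same definition as the documented char_delta key
        "char_delta": len(patched) - len(raw),
        "changed_lines": len(result["diff_lines"]),
        "text_changed": raw != patched,
    }
```

A positive char_delta is typical when stacked syllables that were extracted as a single base letter gain their subjoined letters after patching.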
patch_pdf(
    pdf_path,
    output_path=None,
    write_file=True,
    rev_db=None,
    *,
    verbose=False,
) -> dict

| Return key | Description |
|---|---|
| pdf_bytes | Patched PDF as bytes. |
| stats | Same counters as above. |
| output_path | Path where the file was written, or None if write_file=False. |
Default output path: {stem}.patched.pdf beside the input.
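When calling with write_file=False, the caller decides where pdf_bytes goes. The sketch below mirrors the documented default naming ({stem}.patched.pdf beside the input) with pathlib; both helper names are hypothetical, and the overwrite guard is an added safety check rather than documented library behaviour.

```python
from pathlib import Path


def default_patched_path(pdf_path):
    """Return '<stem>.patched.pdf' in the same directory as pdf_path."""
    p = Path(pdf_path)
    return p.with_name(p.stem + ".patched.pdf")


def save_patched(result, pdf_path, output_path=None):
    """Write result['pdf_bytes'] to output_path (or the default) and return it."""
    target = Path(output_path) if output_path else default_patched_path(pdf_path)
    # Never write over the input file
    if target.resolve() == Path(pdf_path).resolve():
        raise ValueError("refusing to overwrite the input PDF")
    target.write_bytes(result["pdf_bytes"])
    return target
```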
build_tounicode_dict(pdf_path, rev_db=None) -> dict
Returns fonts (list of per-font records), by_font_xref (dict keyed by xref string), and stats. See docs/glossary-and-json.md for field-level documentation.
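Filtering the result by font name or xref can be sketched as follows. The helper select_fonts is hypothetical, and it assumes each per-font record carries a pdf_font_name field as the collect_font_merges records do; check docs/glossary-and-json.md for the actual field names.

```python
def select_fonts(result, name_contains=None, xref=None):
    """Pick per-font records from a build_tounicode_dict-style result."""
    if xref is not None:
        # by_font_xref is keyed by the xref rendered as a string
        record = result["by_font_xref"].get(str(xref))
        return [record] if record is not None else []
    fonts = result["fonts"]
    if name_contains is None:
        return list(fonts)
    needle = name_contains.lower()
    # ASSUMPTION: records expose 'pdf_font_name'
    return [f for f in fonts
            if needle in str(f.get("pdf_font_name", "")).lower()]
```

Serialising only the selected records with json.dumps(..., ensure_ascii=False) keeps the output small and leaves Tibetan text readable in the JSON.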
collect_font_merges(doc: fitz.Document, rev_db: dict, *, verbose=False) -> tuple[list[dict], dict]
Returns (records, stats). Each record includes font_xref, to_unicode_xref, pdf_font_name, db_key_matched, existing, merged, overrides, changed.
patch_doc(doc: fitz.Document, rev_db: dict, *, verbose=False) -> dict[str, int]
Mutates doc in place (writes ToUnicode streams where changed > 0). Returns stats.
extract_all(doc: fitz.Document) -> str
Full-document text with page banners (=== PAGE n ===). Used internally after patching.
The file pdf_cmap_fix/data/reverse_db.json ships with the package (~16 MB on disk as of the build below). It maps normalised font key → { GID string → Unicode string }, built offline from TrueType/OpenType sources using cmap + GSUB type-4 ligature decomposition (see scripts/build_reverse_db.py).
| Property | Value |
|---|---|
| Build date | 2026-04-28 |
| Font entries (keys) | 962 |
| Full key list | docs/font-inventory.md |
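The idea behind the GSUB type-4 step mentioned above: a ligature substitution maps a sequence of component glyphs to one ligature glyph, so reversing it yields the Unicode string a ligature glyph stands for. The sketch below illustrates that idea on plain dicts with made-up glyph names; it is not the code in scripts/build_reverse_db.py, which works on real font tables and GIDs.

```python
def decompose_ligatures(cmap_reverse, ligatures):
    """Resolve ligature glyphs to Unicode strings.

    cmap_reverse: {glyph: Unicode string} for glyphs reachable via cmap
    ligatures:    {ligature glyph: [component glyphs]} (GSUB type 4, reversed)
    """
    resolved = dict(cmap_reverse)
    progress = True
    # Repeat until nothing new resolves, so ligatures built from other
    # ligatures are handled regardless of iteration order.
    while progress:
        progress = False
        for lig, components in ligatures.items():
            if lig in resolved:
                continue
            if all(c in resolved for c in components):
                resolved[lig] = "".join(resolved[c] for c in components)
                progress = True
    return resolved
```

Glyphs whose components never resolve are simply left out of the result instead of being guessed.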
Sources were combined in order; later archives override earlier entries when the normalised font key collides (same stem after lowercasing and stripping non-alphanumeric characters):
- scripts/bodyig.zip: legacy “bodyig”-style corpus bundled with this repo for reproducibility.
- scripts/tibetan-fonts-main.zip: snapshot of the public OpenPecha tibetan-fonts main branch (downloaded as ZIP).
- scripts/tibetan-fonts-private-main.zip: snapshot of the private Tibetan fonts repo main branch (downloaded as ZIP).
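The key normalisation and the "later source wins" rule described above can be sketched as follows. Both function names are hypothetical. The sketch assumes that "alphanumeric" follows Python's str.isalnum and that a colliding key replaces the whole earlier entry; the build script may differ on either point.

```python
from pathlib import PurePosixPath


def normalise_font_key(font_path):
    """Lowercase the file stem and drop non-alphanumeric characters."""
    stem = PurePosixPath(font_path).stem  # ZIP member names use '/'
    return "".join(ch for ch in stem.lower() if ch.isalnum())


def combine_sources(sources):
    """Merge {font path: GID map} dicts in order; later sources override."""
    combined = {}
    for source in sources:
        for font_path, gid_map in source.items():
            combined[normalise_font_key(font_path)] = gid_map
    return combined
```

For example, fonts/Monlam Uni-OuChan2.ttf normalises to monlamuniouchan2, so the same face shipped under slightly different file names in two archives collapses to one key.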
Command used (from repository root):
python scripts/build_reverse_db.py ^
--zip scripts/bodyig.zip ^
--zip scripts/tibetan-fonts-main.zip ^
--zip scripts/tibetan-fonts-private-main.zip ^
-o pdf_cmap_fix/data/reverse_db.json
Why ZIP instead of git clone? On Windows, cloning large font repositories can fail when paths contain characters NTFS rejects (for example :). Reading .ttf / .otf directly from ZIP files avoids extracting those paths to disk and matches how CI or contributors can refresh the database without a full checkout.
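Reading font files straight out of an archive, without extracting anything to disk, can be sketched with the standard zipfile module. The generator name iter_font_bytes is hypothetical and this is not the implementation in scripts/font_sources.py.

```python
import zipfile

FONT_SUFFIXES = (".ttf", ".otf")


def iter_font_bytes(zip_source):
    """Yield (member name, raw bytes) for each .ttf/.otf in a ZIP archive.

    Member names are never turned into filesystem paths, so characters
    that NTFS rejects (such as ':') cause no trouble.
    """
    with zipfile.ZipFile(zip_source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if info.filename.lower().endswith(FONT_SUFFIXES):
                yield info.filename, zf.read(info)
```

Each yielded bytes object can be wrapped in io.BytesIO and handed to fontTools' TTFont, which accepts file-like objects.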
When upstream font repositories add or change faces:
- Download fresh main ZIP archives (or clone on Linux/macOS / WSL if you prefer --fonts-dir).
- Re-run build_reverse_db.py with the same --zip order (or adjust order deliberately if you want a different precedence).
- Replace pdf_cmap_fix/data/reverse_db.json and record the new build date in this README (and optionally in CHANGELOG.md).
- Regression-test on known PDFs (for example under docs/examples/) before tagging a release.
Optional inputs:
pip install fonttools
python scripts/build_reverse_db.py --fonts-dir path/to/fonts -o pdf_cmap_fix/data/reverse_db.json
python scripts/build_reverse_db.py --zip scripts/bodyig.zip --fonts-dir ../more-fonts -o out.json
If you omit --zip and --fonts-dir, the script defaults to scripts/bodyig.zip when that file exists.
See also Rebuild notes in CHANGELOG.md.
| Old (removed) | New (0.2.0) |
|---|---|
| PyPI / import tibetan_pdf_fix | pdf_cmap_fix |
| CLI tibetan-pdf-fix | pdf-cmap-fix |
| extract_tibetan_pdf(...) | extract_pdf_text(...) |
| patch_tibetan_pdf(...) | patch_pdf(...) |
| (new) | build_tounicode_dict(...): merged CMaps as dicts without patching PDF bytes |
| pip install … same git URL | Package name pdf-cmap-fix |
There is no compatibility shim: update imports and the CLI name.
The bundled database covers 962 normalised font keys drawn from the archives above (see docs/font-inventory.md). Only Type0 / CID / Identity-H fonts are handled (PDF character code = original GID in the subset). TrueType simple-encoding PDFs (typical of some Ghostscript workflows) are not supported by this path.
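Because the character code equals the GID under Identity-H, a ToUnicode entry is just a two-byte code paired with the target text in UTF-16BE hex. The sketch below formats one bfchar entry in that standard syntax; the function name is hypothetical, and whether pdf-cmap-fix emits bfchar or bfrange entries is not stated here.

```python
def bfchar_line(gid, text):
    """Format one ToUnicode bfchar entry: <code> <UTF-16BE hex of text>."""
    if not 0 <= gid <= 0xFFFF:
        raise ValueError("two-byte character codes must be in 0..65535")
    src = f"{gid:04X}"
    dst = text.encode("utf-16-be").hex().upper()
    return f"<{src}> <{dst}>"
```

With the database's GID-string keys, a call looks like bfchar_line(int("18"), "\u0f40\u0fb3"), which maps glyph 18 to the two-code-point stack ཀླ.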
- Match each embedded Type0 font name to an entry in reverse_db.json.
- Parse the PDF’s existing ToUnicode CMap.
- Merge: the database replaces entries wherever it has a GID mapping (GSUB-derived mappings are treated as authoritative).
- Optionally write streams back (patch_pdf / extract_pdf_text) or only return dicts (build_tounicode_dict).
Details: docs/approach.md.
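The merge step can be sketched on plain dicts keyed by GID string. This is an illustration of the rule "the database wins wherever it has a mapping", not the package's implementation. In this sketch, overrides means every GID whose value was added or changed, which may not match the tool's exact definition.

```python
def merge_tounicode(existing, db_map):
    """Merge a font's database mappings over its existing ToUnicode map.

    existing: {GID string: Unicode string} parsed from the PDF
    db_map:   {GID string: Unicode string} from the reverse database
    Returns (merged, overrides).
    """
    merged = dict(existing)
    overrides = {}
    for gid, text in db_map.items():
        if existing.get(gid) != text:
            overrides[gid] = text
        merged[gid] = text  # database entries are treated as authoritative
    return merged, overrides
```

Entries the database does not know about are kept as they were, so a font with partial coverage still benefits without losing its existing mappings.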
| Before (wrong) | After (correct) |
|---|---|
| ཀོང་ཡངས་རོལ་བའི་རྣལ་འབོར་པ་ | ཀློང་ཡངས་རོལ་བའི་རྣལ་འབྱོར་པ་ |
| རོ་རེའི་སེ་ཕེང་ | རྡོ་རྗེའི་སྐྱེ་ཕྲེང་ |
Outputs: docs/examples/TI1751-01-001/
| Before (wrong) | After (correct) |
|---|---|
| བྗོད་གངས་ཅན་ | བོད་གངས་ཅན་ |
| ཐྗོས་བསམ་སྗོམ་གསུམ་ | ཐོས་བསམ་སྒོམ་གསུམ་ |
Outputs: docs/examples/TI1055-01-001/
The pipeline is not Tibetan-specific: any Identity-H Type0 font whose glyph IDs align with a font used to build reverse_db.json can be fixed the same way. For a minimal Latin test, build a tiny database from a font with fi/fl ligatures and validate non-empty overrides on a deliberately broken PDF.
pdf_cmap_fix/ Python package
├── extractor.py Patch / extract / build_tounicode_dict / CLI
└── data/
└── reverse_db.json GID → Unicode (bundled; regenerate via scripts)
scripts/
├── font_sources.py Zip + directory font enumeration
├── build_reverse_db.py Rebuild reverse_db.json (Windows-safe UTF-8 logging)
└── build_glyph_db.py Deprecated — use build_reverse_db.py
docs/
├── README.md Documentation index
├── glossary-and-json.md Terms + JSON shapes
├── font-inventory.md All 962 bundled font keys
├── approach.md Design / pipeline
├── blog.md Draft / notes
└── examples/ Example PDFs and outputs
tests/ pytest (optional [dev] install)
MIT