pdf-cmap-fix

Fix missing or wrong PDF /ToUnicode CMap entries so text extraction and copy-paste match what you see on the page. The primary use case is Tibetan stacked syllables (Monlam, Himalaya, Jomolhari): producers often embed incomplete or incorrect Unicode mappings for ligature glyphs. The same mechanism applies to any Type0 / Identity-H font present in the bundled GSUB-derived database.

GitHub: OpenPecha/pdf-cmap-fix

Documentation: docs/README.md · Glossary & JSON formats · Approach · Font inventory (962 keys)

Installation

Requires Python 3.8+, PyMuPDF (fitz), and fontTools (declared in pyproject.toml).

From Git (recommended for latest `main`)

Install the package directly from this repository:

pip install "pdf-cmap-fix @ git+https://github.com/OpenPecha/pdf-cmap-fix.git"

Equivalent shorthand:

pip install git+https://github.com/OpenPecha/pdf-cmap-fix.git

Editable checkout for development:

git clone https://github.com/OpenPecha/pdf-cmap-fix.git
cd pdf-cmap-fix
pip install -e ".[dev]"

Run tests:

pytest

After install, the CLI pdf-cmap-fix is on your PATH. The bundled database ships inside the wheel/sdist as package data (pdf_cmap_fix/data/reverse_db.json).

Quick start (CLI)

pdf-cmap-fix document.pdf
# writes: document.raw.txt  document.patched.txt  document.diff.txt

pdf-cmap-fix doc1.pdf doc2.pdf doc3.pdf

Patched PDF only (same ToUnicode logic; does not overwrite the input):

pdf-cmap-fix --patch-pdf document.pdf
# writes: document.patched.pdf
pdf-cmap-fix -p doc1.pdf doc2.pdf   # short form

Dump merged ToUnicode data as JSON (does not modify the PDF):

pdf-cmap-fix --dump-cmap cmap.json document.pdf
# multiple PDFs → cmap_<stem>.json per file

Large PDFs with many Type0 font objects can make --dump-cmap slow and the JSON huge; prefer build_tounicode_dict in Python if you need to filter by font name or xref.

On Windows, the default CLI prints Tibetan previews using the console encoding; if you see UnicodeEncodeError, switch the terminal to UTF-8 (for example chcp 65001) or call extract_pdf_text(..., verbose=False) from Python so nothing is printed to the console.

Python API reference

Import the public API from pdf_cmap_fix:

from pdf_cmap_fix import (
    extract_pdf_text,
    patch_pdf,
    build_tounicode_dict,
    collect_font_merges,
    patch_doc,
    extract_all,
)

Optional: load a custom JSON database with json.loads(...) and pass as rev_db= where supported.

Function	Purpose
`extract_pdf_text`	Opens the PDF twice: extract raw text, then patch ToUnicode in memory and extract again. Can write `.raw.txt`, `.patched.txt`, `.diff.txt`.
`patch_pdf`	Applies merged ToUnicode streams and returns bytes (and optionally writes `*.patched.pdf`).
`build_tounicode_dict`	No PDF mutation: returns per-font `existing` / `merged` / `overrides` plus `stats`.
`collect_font_merges`	Lower-level: scan the document and compute merge records without writing streams.
`patch_doc`	Apply merges to an already-open `fitz.Document` using `collect_font_merges` + stream updates.
`extract_all`	Extract plain text from every page (with whitespace/ligature flags); used inside `extract_pdf_text`.

`extract_pdf_text`

extract_pdf_text(
    pdf_path,
    output_dir=None,
    write_files=True,
    rev_db=None,
    *,
    verbose=False,
) -> dict

Return key	Type	Description
`raw`	`str`	Text extracted before patching.
`patched`	`str`	Text extracted after ToUnicode merge.
`stats`	`dict`	`fonts_seen`, `patched`, `upgrades`, `no_change`, `no_match`.
`diff_lines`	`list`	Line indices and raw/patched pairs where lines differ.
`char_delta`	`int`	`len(patched) - len(raw)`.

If write_files is true (default), writes {stem}.raw.txt, {stem}.patched.txt, {stem}.diff.txt next to the PDF (or under output_dir).

`patch_pdf`

patch_pdf(
    pdf_path,
    output_path=None,
    write_file=True,
    rev_db=None,
    *,
    verbose=False,
) -> dict

Return key	Description
`pdf_bytes`	Patched PDF as `bytes`.
`stats`	Same counters as above.
`output_path`	`Path` where the file was written, or `None` if `write_file=False`.

Default output path: {stem}.patched.pdf beside the input.

`build_tounicode_dict`

build_tounicode_dict(pdf_path, rev_db=None) -> dict

Returns fonts (list of per-font records), by_font_xref (dict keyed by xref string), and stats. See docs/glossary-and-json.md for field-level documentation.

`collect_font_merges`

collect_font_merges(doc: fitz.Document, rev_db: dict, *, verbose=False)
-> tuple[list[dict], dict]

Returns (records, stats). Each record includes font_xref, to_unicode_xref, pdf_font_name, db_key_matched, existing, merged, overrides, changed.

`patch_doc`

patch_doc(doc: fitz.Document, rev_db: dict, *, verbose=False) -> dict[str, int]

Mutates doc in place (writes ToUnicode streams where changed > 0). Returns stats.

`extract_all`

extract_all(doc: fitz.Document) -> str

Full-document text with page banners (=== PAGE n ===). Used internally after patching.

Bundled reverse database (font sources)

The file pdf_cmap_fix/data/reverse_db.json ships with the package (~16 MB on disk as of the build below). It maps normalised font key → { GID string → Unicode string }, built offline from TrueType/OpenType sources using cmap + GSUB type-4 ligature decomposition (see scripts/build_reverse_db.py).

Property	Value
Build date	2026-04-28
Font entries (keys)	962
Full key list	docs/font-inventory.md

How this copy was produced

Sources were combined in order; later archives override earlier entries when the normalised font key collides (same stem after lowercasing and stripping non-alphanumeric characters):

scripts/bodyig.zip — legacy “bodyig”-style corpus bundled with this repo for reproducibility.
scripts/tibetan-fonts-main.zip — snapshot of the public OpenPecha tibetan-fonts main branch (downloaded as ZIP).
scripts/tibetan-fonts-private-main.zip — snapshot of the private Tibetan fonts repo main branch (downloaded as ZIP).

Command used (from repository root):

python scripts/build_reverse_db.py ^
  --zip scripts/bodyig.zip ^
  --zip scripts/tibetan-fonts-main.zip ^
  --zip scripts/tibetan-fonts-private-main.zip ^
  -o pdf_cmap_fix/data/reverse_db.json

Why ZIP instead of git clone? On Windows, cloning large font repositories can fail when paths contain characters NTFS rejects (for example :). Reading .ttf / .otf directly from ZIP files avoids extracting those paths to disk and matches how CI or contributors can refresh the database without a full checkout.

Updating `reverse_db.json` in the future

When upstream font repositories add or change faces:

Download fresh main ZIP archives (or clone on Linux/macOS / WSL if you prefer --fonts-dir).
Re-run build_reverse_db.py with the same --zip order (or adjust order deliberately if you want a different precedence).
Replace pdf_cmap_fix/data/reverse_db.json and record the new build date in this README (and optionally in CHANGELOG.md).
Regression-test on known PDFs (for example under docs/examples/) before tagging a release.

Optional inputs:

pip install fonttools
python scripts/build_reverse_db.py --fonts-dir path/to/fonts -o pdf_cmap_fix/data/reverse_db.json
python scripts/build_reverse_db.py --zip scripts/bodyig.zip --fonts-dir ../more-fonts -o out.json

If you omit --zip and --fonts-dir, the script defaults to scripts/bodyig.zip when that file exists.

See also Rebuild notes in CHANGELOG.md.

Migration from `tibetan-pdf-fix` (0.1.x)

Old (removed)	New (0.2.0)
PyPI / import `tibetan_pdf_fix`	`pdf_cmap_fix`
CLI `tibetan-pdf-fix`	`pdf-cmap-fix`
`extract_tibetan_pdf(...)`	`extract_pdf_text(...)`
`patch_tibetan_pdf(...)`	`patch_pdf(...)`
(new)	`build_tounicode_dict(...)` — merged CMaps as dicts without patching PDF bytes
`pip install …` same git URL	Package name `pdf-cmap-fix`

There is no compatibility shim: update imports and the CLI name.

Supported fonts & limits

The bundled database covers 962 normalised font keys drawn from the archives above (see docs/font-inventory.md). Only Type0 / CID / Identity-H fonts are handled (PDF character code = original GID in the subset). TrueType simple-encoding PDFs (typical of some Ghostscript workflows) are not supported by this path.

How it works

Match each embedded Type0 font name to an entry in reverse_db.json.
Parse the PDF’s existing ToUnicode CMap.
Merge: the database replaces entries wherever it has a GID mapping (GSUB-derived mappings are treated as authoritative).
Optionally write streams back (patch_pdf / extract_pdf_text) or only return dicts (build_tounicode_dict).

Details: docs/approach.md.

Example results

TI1751-01-001.pdf — InDesign PDF, 528 pages

Before (wrong)	After (correct)
`ཀོང་ཡངས་རོལ་བའི་རྣལ་འབོར་པ་`	`ཀློང་ཡངས་རོལ་བའི་རྣལ་འབྱོར་པ་`
`རོ་རེའི་སེ་ཕེང་`	`རྡོ་རྗེའི་སྐྱེ་ཕྲེང་`

Outputs: docs/examples/TI1751-01-001/

TI1055-01-001.pdf — Word PDF, 528 pages

Before (wrong)	After (correct)
`བྗོད་གངས་ཅན་`	`བོད་གངས་ཅན་`
`ཐྗོས་བསམ་སྗོམ་གསུམ་`	`ཐོས་བསམ་སྒོམ་གསུམ་`

Outputs: docs/examples/TI1055-01-001/

Beyond Tibetan (smoke test)

The pipeline is not Tibetan-specific: any Identity-H Type0 font whose glyph IDs align with a font used to build reverse_db.json can be fixed the same way. For a minimal Latin test, build a tiny database from a font with fi/fl ligatures and validate non-empty overrides on a deliberately broken PDF.

Project structure

pdf_cmap_fix/              Python package
├── extractor.py           Patch / extract / build_tounicode_dict / CLI
└── data/
    └── reverse_db.json    GID → Unicode (bundled; regenerate via scripts)
scripts/
├── font_sources.py        Zip + directory font enumeration
├── build_reverse_db.py    Rebuild reverse_db.json (Windows-safe UTF-8 logging)
└── build_glyph_db.py      Deprecated — use build_reverse_db.py
docs/
├── README.md              Documentation index
├── glossary-and-json.md   Terms + JSON shapes
├── font-inventory.md      All 962 bundled font keys
├── approach.md            Design / pipeline
├── blog.md                Draft / notes
└── examples/              Example PDFs and outputs
tests/                     pytest (optional [dev] install)

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pdf-cmap-fix

Table of contents

Installation

From Git (recommended for latest `main`)

Quick start (CLI)

Python API reference

`extract_pdf_text`

`patch_pdf`

`build_tounicode_dict`

`collect_font_merges`

`patch_doc`

`extract_all`

Bundled reverse database (font sources)

How this copy was produced

Updating `reverse_db.json` in the future

Migration from `tibetan-pdf-fix` (0.1.x)

Supported fonts & limits

How it works

Example results

TI1751-01-001.pdf — InDesign PDF, 528 pages

TI1055-01-001.pdf — Word PDF, 528 pages

Beyond Tibetan (smoke test)

Project structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
docs		docs
pdf_cmap_fix		pdf_cmap_fix
scripts		scripts
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

pdf-cmap-fix

Table of contents

Installation

From Git (recommended for latest main)

Quick start (CLI)

Python API reference

extract_pdf_text

patch_pdf

build_tounicode_dict

collect_font_merges

patch_doc

extract_all

Bundled reverse database (font sources)

How this copy was produced

Updating reverse_db.json in the future

Migration from tibetan-pdf-fix (0.1.x)

Supported fonts & limits

How it works

Example results

TI1751-01-001.pdf — InDesign PDF, 528 pages

TI1055-01-001.pdf — Word PDF, 528 pages

Beyond Tibetan (smoke test)

Project structure

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

From Git (recommended for latest `main`)

`extract_pdf_text`

`patch_pdf`

`build_tounicode_dict`

`collect_font_merges`

`patch_doc`

`extract_all`

Updating `reverse_db.json` in the future

Migration from `tibetan-pdf-fix` (0.1.x)

Packages