Reading long, visually rich PDFs is hard for both humans and machines. Heading hierarchies, repeated headers/footers, and tables break naïve line-by-line extraction, and traditional approaches frequently rely only on font size, a fragile signal that fails on scanned, multi-column, or stylistically diverse documents.
Our goal: convert arbitrary PDFs into a clean JSON schema

{
  "title": "…",
  "outline": [ {"level": "H1|H2|H3", "text": "…", "page": n}, … ]
}

robustly, without depending solely on typography.
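As a quick sanity check on the schema above, a small validator can catch malformed output early. This is a minimal sketch, not part of the project: the name `validate_outline` is hypothetical, and it assumes `page` is a non-negative integer (the source does not say whether pages are 0- or 1-based).

```python
ALLOWED_LEVELS = {"H1", "H2", "H3"}


def validate_outline(doc):
    """Return a list of problems; an empty list means `doc` fits the schema."""
    problems = []
    if not isinstance(doc.get("title"), str):
        problems.append("title must be a string")
    outline = doc.get("outline")
    if not isinstance(outline, list):
        return problems + ["outline must be a list"]
    for i, entry in enumerate(outline):
        if not isinstance(entry, dict):
            problems.append(f"outline[{i}] must be an object")
            continue
        if entry.get("level") not in ALLOWED_LEVELS:
            problems.append(f"outline[{i}].level must be one of H1, H2, H3")
        text = entry.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"outline[{i}].text must be a non-empty string")
        page = entry.get("page")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(page, int) or isinstance(page, bool) or page < 0:
            problems.append(f"outline[{i}].page must be a non-negative integer")
    return problems


good = {"title": "Annual Report",
        "outline": [{"level": "H1", "text": "Overview", "page": 1}]}
bad = {"title": "Annual Report",
       "outline": [{"level": "H4", "text": "", "page": "2"}]}
print(validate_outline(good))
print(validate_outline(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to review a whole batch of output files in one pass.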
- Low-Level Parsing – PyMuPDF (`fitz`) gives us every span/line/block with positions, size, font face, bold/italic flags, etc.
- Semantic Grouping – we merge spans into logical lines, then into blocks, while tracking the geometric height of each line (`bbox`).
- Multi-Signal Heading Detection
  - Height histogram → candidate header classes `h1‥hN`
  - Boldness and position heuristics break ties.
  - Title chosen by a combination of page, y-coordinate (< 60 % page height) and boldness – not raw size.
- Noise Removal
  - Repeated headers / footers detected by (text, page-frequency, y-zone) statistics.
  - Table areas located with span-count heuristics and purged.
  - Single-char, digits-only, punctuation-only blocks discarded.
- Content Grounding – we re-map each heading back to the closest full-text chunk produced by our Markdown converter (`pymupdf_rag_original.py`) to guarantee semantic completeness – removing residual font-size dependence.
- JSON Translation – height classes (`title`, `h1`, `h2`…) → schema levels (`H1`, `H2`, `H3`).
- Batch Utility – `process_pdfs_directory` runs the full pipeline over a folder, writing results to `output_jsons/`.
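The height-histogram step in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the function name `assign_height_classes` is hypothetical, the most populated bin is assumed to be body text, and every non-empty taller bin becomes a header class (tallest = `h1`). It does not model the separate `title` class.

```python
import numpy as np


def assign_height_classes(heights, bins=10):
    """Label each line height as "body" or "h1".."hN" (h1 = tallest).

    `bins` must be an integer number of equal-width histogram bins.
    """
    heights = np.asarray(heights, dtype=float)
    if heights.size == 0:
        return []
    counts, edges = np.histogram(heights, bins=bins)
    # Map every height to its bin index (0 .. bins-1) using the inner edges,
    # so the maximum value lands in the last bin, as np.histogram does.
    bin_idx = np.digitize(heights, edges[1:-1]).tolist()
    # Assumption: the most frequent height bin is body text.
    body_bin = int(np.argmax(counts))
    # Non-empty bins taller than body, ordered from tallest to shortest.
    header_bins = [b for b in range(bins - 1, body_bin, -1) if counts[b] > 0]
    rank = {b: i + 1 for i, b in enumerate(header_bins)}
    return [f"h{rank[b]}" if b in rank else "body" for b in bin_idx]


# 20 body lines, three medium headings, one large heading.
sample = [10.0] * 20 + [14.0] * 3 + [18.0]
labels = assign_height_classes(sample, bins=4)
print(labels[0], labels[20], labels[23])
```

Fewer bins merge nearby heights into one class; more bins split them, which is why the bin count is worth exposing as a tunable.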
Font size correlates with hierarchy but is never sufficient. We explicitly counter its pitfalls:
- Boldness Check – If an `h2` is bold and sits above a non-bold `h1`, we may promote it.
- Positional Rules – Headers are expected near the top of the page; the same font appearing mid-page is down-ranked.
- Repetition Filter – Constant font-size strings on every page are headers/footers, not document headings.
- Table & Graphic Zones – Tables often use body font size yet must be ignored; geometry-based table detection sidesteps size.
- Chunk Remapping – Final text is replaced by full paragraph chunks that contain all heading tokens irrespective of styling.
These layered checks let us survive scanned PDFs (uniform OCR font), multi-column brochures and design-heavy documents.
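To make the boldness check concrete, here is one possible reading of it as code. It is a sketch under assumptions: the `Block` fields are hypothetical, "sits above" is interpreted as "same page, smaller y-coordinate", and promotion is unconditional here, whereas the pipeline only says it *may* promote.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    text: str
    level: str   # height class such as "h1" or "h2"
    bold: bool
    page: int
    y: float     # top coordinate; smaller means higher on the page


def promote_bold_h2(blocks):
    """Promote a bold h2 to h1 when a non-bold h1 lies below it on the same page."""
    result = list(blocks)
    for i, blk in enumerate(blocks):
        if blk.level != "h2" or not blk.bold:
            continue
        has_weaker_h1_below = any(
            other.page == blk.page
            and other.level == "h1"
            and not other.bold
            and other.y > blk.y
            for other in blocks
        )
        if has_weaker_h1_below:
            result[i] = replace(blk, level="h1")
    return result


blocks = [
    Block("Introduction", "h2", True, 1, 100.0),
    Block("a large pull quote", "h1", False, 1, 200.0),
    Block("Background", "h2", False, 1, 300.0),
]
promoted = promote_bold_h2(blocks)
print([b.level for b in promoted])
```

Comparisons run against the original list, so one promotion never influences another within the same pass.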
graph TD;
A[PDF] --> B(Extract lines & spans) --> C{Height histogram};
C -->|merge| D[Blocks with h-classes];
D --> E[Filter tables];
E --> F[Remove repeating header/footer];
F --> G(Title & heading logic);
G --> H[Clean & custom filters];
H --> I[Map to Markdown chunks];
I --> J[Schema JSON];
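The "remove repeating header/footer" stage in the flow above relies on (text, page-frequency, y-zone) statistics. A minimal sketch of that idea follows; the function name `drop_repeated_margin_text`, the block dictionary keys, and the `zone` parameter are assumptions for illustration, and `min_repeat_ratio` borrows the name of the project's tunable without claiming to match its exact semantics.

```python
from collections import defaultdict


def drop_repeated_margin_text(blocks, page_count, min_repeat_ratio=0.5, zone=0.1):
    """Drop text that recurs in the top or bottom margin of many pages.

    Each block is a dict with "text", "page" and "y_rel", where y_rel is the
    block's vertical position as a fraction of page height (0 = top).
    """
    if page_count <= 0:
        return list(blocks)

    def margin_key(block):
        y = block["y_rel"]
        if y <= zone:
            where = "top"
        elif y >= 1 - zone:
            where = "bottom"
        else:
            return None  # mid-page text is never treated as header/footer
        return (block["text"].strip().lower(), where)

    # Collect the set of pages on which each (text, zone) pair appears.
    pages_seen = defaultdict(set)
    for block in blocks:
        key = margin_key(block)
        if key is not None:
            pages_seen[key].add(block["page"])

    repeated = {
        key for key, pages in pages_seen.items()
        if len(pages) / page_count >= min_repeat_ratio
    }
    return [b for b in blocks if margin_key(b) not in repeated]


blocks = [{"text": "ACME Confidential", "page": p, "y_rel": 0.03} for p in range(1, 5)]
blocks.append({"text": "Introduction", "page": 1, "y_rel": 0.05})
blocks.append({"text": "ACME Confidential", "page": 2, "y_rel": 0.50})
kept = drop_repeated_margin_text(blocks, page_count=4)
print([b["text"] for b in kept])
```

Counting distinct pages rather than raw occurrences keeps a string that appears several times on a single page from being mistaken for a running header.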
- `main_pipeline_pdfs.py` – orchestrates the end-to-end PDF → JSON conversion.
- `helpers/get_text_lines.py` – natural line reconstruction.
- `multi_column.py` – multi-column detection, returns ordered text boxes.
- `progress.py` – lightweight CLI progress bar.
- `pymupdf_rag_original.py` – battle-tested Markdown (RAG-friendly) extractor.
- `to_chunk_json.py` – utility to pre-chunk a PDF into clean text segments.
- `output_jsons/` – sample results.
- `graph1.png`, `graph2.png` – visual aids (placeholders above).
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt # see below

PyMuPDF>=1.25.5
numpy
The project includes a Dockerfile for containerized execution. This is the recommended approach for consistent environments.
- Docker Desktop installed and running
- PDF files placed in the `input/` directory
From the project root directory:
# Build the Docker image
docker build -t pdf-pipeline .
# Run the container (Windows PowerShell)
docker run --rm `
--mount type=bind,source="${PWD}\input",target=/app/input `
--mount type=bind,source="${PWD}\output",target=/app/output `
pdf-pipeline

# Build and run in one line (Windows PowerShell)
docker build -t pdf-pipeline .; docker run --rm --mount type=bind,source="${PWD}\input",target=/app/input --mount type=bind,source="${PWD}\output",target=/app/output pdf-pipeline

# Build (Linux/macOS)
docker build -t pdf-pipeline .
# Run
docker run --rm \
-v "$(pwd)/input:/app/input" \
-v "$(pwd)/output:/app/output" \
pdf-pipeline

- The container processes all PDF files from the mounted `input/` directory.
- JSON output files are written to the mounted `output/` directory.
- The container is automatically removed after completion (`--rm` flag).
from main_pipeline_pdfs import process_pdfs_directory
input_dir = "./sample_dataset/pdfs"
output_dir = "./output_jsons"
process_pdfs_directory(input_dir, output_dir)

The convenience CLI is commented out in `main_pipeline_pdfs.py` – simply uncomment it and run.
- Bins for height histogram – change `bins` in `process_pdf_to_json`.
- Repeating header ratio – tune `min_repeat_ratio` in `remove_repeated_headers_footers`.
- Table span thresholds – adjust `span_threshold`, `line_threshold`.
- Custom levels – map additional `hN` to schema by editing `translate_to_schema_format`.
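For the custom-levels option, the mapping from height classes to schema levels might look something like the sketch below. It does not reproduce `translate_to_schema_format`; the name `translate_levels`, the block keys (`cls`, `text`, `page`), and the choice to drop unmapped classes are all assumptions made for illustration.

```python
DEFAULT_LEVEL_MAP = {"h1": "H1", "h2": "H2", "h3": "H3"}


def translate_levels(blocks, level_map=None):
    """Turn classified blocks into the {"title", "outline"} schema."""
    level_map = DEFAULT_LEVEL_MAP if level_map is None else level_map
    title = ""
    outline = []
    for block in blocks:
        cls = block["cls"]
        if cls == "title":
            # Keep only the first block classified as the title.
            if not title:
                title = block["text"]
        elif cls in level_map:
            outline.append({
                "level": level_map[cls],
                "text": block["text"],
                "page": block["page"],
            })
        # Any other class (for example body text) is left out of the outline.
    return {"title": title, "outline": outline}


classified = [
    {"cls": "title", "text": "Annual Report", "page": 1},
    {"cls": "h1", "text": "Overview", "page": 1},
    {"cls": "h4", "text": "Footnotes", "page": 3},
]
default_doc = translate_levels(classified)
# Hypothetical customization: fold h4 into the deepest schema level.
custom_doc = translate_levels(classified, {**DEFAULT_LEVEL_MAP, "h4": "H3"})
print(len(default_doc["outline"]), len(custom_doc["outline"]))
```

Keeping the mapping in a plain dictionary means adding a level is a one-line change and does not touch the traversal logic.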
See output_jsons/ for five real-world PDFs – each produced with zero manual cleaning.

