A document-parsing framework powered by Vision Language Models (VLMs) and OCR.
> **Warning**
> This project is still under active development and is not ready for production environments. The API, code structure, and behavior may change at any time without prior notice. Use only in development or experimental environments.
DocVision Parser is a Python library for extracting high-quality structured text and markdown from documents (images and PDFs). It combines PaddleOCR ONNX for fast, offline text extraction with the reasoning power of Vision Language Models (GPT-4o, Claude, Llama, etc.).
Three parsing modes:
| Mode | Best For | Requires |
|---|---|---|
| BASIC_OCR | Fast offline extraction, no GPU needed | — |
| VLM | Complex layouts, handwriting, mixed content | VLM API key |
| AGENTIC | Long documents, dense tables, self-correcting | VLM API key |
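As a rough rule of thumb (not library code), the table boils down to: without a VLM API key, use BASIC_OCR; with one, use AGENTIC for long or dense documents and VLM otherwise. A hypothetical `pick_mode` helper:

```python
# Hypothetical helper summarizing the mode table above; not part of DocVision.
def pick_mode(has_api_key: bool, long_or_dense: bool = False) -> str:
    if not has_api_key:
        return "BASIC_OCR"  # fast, offline, no GPU needed
    return "AGENTIC" if long_or_dense else "VLM"
```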
- **BASIC_OCR mode** — PaddleOCR ONNX via RapidOCR; models are auto-downloaded from HuggingFace on first use. No PyTorch, no GPU required.
- **Dual preprocessing pipeline** — `preprocess_for_ocr()` (CLAHE, deskew, DPI normalization) and `preprocess_for_vlm()` (adaptive resize, rotation, crop) are now separate, optimized pipelines.
- **Agentic reflect pattern** — a critic/refiner pair replaces the old repetition-detection loop. The critic uses Pydantic structured output for reliable evaluation.
- **Multi-language OCR** — English, Latin (ID/FR/DE/ES), Chinese, Korean, Arabic, Hindi, Tamil, Telugu.
- **Breaking:** `ParsingMode.PDF` renamed to `ParsingMode.BASIC_OCR`.
- **Breaking:** `process_image()` replaced by `preprocess_for_ocr()` / `preprocess_for_vlm()`.
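For intuition, the adaptive-resize step of the VLM pipeline can be sketched as capping the longest image side (here at the `post_crop_max_size` default of 1024) while preserving aspect ratio. This is an illustrative sketch of the assumed behavior, not the actual `preprocess_for_vlm()` implementation:

```python
def adaptive_resize_dims(width: int, height: int, max_size: int = 1024) -> tuple[int, int]:
    """Illustrative sketch: cap the longest side at max_size, keeping
    the aspect ratio. The real preprocess_for_vlm() may differ."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough; leave untouched
    scale = max_size / longest
    return round(width * scale), round(height * scale)
```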
```bash
pip install docvision
```

Or using uv (recommended):

```bash
uv add docvision
```

Note: OCR models (~100MB) are downloaded automatically to `~/.cache/docvision/models/` on first use.
```python
import asyncio
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        ocr_language="english",  # or "latin" for Indonesian/European
    )

    # Parse a single image
    result = await parser.parse_image("document.jpg", parsing_mode=ParsingMode.BASIC_OCR)
    print(result.content)

    # Parse a PDF
    results = await parser.parse_pdf("report.pdf", parsing_mode=ParsingMode.BASIC_OCR)
    for page in results:
        print(f"Page {page.metadata['page_number']}:\n{page.content}")

asyncio.run(main())
```

VLM mode:

```python
from docvision import DocumentParser, ParsingMode

async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o-mini",
        api_key="your_api_key",
    )
    result = await parser.parse_image("scanned.jpg", parsing_mode=ParsingMode.VLM)
    print(result.content)
```

Agentic mode:

```python
async def main():
    parser = DocumentParser(
        base_url="https://api.openai.com/v1",
        model_name="gpt-4o",
        api_key="your_api_key",
        max_reflect_cycles=2,  # critic→refine cycles per page (default: 2, max recommended: 2)
    )
    results = await parser.parse_pdf(
        "dense_report.pdf",
        parsing_mode=ParsingMode.AGENTIC,
        start_page=1,
        end_page=10,
    )
    for page in results:
        print(f"Page {page.metadata['page_number']} "
              f"(critic score: {page.metadata['final_critic_score']}):\n"
              f"{page.content}")
```

Extract data directly into Pydantic models using VLM mode.
```python
from pydantic import BaseModel
from typing import List

class LineItem(BaseModel):
    description: str
    quantity: int
    price: float

class Invoice(BaseModel):
    invoice_no: str
    total: float
    items: List[LineItem]

parser = DocumentParser(
    base_url="...",
    model_name="gpt-4o",
    api_key="...",
    system_prompt="Extract all invoice fields accurately.",
)

result = await parser.parse_image("invoice.png", output_schema=Invoice)

# result.content is a JSON string of the validated Invoice
print(result.content)
```

OCR language configuration:

```python
# Indonesian, French, German, Spanish, etc. → use "latin"
parser = DocumentParser(ocr_language="latin")

# Chinese, Korean, Arabic, Hindi, Tamil, Telugu
parser = DocumentParser(ocr_language="chinese")

# Custom model directory (skip auto-download)
parser = DocumentParser(
    ocr_language="english",
    ocr_model_dir="/path/to/models",
)
```

Saving results:

```python
# Save as Markdown
await parser.parse_pdf("input.pdf", save_path="output/result.md")

# Save as JSON
await parser.parse_pdf("input.pdf", save_path="output/result.json")

# Save to directory (auto-creates output.json inside)
await parser.parse_pdf("input.pdf", save_path="output/")
```

Full configuration:

```python
parser = DocumentParser(
    # VLM config (required for VLM and AGENTIC modes)
    base_url="https://api.openai.com/v1",
    model_name="gpt-4o",
    api_key="your_key",
    temperature=0.7,
    max_tokens=4096,
    system_prompt=None,
    # Agentic config
    max_reflect_cycles=2,     # values > 2 emit UserWarning
    # OCR config (for BASIC_OCR mode)
    ocr_language="english",   # see supported languages below
    ocr_model_dir=None,       # None = auto-download to ~/.cache/docvision/
    # Image processing
    enable_crop=True,         # crop image to content
    enable_rotate=True,       # auto-correct orientation
    enable_deskew=True,       # correct small skew angles (OCR mode)
    dpi=300,                  # PDF render DPI multiplier
    post_crop_max_size=1024,  # max image dimension for VLM input
    max_concurrency=5,        # max concurrent pages
    debug_dir=None,           # save debug images here
)
```

Supported OCR languages:

| Value | Covers |
|---|---|
| `"english"` | English |
| `"latin"` | Indonesian, French, German, Spanish, Portuguese, and other Latin-script languages |
| `"chinese"` | Simplified + Traditional Chinese |
| `"korean"` | Korean |
| `"arabic"` | Arabic |
| `"hindi"` | Hindi (Devanagari) |
| `"tamil"` | Tamil |
| `"telugu"` | Telugu |
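Many locales share one OCR pack. As an illustrative convenience (derived only from the table above, not part of the DocVision API), ISO 639-1 codes can be normalized to the supported `ocr_language` values:

```python
# Hypothetical mapping from ISO 639-1 codes to the ocr_language values
# in the table above; illustrative, not part of DocVision.
_BY_ISO = {
    "en": "english",
    "id": "latin", "fr": "latin", "de": "latin", "es": "latin", "pt": "latin",
    "zh": "chinese", "ko": "korean", "ar": "arabic",
    "hi": "hindi", "ta": "tamil", "te": "telugu",
}

def ocr_language_for(iso_code: str) -> str:
    try:
        return _BY_ISO[iso_code.lower()]
    except KeyError:
        raise ValueError(f"No OCR language pack listed for {iso_code!r}") from None
```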
DocumentParser
├── VLMClient — async OpenAI-compatible API
├── OCREngine — PaddleOCR ONNX via RapidOCR, HuggingFace
├── ImageProcessor
│ ├── preprocess_for_ocr() — deskew, DPI normalization, CLAHE contrast
│ └── preprocess_for_vlm() — adaptive resize
└── AgenticWorkflow (LangGraph)
├── generate — initial VLM parse
├── critic — structural evaluation via Pydantic structured output
├── refine — targeted fix based on critic issues
└── complete — terminal node
Agentic reflect loop:
generate → critic ──(score ≥ 8 or max cycles)──→ complete → END
              └──(score < 8)──→ refine → critic (loop)
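In plain Python, the reflect loop above amounts to the sketch below. Here `generate`, `critic`, and `refine` are stand-ins for the real LangGraph nodes, and the threshold of 8 plus the two-cycle cap mirror the diagram and the `max_reflect_cycles` default:

```python
def reflect_loop(generate, critic, refine, max_reflect_cycles=2, threshold=8):
    """Sketch of the agentic loop: generate once, then critique and refine
    until the critic score reaches the threshold or the cycle budget runs out."""
    draft = generate()
    score = critic(draft)
    for _ in range(max_reflect_cycles):
        if score >= threshold:
            break
        draft = refine(draft, score)
        score = critic(draft)
    return draft, score
```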
```bash
# Setup
uv sync --dev

# Run tests
make test

# Lint & format
make lint
make format
```

Apache 2.0. See LICENSE for details.
Fahmi Aziz Fadhil
- GitHub: @fahmiaziz98
- Email: fahmiazizfadhil09@gmail.com