langchain-kreuzberg is a LangChain document loader that wraps Kreuzberg's extraction API. It supports 75+ file formats out of the box, provides true async extraction powered by Rust's tokio runtime, and produces LangChain Document objects enriched with metadata such as detected languages, quality scores, and extracted keywords.

    pip install langchain-kreuzberg

Requires Python 3.10+.

    from langchain_kreuzberg import KreuzbergLoader

    loader = KreuzbergLoader(file_path="report.pdf")
    docs = loader.load()
    print(docs[0].page_content[:200])
    print(docs[0].metadata["source"])

- 75+ file formats -- PDF, DOCX, PPTX, XLSX, images, HTML, Markdown, plain text, and many more
- True async -- native async extraction backed by Rust's tokio runtime; no thread-pool workarounds
- Rich metadata -- title, author, page count, detected languages, quality score, extracted keywords, and more
- OCR with 3 backends -- Tesseract, EasyOCR, and PaddleOCR with configurable language support
- Per-page splitting -- yield one `Document` per page for fine-grained RAG pipelines
- Bytes input -- load documents directly from raw bytes (e.g., API responses, S3 objects)
- Output format selection -- choose between plain text, Markdown, Djot, HTML, or structured output

Load a single file:

    from langchain_kreuzberg import KreuzbergLoader

    loader = KreuzbergLoader(file_path="contract.pdf")
    docs = loader.load()

Load multiple files:

    from langchain_kreuzberg import KreuzbergLoader

    loader = KreuzbergLoader(
        file_path=["report.pdf", "notes.docx", "data.xlsx"],
    )
    docs = loader.load()

Force OCR on a scanned document:

    from kreuzberg import ExtractionConfig, OcrConfig
    from langchain_kreuzberg import KreuzbergLoader

    config = ExtractionConfig(
        force_ocr=True,
        ocr=OcrConfig(backend="tesseract", language="eng"),
    )
    loader = KreuzbergLoader(
        file_path="scanned.pdf",
        config=config,
    )
    docs = loader.load()

Load every PDF in a directory:

    from langchain_kreuzberg import KreuzbergLoader

    loader = KreuzbergLoader(
        file_path="./documents/",
        glob="**/*.pdf",
    )
    docs = loader.load()

Split a document into one Document per page:

    from kreuzberg import ExtractionConfig, PageConfig
    from langchain_kreuzberg import KreuzbergLoader

    config = ExtractionConfig(pages=PageConfig(extract_pages=True))
    loader = KreuzbergLoader(
        file_path="handbook.pdf",
        config=config,
    )
    docs = loader.load()
    # docs[0].metadata["page"] == 0 (zero-indexed)

Load from raw bytes:

    import httpx
    from langchain_kreuzberg import KreuzbergLoader

    response = httpx.get("https://example.com/report.pdf")
    loader = KreuzbergLoader(
        data=response.content,
        mime_type="application/pdf",
    )
    docs = loader.load()

Combine output format, OCR, and per-page splitting:

    from kreuzberg import ExtractionConfig, OcrConfig, PageConfig
    from langchain_kreuzberg import KreuzbergLoader

    config = ExtractionConfig(
        output_format="markdown",
        ocr=OcrConfig(backend="easyocr", language="deu"),
        force_ocr=True,
        pages=PageConfig(extract_pages=True),
    )
    loader = KreuzbergLoader(
        file_path="report.pdf",
        config=config,
    )
    docs = loader.load()

Load asynchronously:

    import asyncio
    from langchain_kreuzberg import KreuzbergLoader

    async def main():
        loader = KreuzbergLoader(file_path="report.pdf")
        docs = await loader.aload()
        print(f"Loaded {len(docs)} documents")

    asyncio.run(main())

`KreuzbergLoader` (imported with `from langchain_kreuzberg import KreuzbergLoader`) extends `langchain_core.document_loaders.BaseLoader`.
All parameters are keyword-only.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `file_path` | `str \| Path \| list[str \| Path] \| None` | `None` | File path, list of file paths, or directory path to load. |
| `data` | `bytes \| None` | `None` | Raw bytes to extract text from. Mutually exclusive with `file_path`. |
| `mime_type` | `str \| None` | `None` | MIME type hint. Required when using `data`, optional for `file_path`. |
| `glob` | `str \| None` | `None` | Glob pattern for directory loading. |
| `config` | `ExtractionConfig \| None` | `None` | Kreuzberg `ExtractionConfig` for controlling extraction behavior (output format, OCR settings, page splitting, etc.). See the Kreuzberg repository for all options. |
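The constraints in the table (`data` and `file_path` are mutually exclusive, and `mime_type` is required with `data`) can be checked before a loader is constructed. The sketch below uses a hypothetical helper, `check_loader_args`, which is not part of langchain-kreuzberg. How the loader itself reports an invalid combination is not documented here, so the exception type and messages are assumptions.

```python
from __future__ import annotations

from pathlib import Path


def check_loader_args(
    *,
    file_path: str | Path | list[str | Path] | None = None,
    data: bytes | None = None,
    mime_type: str | None = None,
) -> dict:
    """Validate argument combinations and return kwargs for KreuzbergLoader.

    Hypothetical helper: it only mirrors the rules stated in the parameter table.
    """
    if file_path is not None and data is not None:
        raise ValueError("pass either file_path or data, not both")
    if data is not None and mime_type is None:
        raise ValueError("mime_type is required when loading from data")

    kwargs = {"file_path": file_path, "data": data, "mime_type": mime_type}
    # Drop unset values so the loader's own defaults apply.
    return {key: value for key, value in kwargs.items() if value is not None}


kwargs = check_loader_args(data=b"%PDF-1.7 ...", mime_type="application/pdf")
print(sorted(kwargs))  # ['data', 'mime_type']
```

The returned dictionary could then be unpacked into the constructor, as in `KreuzbergLoader(**kwargs)`.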

| Method | Return Type | Description |
|---|---|---|
| `load()` | `list[Document]` | Load all documents into memory. |
| `lazy_load()` | `Iterator[Document]` | Lazily yield documents one at a time (synchronous). |
| `aload()` | `list[Document]` | Load all documents asynchronously. |
| `alazy_load()` | `AsyncIterator[Document]` | Lazily yield documents one at a time (asynchronous). |

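Because `lazy_load()` yields documents one at a time, a large directory can be processed in fixed-size batches instead of being held in memory all at once. The sketch below uses a hypothetical helper, `batched_documents`, built on `itertools.islice`. It accepts any iterable, so a plain `range` stands in for `loader.lazy_load()` to keep the example self-contained; the batch size is an arbitrary choice.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched_documents(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls at most `size` items per step; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in for loader.lazy_load(): any iterator of documents behaves the same way.
for batch in batched_documents(range(5), size=2):
    print(batch)
# [0, 1]
# [2, 3]
# [4]
```

With a real loader, the loop would presumably read `for batch in batched_documents(loader.lazy_load(), size=32): ...`, with each batch embedded or indexed before the next one is extracted.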
Each Document produced by KreuzbergLoader includes the following metadata fields (when available):
| Field | Type | Description |
|---|---|---|
| `source` | `str` | File path or `bytes://<mime_type>` for bytes input. |
| `mime_type` | `str` | Detected or provided MIME type. |
| `page_count` | `int` | Total number of pages in the document. |
| `output_format` | `str` | The output format used for extraction. |
| `quality_score` | `float` | Extraction quality score (0.0 -- 1.0). |
| `detected_languages` | `list[str]` | Languages detected in the document. |
| `extracted_keywords` | `list[dict]` | Keywords with `text`, `score`, and `algorithm` fields. |
| `table_count` | `int` | Number of tables found in the document. |
| `tables` | `list[dict]` | Table data with `cells`, `markdown`, and `page_number` fields. |
| `processing_warnings` | `list[str]` | Any warnings generated during extraction. |
| `page` | `int` | Zero-indexed page number (only present in per-page mode). |
| `is_blank` | `bool` | Whether the page is blank (only present in per-page mode). |
| `title` | `str` | Document title (from file metadata). |
| `author` | `str` | Document author (from file metadata). |
| `subject` | `str` | Document subject (from file metadata). |
| `creator` | `str` | Application that created the document. |
| `producer` | `str` | Application that produced the document. |
| `creation_date` | `str` | Document creation date. |
| `modification_date` | `str` | Document last modification date. |

Additional metadata fields from Kreuzberg's document-level metadata are flattened into the metadata dict when present.
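The metadata fields listed in the table above can drive simple filtering before documents reach a RAG pipeline. The following sketch assumes only that each document exposes a `metadata` dict, as LangChain `Document` objects do. `FakeDocument`, `keep_document`, the 0.5 threshold, and the language codes are hypothetical stand-ins chosen for illustration. Because fields are included only when available, every lookup supplies a default.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used to keep the sketch self-contained."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_document(doc, *, min_quality: float = 0.5, language: str | None = None) -> bool:
    """Return True when a document passes the blank-page, quality, and language checks."""
    meta = doc.metadata
    # is_blank is only present in per-page mode, so default to False.
    if meta.get("is_blank", False):
        return False
    # Assumption: a missing quality_score is treated as acceptable, not as 0.0.
    quality = meta.get("quality_score")
    if quality is not None and quality < min_quality:
        return False
    # Assumption: language codes look like "eng"; the real format may differ.
    if language is not None and language not in meta.get("detected_languages", []):
        return False
    return True


docs = [
    FakeDocument("Quarterly results ...", {"quality_score": 0.92, "detected_languages": ["eng"], "page": 0}),
    FakeDocument("", {"quality_score": 0.10, "is_blank": True, "page": 1}),
    FakeDocument("Zusammenfassung ...", {"quality_score": 0.80, "detected_languages": ["deu"], "page": 2}),
]

kept = [doc for doc in docs if keep_document(doc, language="eng")]
print([doc.metadata["page"] for doc in kept])  # [0]
```

The same predicate could be applied to the output of `loader.load()` or inside a loop over `loader.lazy_load()`.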
Kreuzberg supports 75+ file formats including PDF, DOCX, images (via OCR), spreadsheets, presentations, HTML, Markdown, and many more. For the full and up-to-date list of supported formats, see the Kreuzberg repository.
This project uses uv for dependency management.

    # Clone the repository
    git clone https://github.com/kreuzberg-dev/langchain-kreuzberg.git
    cd langchain-kreuzberg

    # Install dependencies (including dev group)
    uv sync

    # Run linting
    uv run ruff check .
    uv run ruff format --check .
    uv run mypy .

    # Run unit tests
    uv run pytest --cov

    # Run integration tests (real file extraction, no mocks)
    uv run pytest -m integration -v

    # Install pre-commit hooks
    prek install

This project is licensed under the MIT License.