feat: add Mistral Document AI for document parsing by r-dh · Pull Request #175 · superlinear-ai/raglite

r-dh · 2026-01-23T15:34:37Z

Adds Mistral as an optional document processor, installable with pip install raglite[mistral-ocr].

Converts PDFs, images, .docx, .pptx, and more to Markdown via Mistral's OCR API. Uses bbox_annotation_format to classify and describe images found in documents, making visual content searchable as text.
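
To illustrate how image annotations can make visual content searchable, here is a minimal sketch that swaps Markdown image references for their text descriptions. It is not the code in this PR: the helper name inline_image_annotations, the annotation keys image_type and description, and the idea that annotations arrive as JSON strings keyed by image id are all assumptions made for the example.

```python
from __future__ import annotations

import json
import re

# Matches Markdown image references such as ![img-0.jpeg](img-0.jpeg).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\((?P<target>[^)]+)\)")


def inline_image_annotations(markdown: str, annotations: dict[str, str]) -> str:
    """Replace image references with a text description (hypothetical helper).

    `annotations` maps an image id to a JSON string with the assumed keys
    "image_type" and "description".
    """

    def replace(match: re.Match[str]) -> str:
        raw = annotations.get(match.group("target"))
        if raw is None:
            # No annotation for this image: leave the reference untouched.
            return match.group(0)
        data = json.loads(raw)
        return f"[{data['image_type']}: {data['description']}]"

    return IMAGE_REF.sub(replace, markdown)


page = "Revenue grew.\n\n![img-0.jpeg](img-0.jpeg)\n"
annotations = {
    "img-0.jpeg": json.dumps(
        {"image_type": "chart", "description": "Bar chart of quarterly revenue."}
    )
}
result = inline_image_annotations(page, annotations)
print(result)
```

Once the description is part of the Markdown, it is chunked and embedded like any other text.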

Image categories are user-defined; Mistral constrains its classification to whatever types are provided:

RAGLiteConfig(
    document_processor=MistralOCRConfig(
        image_types=frozenset({"chart", "diagram", "photo", "table", "icon", "logo"}),
        exclude_image_types=frozenset({"icon", "logo"}),
    ),
)
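
As a rough sketch of how these two settings could interact: the provided types can become an enum in a JSON schema that constrains the classifier, and the excluded types can be filtered out afterwards. The function names build_annotation_schema and keep_image and the schema layout are hypothetical, not taken from the PR.

```python
from __future__ import annotations


def build_annotation_schema(image_types: frozenset[str]) -> dict:
    """Build a JSON schema whose enum limits the allowed image types (assumed layout)."""
    return {
        "type": "object",
        "properties": {
            "image_type": {"type": "string", "enum": sorted(image_types)},
            "description": {"type": "string"},
        },
        "required": ["image_type", "description"],
    }


def keep_image(image_type: str, exclude_image_types: frozenset[str]) -> bool:
    """Return True when an image of this type should be kept in the output."""
    return image_type not in exclude_image_types


image_types = frozenset({"chart", "diagram", "photo", "table", "icon", "logo"})
exclude_image_types = frozenset({"icon", "logo"})

schema = build_annotation_schema(image_types)
kept = [t for t in sorted(image_types) if keep_image(t, exclude_image_types)]
print(schema["properties"]["image_type"]["enum"])
print(kept)
```

Listing icon and logo in image_types while also excluding them lets the classifier recognize them as such, so they can be dropped instead of being forced into another category.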

Falls back to the default pdftext/pandoc processor for unsupported file types.
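
The fallback can be pictured as a dispatch on the file suffix. The sketch below is an assumption about the general shape, not the PR's implementation: the suffix list is illustrative and incomplete, and choose_processor is a hypothetical name.

```python
from __future__ import annotations

import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Illustrative subset only; the real list of supported types is an assumption here.
MISTRAL_SUFFIXES = frozenset({".pdf", ".png", ".jpg", ".jpeg", ".docx", ".pptx"})


def choose_processor(path: Path, supported: frozenset[str] = MISTRAL_SUFFIXES) -> str:
    """Pick "mistral-ocr" for supported suffixes, otherwise the default processor."""
    if path.suffix.lower() in supported:
        return "mistral-ocr"
    # Log the fallback so users can see why a file skipped Mistral OCR.
    logger.info("Unsupported file type %s, using the default processor", path.suffix)
    return "default"


print(choose_processor(Path("report.PDF")))
print(choose_processor(Path("notes.txt")))
```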

Note that this includes an onnxruntime pin for Python 3.10 and a ruff <0.15 cap, the latter of which will be resolved in a separate PR.

emilradix · 2026-01-26T15:31:49Z

@r-dh I marked this as a draft because I understood from your description that it is not yet ready to be merged.

emilradix · 2026-02-09T09:35:54Z

Is this ready for review? If so, you can mark it as ready for review; currently it's a draft.

r-dh · 2026-02-10T08:33:06Z

I had to pin ruff to make the tests pass, because the ruff 0.15.0 release brings a myriad of new linting issues across the repo that should either be addressed in a separate branch or avoided altogether.

There has also been a new onnxruntime library release in the meantime that is no longer compatible with Python 3.10, so I pinned that as well.

emilradix · 2026-02-10T08:38:18Z

@MattiaMolon Do you mind reviewing?

@r-dh Can you open a separate PR that deals with ruff linting for the rest of the repo? Maybe you can let codex run to deal with a lot of those issues?

MattiaMolon · 2026-02-10T08:46:34Z

Sure, I will review this

MattiaMolon

LGTM!
Left two small comments for you, @r-dh, to evaluate whether they are worth changing.

emilradix · 2026-02-11T13:34:52Z

@r-dh can you update the PR description so it accurately reflects the current status? Thanks!

r-dh · 2026-02-11T14:03:51Z

This reminds me I should probably update the README.md as well to reflect these changes. Let me do this first and then update the content of this PR.

r-dh · 2026-02-12T21:06:02Z

I made one more change so end users can more easily define their own image types that they do or don't want to process.

emilradix · 2026-02-16T14:41:44Z

@r-dh Please remove the ruff pin and I will merge

r-dh · 2026-02-16T18:02:21Z

@emilradix Ready to merge

r-dh force-pushed the rd-mistral-ocr branch from 3656755 to b8b91c9 Compare January 23, 2026 15:44

emilradix marked this pull request as draft January 26, 2026 15:31

r-dh force-pushed the rd-mistral-ocr branch from b8b91c9 to cd86253 Compare January 27, 2026 10:02

r-dh force-pushed the rd-mistral-ocr branch 2 times, most recently from 3a3cb9c to 40b8573 Compare February 9, 2026 22:22

r-dh marked this pull request as ready for review February 10, 2026 08:30

MattiaMolon self-requested a review February 10, 2026 08:46

MattiaMolon reviewed Feb 10, 2026

Comment thread src/raglite/_mistral_ocr.py

MattiaMolon reviewed Feb 10, 2026

Comment thread tests/test_mistral_ocr.py

MattiaMolon reviewed Feb 10, 2026

emilradix reviewed Feb 11, 2026

Comment thread pyproject.toml

emilradix reviewed Feb 11, 2026

Comment thread pyproject.toml Outdated

r-dh changed the title ~~Add Mistral Document AI for document parsing~~ feat: add Mistral Document AI for document parsing Feb 13, 2026

emilradix mentioned this pull request Feb 16, 2026

fix: resolve ruff 0.15.0 lint violations #178

Merged

r-dh added 7 commits February 16, 2026 18:29

feat: support Mistral Document AI for document processing

77922b8

chore: clean up code

1426362

feat: expose Mistral model explicitly, clarify dependency on import error

ebda3aa

feat: add logging for unsupported file types in document processing

a55d64e

refactor: streamline MistralOCR tests

ba76313

chore: specify type for metadata to satisfy mypy

94373ee

test: add NVIDIA report PDF for Mistral OCR testing with enhanced assertions

94e6f3b

r-dh added 4 commits February 16, 2026 18:29

chore: add python-dotenv to development dependencies

4e9bda0

refactor: streamline OCR processing logic

6145552

feat: allow custom image type categories in MistralOCRConfig

c9f13a3

docs: add Mistral OCR usage documentation to README

f1abc8b

r-dh force-pushed the rd-mistral-ocr branch from 0f9a274 to f1abc8b Compare February 16, 2026 17:29

emilradix merged commit 54f4f80 into main Feb 17, 2026
4 checks passed

emilradix deleted the rd-mistral-ocr branch February 17, 2026 11:53
