feat: add Mistral Document AI for document parsing #175

Merged
emilradix merged 11 commits into main from rd-mistral-ocr
Feb 17, 2026

@r-dh
Contributor

@r-dh r-dh commented Jan 23, 2026

Adds Mistral as an optional document processor with pip install raglite[mistral-ocr].

Converts PDFs, images, .docx, .pptx, and more to Markdown via Mistral's OCR API. Uses bbox_annotation_format to classify and describe images found in documents, making visual content searchable as text.
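To illustrate how an image description can become searchable text, here is a minimal sketch that swaps Markdown image references for their descriptions. The function name `inline_image_descriptions`, the `[Image: ...]` output format, and the `descriptions` mapping are hypothetical and not taken from this PR; the actual implementation may render annotations differently.

```python
import re

# Matches Markdown image references such as ![img-0.jpeg](img-0.jpeg).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)]+)\)")


def inline_image_descriptions(markdown: str, descriptions: dict[str, str]) -> str:
    """Replace image references with their text descriptions (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        description = descriptions.get(match.group(1))
        # Leave the reference untouched when no description is available.
        return match.group(0) if description is None else f"[Image: {description}]"

    return IMAGE_REF.sub(substitute, markdown)


page = "Revenue grew.\n\n![img-0.jpeg](img-0.jpeg)\n"
result = inline_image_descriptions(page, {"img-0.jpeg": "Bar chart of revenue by quarter"})
```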

Image categories are user-defined: Mistral constrains its classification to whichever types are provided.

RAGLiteConfig(
    document_processor=MistralOCRConfig(
        image_types=frozenset({"chart", "diagram", "photo", "table", "icon", "logo"}),
        exclude_image_types=frozenset({"icon", "logo"}),
    ),
)
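One way to picture how the two sets might interact is the sketch below: `image_types` becomes an `enum` in a JSON schema so the classifier can only answer with a listed category, and `exclude_image_types` filters the results afterwards. The names `ImageAnnotation`, `build_annotation_schema`, and `keep_annotations` are assumptions for illustration, not the code in `src/raglite/_mistral_ocr.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageAnnotation:
    """Hypothetical container for one classified image."""

    image_type: str
    description: str


def build_annotation_schema(image_types: frozenset[str]) -> dict:
    """Build a JSON schema whose enum restricts the allowed image types."""
    return {
        "type": "object",
        "properties": {
            "image_type": {"type": "string", "enum": sorted(image_types)},
            "description": {"type": "string"},
        },
        "required": ["image_type", "description"],
        "additionalProperties": False,
    }


def keep_annotations(
    annotations: list[ImageAnnotation], exclude_image_types: frozenset[str]
) -> list[ImageAnnotation]:
    """Drop annotations whose type the user chose to exclude."""
    return [a for a in annotations if a.image_type not in exclude_image_types]


image_types = frozenset({"chart", "diagram", "photo", "table", "icon", "logo"})
schema = build_annotation_schema(image_types)
kept = keep_annotations(
    [ImageAnnotation("chart", "Quarterly revenue"), ImageAnnotation("logo", "Company logo")],
    frozenset({"icon", "logo"}),
)
```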

Falls back to the default pdftext/pandoc processor for unsupported file types.
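The fallback can be thought of as a dispatch on file type. The sketch below assumes the decision is made on the file suffix; the suffix set, the function name `choose_processor`, and the returned labels are illustrative only and may not match the supported types or the logic in this PR.

```python
from pathlib import Path

# Illustrative subset only; the real list of supported types is an assumption here.
MISTRAL_SUFFIXES = frozenset({".pdf", ".png", ".jpg", ".jpeg", ".docx", ".pptx"})


def choose_processor(path: Path) -> str:
    """Return which processor would handle the file (hypothetical labels)."""
    if path.suffix.lower() in MISTRAL_SUFFIXES:
        return "mistral-ocr"
    # Anything else goes to the default pdftext/pandoc processor.
    return "default"


ocr_choice = choose_processor(Path("report.PDF"))
fallback_choice = choose_processor(Path("archive.zip"))
```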

Note that this includes an onnxruntime pin for Python 3.10 and a ruff <0.15 cap, the latter of which will be resolved in a separate PR.

@emilradix emilradix marked this pull request as draft January 26, 2026 15:31
@emilradix
Collaborator

@r-dh I marked this as a draft, as I understood from your description that it is not yet ready to be merged.

@emilradix
Collaborator

Is this ready for review? If so, you can mark it as ready for review; currently it's a draft.

@r-dh r-dh force-pushed the rd-mistral-ocr branch 2 times, most recently from 3a3cb9c to 40b8573 Compare February 9, 2026 22:22
@r-dh r-dh marked this pull request as ready for review February 10, 2026 08:30
@r-dh
Contributor Author

r-dh commented Feb 10, 2026

I had to pin ruff to make the tests pass, because the ruff 0.15 release brings a myriad of new linting issues across the repo that should either be addressed in a separate branch or avoided altogether.

There has also been a new onnxruntime library release in the meantime that is no longer compatible with Python 3.10, so I pinned that as well.

@emilradix
Collaborator

@MattiaMolon Do you mind reviewing?

@r-dh Can you open a separate PR that deals with ruff linting for the rest of the repo? Maybe you can let codex run to deal with a lot of those issues?

@MattiaMolon
Contributor

Sure, I will review this

@MattiaMolon MattiaMolon self-requested a review February 10, 2026 08:46
Comment thread src/raglite/_mistral_ocr.py
Comment thread tests/test_mistral_ocr.py
Contributor

@MattiaMolon MattiaMolon left a comment

LGTM!
Left 2 small comments for you, @r-dh, to evaluate whether they are worth changing.

Comment thread pyproject.toml
Comment thread pyproject.toml Outdated
@emilradix
Collaborator

@r-dh can you update the PR description to be accurate with the current status? Thanks!

@r-dh
Contributor Author

r-dh commented Feb 11, 2026

This reminds me I should probably update the README.md as well to reflect these changes. Let me do this first and then update the content of this PR.

@r-dh
Contributor Author

r-dh commented Feb 12, 2026

I made one more change so end users can more easily define the image types that they do (or don't) want to process.

@r-dh r-dh changed the title from "Add Mistral Document AI for document parsing" to "feat: add Mistral Document AI for document parsing" Feb 13, 2026
@emilradix
Collaborator

@r-dh Please remove the ruff pin and I will merge

@r-dh
Contributor Author

r-dh commented Feb 16, 2026

@emilradix Ready to merge

@emilradix emilradix merged commit 54f4f80 into main Feb 17, 2026
4 checks passed
@emilradix emilradix deleted the rd-mistral-ocr branch February 17, 2026 11:53

3 participants