feat: add Mistral Document AI for document parsing #175
|
@r-dh I marked this as a draft, since I understood from your description that it is not yet ready to be merged. |
|
Is this ready for review? If so, you can mark it as ready for review; currently it's a draft. |
Force-pushed from 3a3cb9c to 40b8573.
|
I had to pin ruff to make the tests pass, because the ruff 0.15 release brings a myriad of new linting issues across the repo that should either be addressed in a separate branch or avoided altogether. There has also been a new |
|
@MattiaMolon Do you mind reviewing? @r-dh Can you open a separate PR that deals with ruff linting for the rest of the repo? Maybe you can let Codex run to deal with a lot of those issues? |
|
Sure, I will review this |
LGTM!
Left 2 small comments for you @r-dh to evaluate if worth changing.
|
@r-dh can you update the PR description to be accurate with the current status? Thanks! |
|
This reminds me I should probably update the README.md as well to reflect these changes. Let me do this first and then update the content of this PR. |
|
I made one more change so end users can more easily define their own image types that they (don't) want to process. |
|
@r-dh Please remove the ruff pin and I will merge |
|
@emilradix Ready to merge |
Adds Mistral as an optional document processor with `pip install raglite[mistral-ocr]`.

Converts PDFs, images, .docx, .pptx, and more to Markdown via Mistral's OCR API. Uses bbox_annotation_format to classify and describe images found in documents, making visual content searchable as text.
Image categories are user-defined; Mistral constrains its classification to whatever types are provided.
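To illustrate how a user-defined category list can constrain a classifier, here is a minimal sketch that builds a JSON schema whose `image_type` field is an enum of the provided categories. The function name, field names, and example categories are hypothetical and chosen for illustration only; they are not taken from this PR, and the exact payload raglite sends to Mistral may differ.

```python
import json


def build_image_annotation_schema(image_types):
    """Return a JSON schema whose `image_type` field is limited to `image_types`.

    Hypothetical helper: the field names are assumptions, not raglite's API.
    """
    if not image_types:
        raise ValueError("at least one image type is required")
    return {
        "type": "object",
        "properties": {
            # The enum is what restricts classification to the user's categories.
            "image_type": {"type": "string", "enum": list(image_types)},
            "description": {"type": "string"},
        },
        "required": ["image_type", "description"],
        "additionalProperties": False,
    }


# Example categories (assumed, not from the PR).
schema = build_image_annotation_schema(["chart", "diagram", "photo"])
print(json.dumps(schema, indent=2))
```

Because the allowed values live in a single enum, changing the category list changes what the model is permitted to return without touching any other code.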
Falls back to the default `pdftext/pandoc` processor for unsupported file types.

Note that this includes an `onnxruntime` pin for Python 3.10 and a `ruff <0.15` cap, the latter of which will be resolved in a separate PR.
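The fallback for unsupported file types mentioned above can be pictured as a simple suffix check. This is only a sketch under stated assumptions: the suffix set, the function name, and the returned labels are hypothetical and do not reflect the actual implementation in this PR.

```python
from pathlib import Path

# Hypothetical set of suffixes routed to the OCR processor (assumed for illustration).
OCR_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".docx", ".pptx"}


def choose_processor(path, ocr_enabled=True):
    """Pick a processor label for `path`, falling back to the default one."""
    suffix = Path(path).suffix.lower()
    if ocr_enabled and suffix in OCR_SUFFIXES:
        return "mistral-ocr"
    # Anything unsupported, or any file when OCR is disabled, uses the default path.
    return "default"


print(choose_processor("report.PDF"))
print(choose_processor("notes.txt"))
```

Keeping the decision in one small function makes the fallback easy to test in isolation from any network calls.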