@Hansehart
Contributor

@Hansehart Hansehart commented Oct 13, 2025

Related Issues

Proposed Changes:

This PR adds a new OCR document converter component for the Mistral integration in Haystack.
The MistralOCRDocumentConverter uses Mistral’s Document AI / OCR API to extract text and structured annotations from documents and images.

Key features:

  • Automatically converts the OCR output for a PDF into a single Haystack Document object, which helps with hard-to-parse formats.
  • Supports multiple document input types (DocumentURLChunk, FileChunk (La Plateforme), ImageURLChunk).
  • Optionally enriches image regions with structured annotations via Pydantic schemas.
  • Supports document-level annotations (e.g. language detection, extracted URLs, any other defined metadata).
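
As a rough illustration of the annotation schemas mentioned above, the following is a minimal sketch, assuming Pydantic is installed. The class names and fields (`ImageAnnotation`, `DocumentAnnotation`, `image_type`, `short_description`, `language`, `urls`) are hypothetical and chosen only to mirror the examples in the list; how such schemas are handed to the converter is not shown in this thread.

```python
from typing import List

from pydantic import BaseModel, Field


class ImageAnnotation(BaseModel):
    """Hypothetical schema for one image region (bbox) found by OCR."""

    image_type: str = Field(..., description="Kind of image, e.g. 'chart' or 'photo'")
    short_description: str = Field(..., description="One-sentence summary of the image")


class DocumentAnnotation(BaseModel):
    """Hypothetical schema for document-level metadata."""

    language: str = Field(..., description="Main language of the document")
    # typing.List keeps the annotation valid on Python 3.9.
    urls: List[str] = Field(default_factory=list, description="URLs found in the text")


# Validation happens when an instance is created.
doc_annotation = DocumentAnnotation(language="en")
image_annotation = ImageAnnotation(image_type="chart", short_description="Loss curve")
```

Keeping the two schemas separate matches the split between per-image and per-document annotations described in the feature list.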

How did you test it?

  • Manual verification using real-world PDFs and images (e.g. public arXiv PDFs).
  • Confirmed Mistral OCR API call returns structured responses with both bbox and document-level annotations.
  • Verified Document creation and metadata enrichment in Haystack.

Notes for the reviewer

There are definitely some things missing. It works great in my own setup, but it needs adjustments for Haystack, including:

  • Tests
  • Docs
  • Package integration, e.g. in the example I use haystack_integrations.components.converters.mistral.ocr_document_converter, which cannot be resolved
  • Because it uses the pipe union syntax (| instead of Optional[]), it requires python>=3.10, but this can be adjusted
  • Mistral API key required
  • The other components do not use the Mistral SDK. I am unsure whether it is required to rely only on openai, or whether I can use the Mistral SDK.
  • I developed this component for my own purposes but want to share it. If you give me some feedback on what needs to be done, I will tackle it. The functionality is working; for the integration steps I am opening the PR proactively, knowing it needs adjustments.

Checklist

@Hansehart Hansehart requested a review from a team as a code owner October 13, 2025 23:05
@Hansehart Hansehart requested review from anakin87 and removed request for a team October 13, 2025 23:05
@github-actions github-actions bot added integration:mistral integration:mcp type:documentation Improvements or additions to documentation labels Oct 13, 2025
Member

@anakin87 anakin87 left a comment


Thanks for your contribution. I have left some initial comments about the implementation. In general, I would make this component more similar to the AzureOCRDocumentConverter.

Other points you mentioned:

  • It is OK to use Mistral SDK
  • We want to keep Python 3.9 compatibility for the moment, so let's use Union.
  • Once we agree on the main aspects of the implementation, let's add unit and integration tests.
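
To illustrate the Python 3.9 point: the `X | Y` annotation syntax (PEP 604) needs Python 3.10 or newer when the annotation is evaluated, so the `typing` spellings are the portable choice. The sketch below uses a hypothetical helper, `normalize_pages`, that is not taken from the PR; it only shows the annotation style.

```python
from typing import List, Optional, Union


def normalize_pages(pages: Optional[Union[int, List[int]]] = None) -> Optional[List[int]]:
    """Return the page selection as a list, or None when nothing was selected.

    On Python 3.10+ the parameter could be annotated as `int | list[int] | None`;
    the Optional/Union form used here is also valid on Python 3.9.
    """
    if pages is None:
        return None
    if isinstance(pages, int):
        return [pages]
    return list(pages)
```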

@Hansehart
Contributor Author

I think the last remaining issue is that I can't run hatch run fmt, because then my

  "type": (
      "haystack_integrations.components.converters.mistral."
      "ocr_document_converter.MistralOCRDocumentConverter"
  ),

becomes: "haystack_integrations.components.converters.mistral.ocr_document_converter.MistralOCRDocumentConverter"

and then three lines exceed the line limit (125 > 120).
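
For context, the wrapped and the single-line spelling are the same string, because Python concatenates adjacent string literals. One possible way to avoid the long literal altogether is to derive the dotted path from the class itself. This is only a sketch of that idea, not what the PR ended up doing; `qualified_name` and `DummyConverter` are hypothetical names used for illustration.

```python
# Adjacent string literals are joined by the compiler, so both spellings
# below produce the same value.
WRAPPED = (
    "haystack_integrations.components.converters.mistral."
    "ocr_document_converter.MistralOCRDocumentConverter"
)
SINGLE = "haystack_integrations.components.converters.mistral.ocr_document_converter.MistralOCRDocumentConverter"


def qualified_name(cls: type) -> str:
    """Build the dotted import path of a class from its module and name."""
    return f"{cls.__module__}.{cls.__qualname__}"


class DummyConverter:
    """Stand-in class; a real test would import the actual component instead."""


# In a test, the expected "type" value could be computed rather than typed out.
expected_type = qualified_name(DummyConverter)
```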

@Hansehart Hansehart requested a review from anakin87 October 15, 2025 18:48
Co-authored-by: Stefano Fiorucci <[email protected]>
@Hansehart Hansehart requested a review from anakin87 October 19, 2025 23:30
@anakin87
Member

Hey @Hansehart, I've been a bit busy... I'll review it again soon

@Hansehart
Contributor Author

Hansehart commented Oct 21, 2025

Take your time. I have no deadline for this feature, and I really appreciate your valuable feedback. Contributing is a lot of fun for me.

@anakin87
Member

Please update https://github.com/deepset-ai/haystack-core-integrations/blob/main/README.md
"Embedder, Generator" -> "Converter, Embedder, Generator"


In general, we are close to merging this PR. 👍

@Hansehart Hansehart requested a review from anakin87 October 23, 2025 07:16
Member

@anakin87 anakin87 left a comment


I ran integration tests locally (they don't run on PRs from forks) and all works well.

Thank you!

@anakin87 anakin87 merged commit 890eb5c into deepset-ai:main Oct 23, 2025
11 checks passed

Labels

integration:mistral topic:CI type:documentation Improvements or additions to documentation

3 participants