feat: mistral ocr converter #2376
Merged: anakin87 merged 56 commits into deepset-ai:main from Hanseatische-Entwicklungseinheit:add-mistral-ocr on Oct 23, 2025
Commits
d717d46 Revise MCPTool usage example for Streamable HTTP (Hansehart)
c12dd25 Clarify connection types in MCPToolset documentation (Hansehart)
2b58a33 Merge pull request #1 from Hansehart/patch-1 (Hansehart)
1611199 fix: Align with hatch run fmt requirements (Hansehart)
d8c3ffd add: MistralOCRDocumentConverter (Hansehart)
877a3bc add: project files (Hansehart)
6fd0394 fix: example lib usage (Hansehart)
1abfcbb move: ocr document converter into child /mistral (Hansehart)
6e16719 add: example usage with annotations (Hansehart)
6416a0c add: hatch run fmt (Hansehart)
e2ec0b6 add: mistralai (Hansehart)
a0c2abe Merge branch 'main' into add-mistral-ocr (Hansehart)
df89124 fix: python3.9 compatibility with using Union, List, Optional (Hansehart)
fc7e31d Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
39b6cae add: new comments and their position (Hansehart)
f2170f9 add: moved schemas from init into run to bypass problems with seriali… (Hansehart)
7221af4 add: docstring convention (Hansehart)
aa8f3bc add: process mutliple documents (Hansehart)
1160256 add: robust api handling with catching mistral errors (Hansehart)
d351909 add: Union[str, Path, ByteStream] as input (Hansehart)
4efc546 add: comment for new inputs (Hansehart)
c246153 add: pipeline example (Hansehart)
c665cde fix: example ocr component (Hansehart)
0a7cf6a fix: mistral file upload and pydantic v2 models (Hansehart)
0fbf500 add: pipeline example (Hansehart)
6620442 add: hint on document annotation page limit (Hansehart)
0656d2c add: mistralai as project dependency (Hansehart)
b5ff05f fix: hatch run fmt (Hansehart)
70b81ad fix: hatch run docs (Hansehart)
815c92c Merge branch 'deepset-ai:main' into add-mistral-ocr (Hansehart)
88302b0 add: to dict, from dict (Hansehart)
1b48359 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
5030044 add: exlcuse mistral from compliance workflow (its apache 2.0) (Hansehart)
1e4c0f1 add: 3 initialization tests (Hansehart)
8f847f0 add: 4 se test (Hansehart)
b1d5729 add: test w/ proper mocking (Hansehart)
6a02ecf add: real api test when env is set (Hansehart)
0cb1e8c add: delete files by default from mistral if uploaded (Hansehart)
9b1b29e fix: mock file deletion (Hansehart)
a4fbdb1 fix: hatch run fmt (Hansehart)
dbbb30a Apply suggestion from @anakin87 (Hansehart)
e57629c Merge branch 'deepset-ai:main' into add-mistral-ocr (Hansehart)
0961855 Update integrations/mistral/src/haystack_integrations/components/conv… (Hansehart)
82f38eb fix: nested try excepts (Hansehart)
e358660 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
fd84e02 add: mention file upload (Hansehart)
cc8dd05 Update integrations/mistral/tests/test_ocr_document_converter.py (Hansehart)
46b89ff Update integrations/mistral/tests/test_ocr_document_converter.py (Hansehart)
f92fe6d add: less test code due to pytest.mark..parametrize (Hansehart)
99e7989 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
4a5d316 add: less tests and const class type (Hansehart)
6017c15 fix: format (Hansehart)
e162a73 Merge branch 'main' into add-mistral-ocr (Hansehart)
30fbc23 add: ocr document converter to docusaurus (Hansehart)
fa193ec add: converter to mistral (Hansehart)
5f4216e Merge branch 'main' into add-mistral-ocr (Hansehart)
@@ -0,0 +1,87 @@

    # To run this example, you will need to:
    # 1. Set a `MISTRAL_API_KEY` environment variable
    # 2. Place a PDF file named `sample.pdf` in the same directory as this script
    #
    # This example demonstrates OCR document processing with structured annotations,
    # embedding the extracted documents using Mistral embeddings, and storing them
    # in an InMemoryDocumentStore for later retrieval.
    #
    # You can customize the ImageAnnotation and DocumentAnnotation schemas below
    # to extract different structured information from your documents.

    from typing import List

    from haystack import Pipeline
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore
    from mistralai.models import DocumentURLChunk
    from pydantic import BaseModel, Field

    from haystack_integrations.components.converters.mistral.ocr_document_converter import (
        MistralOCRDocumentConverter,
    )
    from haystack_integrations.components.embedders.mistral.document_embedder import (
        MistralDocumentEmbedder,
    )


    # Define schema for structured image annotations (bbox)
    class ImageAnnotation(BaseModel):
        image_type: str = Field(..., description="The type of image content")
        description: str = Field(..., description="Brief description of the image")


    # Define schema for structured document annotations
    class DocumentAnnotation(BaseModel):
        language: str = Field(..., description="Primary language of the document")
        urls: List[str] = Field(..., description="URLs found in the document")
        topics: List[str] = Field(..., description="Main topics covered in the document")


    # Initialize document store
    document_store = InMemoryDocumentStore()

    # Create indexing pipeline
    indexing_pipeline = Pipeline()

    # Add components to the pipeline
    indexing_pipeline.add_component(
        "converter",
        MistralOCRDocumentConverter(pages=[0, 1]),
    )
    indexing_pipeline.add_component(
        "embedder",
        MistralDocumentEmbedder(),
    )
    indexing_pipeline.add_component(
        "writer",
        DocumentWriter(document_store=document_store),
    )

    # Connect components
    indexing_pipeline.connect("converter.documents", "embedder.documents")
    indexing_pipeline.connect("embedder.documents", "writer.documents")

    # Prepare sources: URL and local file
    sources = [
        DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"),
        "./sample.pdf",  # Local PDF file
    ]

    # Run the pipeline with annotation schemas
    result = indexing_pipeline.run(
        {
            "converter": {
                "sources": sources,
                "bbox_annotation_schema": ImageAnnotation,
                "document_annotation_schema": DocumentAnnotation,
            }
        }
    )


    # Inspect the documents processed by OCR.
    # Optionally enriched with content (from the bbox annotation) and semantic metadata (from the document annotation)
    documents = document_store.storage
    # Inspect the raw Mistral API response, which holds the unprocessed data and usage_info
    raw_mistral_response = result["converter"]["raw_mistral_response"]
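
The example's header comments list two prerequisites: a `MISTRAL_API_KEY` environment variable and a `sample.pdf` file. A small pre-flight check could report both problems before any API call is made. The sketch below is not part of the PR; `missing_prerequisites` is a hypothetical helper name, and it uses only the standard library.

```python
import os
from pathlib import Path
from typing import List, Mapping


def missing_prerequisites(env: Mapping[str, str], directory: Path) -> List[str]:
    """Return a list of problems; an empty list means both prerequisites are met.

    Hypothetical helper for illustration only, not part of the integration.
    """
    problems = []
    # The example expects the API key in the MISTRAL_API_KEY environment variable.
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    # The example passes "./sample.pdf" as a local source.
    if not (directory / "sample.pdf").is_file():
        problems.append(f"sample.pdf not found in {directory}")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites(os.environ, Path.cwd()):
        print(problem)
```

The check uses `Path.cwd()` because a relative path such as `"./sample.pdf"` is normally resolved against the current working directory rather than the script's location. Assuming the converter opens the path as given, the example should be run from the directory that contains both the script and the PDF.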
integrations/mistral/src/haystack_integrations/components/converters/mistral/__init__.py
3 changes: 3 additions & 0 deletions
@@ -0,0 +1,3 @@

    from .ocr_document_converter import MistralOCRDocumentConverter

    __all__ = ["MistralOCRDocumentConverter"]
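
Because this `__init__.py` re-exports the class, it should also be importable from the package path `haystack_integrations.components.converters.mistral`, not only from the `ocr_document_converter` module used in the example. The sketch below checks this defensively so it also runs where the integration is not installed; `load_converter_class` is a hypothetical helper name.

```python
import importlib
from typing import Optional


def load_converter_class() -> Optional[type]:
    """Return MistralOCRDocumentConverter via the package-level import, or None.

    Hypothetical helper: returns None when the integration (or one of its
    dependencies) is not installed in the current environment.
    """
    try:
        package = importlib.import_module(
            "haystack_integrations.components.converters.mistral"
        )
    except ImportError:
        return None
    # The name is available on the package because __init__.py re-exports it.
    return getattr(package, "MistralOCRDocumentConverter", None)


converter_class = load_converter_class()
if converter_class is None:
    print("Mistral integration not installed")
else:
    print(f"Loaded {converter_class.__name__}")
```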