feat: mistral ocr converter #2376
          
Merged: anakin87 merged 56 commits into deepset-ai:main from Hanseatische-Entwicklungseinheit:add-mistral-ocr on Oct 23, 2025
      Commits
d717d46 Revise MCPTool usage example for Streamable HTTP (Hansehart)
c12dd25 Clarify connection types in MCPToolset documentation (Hansehart)
2b58a33 Merge pull request #1 from Hansehart/patch-1 (Hansehart)
1611199 fix: Align with hatch run fmt requirements (Hansehart)
d8c3ffd add: MistralOCRDocumentConverter (Hansehart)
877a3bc add: project files (Hansehart)
6fd0394 fix: example lib usage (Hansehart)
1abfcbb move: ocr document converter into child /mistral (Hansehart)
6e16719 add: example usage with annotations (Hansehart)
6416a0c add: hatch run fmt (Hansehart)
e2ec0b6 add: mistralai (Hansehart)
a0c2abe Merge branch 'main' into add-mistral-ocr (Hansehart)
df89124 fix: python3.9 compatibility with using Union, List, Optional (Hansehart)
fc7e31d Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
39b6cae add: new comments and their position (Hansehart)
f2170f9 add: moved schemas from init into run to bypass problems with seriali… (Hansehart)
7221af4 add: docstring convention (Hansehart)
aa8f3bc add: process mutliple documents (Hansehart)
1160256 add: robust api handling with catching mistral errors (Hansehart)
d351909 add: Union[str, Path, ByteStream] as input (Hansehart)
4efc546 add: comment for new inputs (Hansehart)
c246153 add: pipeline example (Hansehart)
c665cde fix: example ocr component (Hansehart)
0a7cf6a fix: mistral file upload and pydantic v2 models (Hansehart)
0fbf500 add: pipeline example (Hansehart)
6620442 add: hint on document annotation page limit (Hansehart)
0656d2c add: mistralai as project dependency (Hansehart)
b5ff05f fix: hatch run fmt (Hansehart)
70b81ad fix: hatch run docs (Hansehart)
815c92c Merge branch 'deepset-ai:main' into add-mistral-ocr (Hansehart)
88302b0 add: to dict, from dict (Hansehart)
1b48359 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
5030044 add: exlcuse mistral from compliance workflow (its apache 2.0) (Hansehart)
1e4c0f1 add: 3 initialization tests (Hansehart)
8f847f0 add: 4 se test (Hansehart)
b1d5729 add: test w/ proper mocking (Hansehart)
6a02ecf add: real api test when env is set (Hansehart)
0cb1e8c add: delete files by default from mistral if uploaded (Hansehart)
9b1b29e fix: mock file deletion (Hansehart)
a4fbdb1 fix: hatch run fmt (Hansehart)
dbbb30a Apply suggestion from @anakin87 (Hansehart)
e57629c Merge branch 'deepset-ai:main' into add-mistral-ocr (Hansehart)
0961855 Update integrations/mistral/src/haystack_integrations/components/conv… (Hansehart)
82f38eb fix: nested try excepts (Hansehart)
e358660 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
fd84e02 add: mention file upload (Hansehart)
cc8dd05 Update integrations/mistral/tests/test_ocr_document_converter.py (Hansehart)
46b89ff Update integrations/mistral/tests/test_ocr_document_converter.py (Hansehart)
f92fe6d add: less test code due to pytest.mark..parametrize (Hansehart)
99e7989 Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-… (Hansehart)
4a5d316 add: less tests and const class type (Hansehart)
6017c15 fix: format (Hansehart)
e162a73 Merge branch 'main' into add-mistral-ocr (Hansehart)
30fbc23 add: ocr document converter to docusaurus (Hansehart)
fa193ec add: converter to mistral (Hansehart)
5f4216e Merge branch 'main' into add-mistral-ocr (Hansehart)
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,87 @@ | ||
| # To run this example, you will need to: | ||
| # 1. Set a `MISTRAL_API_KEY` environment variable | ||
| # 2. Place a PDF file named `sample.pdf` in the same directory as this script | ||
| # | ||
| # This example demonstrates OCR document processing with structured annotations, | ||
| # embedding the extracted documents using Mistral embeddings, and storing them | ||
| # in an InMemoryDocumentStore for later retrieval. | ||
| # | ||
| # You can customize the ImageAnnotation and DocumentAnnotation schemas below | ||
| # to extract different structured information from your documents. | ||
|  | ||
| from typing import List | ||
|  | ||
| from haystack import Pipeline | ||
| from haystack.components.writers import DocumentWriter | ||
| from haystack.document_stores.in_memory import InMemoryDocumentStore | ||
| from mistralai.models import DocumentURLChunk | ||
| from pydantic import BaseModel, Field | ||
|  | ||
| from haystack_integrations.components.converters.mistral.ocr_document_converter import ( | ||
|     MistralOCRDocumentConverter, | ||
| ) | ||
| from haystack_integrations.components.embedders.mistral.document_embedder import ( | ||
|     MistralDocumentEmbedder, | ||
| ) | ||
|  | ||
|  | ||
| # Define schema for structured image annotations (bbox) | ||
| class ImageAnnotation(BaseModel): | ||
|     image_type: str = Field(..., description="The type of image content") | ||
|     description: str = Field(..., description="Brief description of the image") | ||
|  | ||
|  | ||
| # Define schema for structured document annotations | ||
| class DocumentAnnotation(BaseModel): | ||
|     language: str = Field(..., description="Primary language of the document") | ||
|     urls: List[str] = Field(..., description="URLs found in the document") | ||
|     topics: List[str] = Field(..., description="Main topics covered in the document") | ||
|  | ||
|  | ||
| # Initialize document store | ||
| document_store = InMemoryDocumentStore() | ||
|  | ||
| # Create indexing pipeline | ||
| indexing_pipeline = Pipeline() | ||
|  | ||
| # Add components to the pipeline | ||
| indexing_pipeline.add_component( | ||
|     "converter", | ||
|     MistralOCRDocumentConverter(pages=[0, 1]), | ||
| ) | ||
| indexing_pipeline.add_component( | ||
|     "embedder", | ||
|     MistralDocumentEmbedder(), | ||
| ) | ||
| indexing_pipeline.add_component( | ||
|     "writer", | ||
|     DocumentWriter(document_store=document_store), | ||
| ) | ||
|  | ||
| # Connect components | ||
| indexing_pipeline.connect("converter.documents", "embedder.documents") | ||
| indexing_pipeline.connect("embedder.documents", "writer.documents") | ||
|  | ||
| # Prepare sources: URL and local file | ||
| sources = [ | ||
|     DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"), | ||
|     "./sample.pdf",  # Local PDF file | ||
| ] | ||
|  | ||
| # Run the pipeline with annotation schemas | ||
| result = indexing_pipeline.run( | ||
|     { | ||
|         "converter": { | ||
|             "sources": sources, | ||
|             "bbox_annotation_schema": ImageAnnotation, | ||
|             "document_annotation_schema": DocumentAnnotation, | ||
|         } | ||
|     } | ||
| ) | ||
|  | ||
|  | ||
| # Inspect the documents processed by OCR. | ||
| # They optionally carry enriched content (from the bbox annotation) and semantic metadata (from the document annotation). | ||
| documents = document_store.storage | ||
| # Inspect the raw Mistral API response for the unprocessed data and usage_info | ||
| raw_mistral_response = result["converter"]["raw_mistral_response"] | 
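
The header comments of the example script list two prerequisites: a `MISTRAL_API_KEY` environment variable and a `sample.pdf` next to the script. A minimal sketch of a pre-flight check for both is shown here. The helper name `missing_prerequisites` and its parameters are hypothetical and not part of this PR; it uses only the standard library and does not call Mistral or Haystack.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def missing_prerequisites(
    directory: Path,
    env: Optional[Mapping[str, str]] = None,
    pdf_name: str = "sample.pdf",
) -> List[str]:
    """Return a list of problems; an empty list means both prerequisites are met.

    Hypothetical helper for illustration only (not part of the PR).
    """
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    problems: List[str] = []
    # Prerequisite 1: the API key must be present and non-empty.
    if not env.get("MISTRAL_API_KEY"):
        problems.append("MISTRAL_API_KEY is not set")
    # Prerequisite 2: the local PDF must exist in the given directory.
    if not (directory / pdf_name).is_file():
        problems.append(f"{pdf_name} not found in {directory}")
    return problems
```

Calling `missing_prerequisites(Path(__file__).resolve().parent)` at the top of the script and exiting when the returned list is non-empty would surface a missing key or file before any request is sent. Passing `env` explicitly keeps the check easy to test without touching the real environment.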
  
    
integrations/mistral/src/haystack_integrations/components/converters/mistral/__init__.py
3 changes: 3 additions & 0 deletions
              | Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| from .ocr_document_converter import MistralOCRDocumentConverter | ||
|  | ||
| __all__ = ["MistralOCRDocumentConverter"] | 
      