Merged
Changes from 41 commits
56 commits
d717d46
Revise MCPTool usage example for Streamable HTTP
Hansehart Sep 26, 2025
c12dd25
Clarify connection types in MCPToolset documentation
Hansehart Sep 26, 2025
2b58a33
Merge pull request #1 from Hansehart/patch-1
Hansehart Sep 26, 2025
1611199
fix: Align with hatch run fmt requirements
Hansehart Sep 26, 2025
d8c3ffd
add: MistralOCRDocumentConverter
Hansehart Oct 13, 2025
877a3bc
add: project files
Hansehart Oct 13, 2025
6fd0394
fix: example lib usage
Hansehart Oct 13, 2025
1abfcbb
move: ocr document converter into child /mistral
Hansehart Oct 13, 2025
6e16719
add: example usage with annotations
Hansehart Oct 13, 2025
6416a0c
add: hatch run fmt
Hansehart Oct 13, 2025
e2ec0b6
add: mistralai
Hansehart Oct 13, 2025
a0c2abe
Merge branch 'main' into add-mistral-ocr
Hansehart Oct 13, 2025
df89124
fix: python3.9 compatibility with using Union, List, Optional
Hansehart Oct 14, 2025
fc7e31d
Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-…
Hansehart Oct 14, 2025
39b6cae
add: new comments and their position
Hansehart Oct 14, 2025
f2170f9
add: moved schemas from init into run to bypass problems with seriali…
Hansehart Oct 14, 2025
7221af4
add: docstring convention
Hansehart Oct 14, 2025
aa8f3bc
add: process multiple documents
Hansehart Oct 14, 2025
1160256
add: robust api handling with catching mistral errors
Hansehart Oct 14, 2025
d351909
add: Union[str, Path, ByteStream] as input
Hansehart Oct 14, 2025
4efc546
add: comment for new inputs
Hansehart Oct 14, 2025
c246153
add: pipeline example
Hansehart Oct 14, 2025
c665cde
fix: example ocr component
Hansehart Oct 14, 2025
0a7cf6a
fix: mistral file upload and pydantic v2 models
Hansehart Oct 14, 2025
0fbf500
add: pipeline example
Hansehart Oct 14, 2025
6620442
add: hint on document annotation page limit
Hansehart Oct 14, 2025
0656d2c
add: mistralai as project dependency
Hansehart Oct 14, 2025
b5ff05f
fix: hatch run fmt
Hansehart Oct 14, 2025
70b81ad
fix: hatch run docs
Hansehart Oct 14, 2025
815c92c
Merge branch 'deepset-ai:main' into add-mistral-ocr
Hansehart Oct 15, 2025
88302b0
add: to dict, from dict
Hansehart Oct 15, 2025
1b48359
Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-…
Hansehart Oct 15, 2025
5030044
add: exclude mistral from compliance workflow (it's Apache 2.0)
Hansehart Oct 15, 2025
1e4c0f1
add: 3 initialization tests
Hansehart Oct 15, 2025
8f847f0
add: 4 se test
Hansehart Oct 15, 2025
b1d5729
add: test w/ proper mocking
Hansehart Oct 15, 2025
6a02ecf
add: real api test when env is set
Hansehart Oct 15, 2025
0cb1e8c
add: delete files by default from mistral if uploaded
Hansehart Oct 15, 2025
9b1b29e
fix: mock file deletion
Hansehart Oct 15, 2025
a4fbdb1
fix: hatch run fmt
Hansehart Oct 15, 2025
dbbb30a
Apply suggestion from @anakin87
Hansehart Oct 15, 2025
e57629c
Merge branch 'deepset-ai:main' into add-mistral-ocr
Hansehart Oct 19, 2025
0961855
Update integrations/mistral/src/haystack_integrations/components/conv…
Hansehart Oct 19, 2025
82f38eb
fix: nested try excepts
Hansehart Oct 19, 2025
e358660
Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-…
Hansehart Oct 19, 2025
fd84e02
add: mention file upload
Hansehart Oct 19, 2025
cc8dd05
Update integrations/mistral/tests/test_ocr_document_converter.py
Hansehart Oct 19, 2025
46b89ff
Update integrations/mistral/tests/test_ocr_document_converter.py
Hansehart Oct 19, 2025
f92fe6d
add: less test code due to pytest.mark.parametrize
Hansehart Oct 19, 2025
99e7989
Merge branch 'add-mistral-ocr' of github.com:Hansehart/haystack-core-…
Hansehart Oct 19, 2025
4a5d316
add: less tests and const class type
Hansehart Oct 19, 2025
6017c15
fix: format
Hansehart Oct 19, 2025
e162a73
Merge branch 'main' into add-mistral-ocr
Hansehart Oct 21, 2025
30fbc23
add: ocr document converter to docusaurus
Hansehart Oct 22, 2025
fa193ec
add: converter to mistral
Hansehart Oct 22, 2025
5f4216e
Merge branch 'main' into add-mistral-ocr
Hansehart Oct 22, 2025
3 changes: 2 additions & 1 deletion .github/workflows/CI_license_compliance.yml
@@ -13,13 +13,14 @@ on:
 env:
   CORE_DATADOG_API_KEY: ${{ secrets.CORE_DATADOG_API_KEY }}
   PYTHON_VERSION: "3.10"
-  EXCLUDE_PACKAGES: "(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg).*"
+  EXCLUDE_PACKAGES: "(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg|mistralai).*"

   # Exclusions must be explicitly motivated
   #
   # - azure-identity is MIT but the license is not available on PyPI
   # - fastembed is Apache 2.0 but the license on PyPI is unclear ("Other/Proprietary License (Apache License)")
   # - ragas is Apache 2.0 but the license is not available on PyPI
+  # - mistralai is Apache 2.0 but the license is not available on PyPI

   # - tqdm is MPL but there are no better alternatives
   # - psycopg is LGPL-3.0 but FOSSA is fine with it
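
Note on the pattern above: a minimal sketch of how the updated `EXCLUDE_PACKAGES` expression behaves, assuming the compliance tooling applies it with Python-style regex semantics (the tool that consumes the variable is not shown in this diff). The helper `is_excluded` is hypothetical and exists only for illustration.

```python
import re

# Pattern copied from the updated EXCLUDE_PACKAGES value in the workflow.
EXCLUDE_PACKAGES = r"(?i)^(azure-identity|fastembed|ragas|tqdm|psycopg|mistralai).*"


def is_excluded(package_name: str) -> bool:
    """Hypothetical helper: True if the name matches the exclusion pattern."""
    return re.match(EXCLUDE_PACKAGES, package_name) is not None


# (?i) makes the match case-insensitive, and the trailing .* turns every
# alternative into a prefix match, so "psycopg-binary" is excluded as well.
for name in ["mistralai", "MistralAI", "psycopg-binary", "haystack-ai"]:
    print(name, is_excluded(name))
```

Because the alternatives are prefixes, any future package whose name merely starts with `mistralai` would also be skipped by the check.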
87 changes: 87 additions & 0 deletions integrations/mistral/examples/indexing_ocr_pipeline.py
@@ -0,0 +1,87 @@
# To run this example, you will need to:
# 1. Set a `MISTRAL_API_KEY` environment variable
# 2. Place a PDF file named `sample.pdf` in the same directory as this script
#
# This example demonstrates OCR document processing with structured annotations,
# embedding the extracted documents using Mistral embeddings, and storing them
# in an InMemoryDocumentStore for later retrieval.
#
# You can customize the ImageAnnotation and DocumentAnnotation schemas below
# to extract different structured information from your documents.

from typing import List

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from mistralai.models import DocumentURLChunk
from pydantic import BaseModel, Field

from haystack_integrations.components.converters.mistral.ocr_document_converter import (
    MistralOCRDocumentConverter,
)
from haystack_integrations.components.embedders.mistral.document_embedder import (
    MistralDocumentEmbedder,
)


# Define schema for structured image annotations (bbox)
class ImageAnnotation(BaseModel):
    image_type: str = Field(..., description="The type of image content")
    description: str = Field(..., description="Brief description of the image")


# Define schema for structured document annotations
class DocumentAnnotation(BaseModel):
    language: str = Field(..., description="Primary language of the document")
    urls: List[str] = Field(..., description="URLs found in the document")
    topics: List[str] = Field(..., description="Main topics covered in the document")


# Initialize document store
document_store = InMemoryDocumentStore()

# Create indexing pipeline
indexing_pipeline = Pipeline()

# Add components to the pipeline
indexing_pipeline.add_component(
    "converter",
    MistralOCRDocumentConverter(pages=[0, 1]),
)
indexing_pipeline.add_component(
    "embedder",
    MistralDocumentEmbedder(),
)
indexing_pipeline.add_component(
    "writer",
    DocumentWriter(document_store=document_store),
)

# Connect components
indexing_pipeline.connect("converter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Prepare sources: URL and local file
sources = [
    DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"),
    "./sample.pdf",  # Local PDF file
]

# Run the pipeline with annotation schemas
result = indexing_pipeline.run(
    {
        "converter": {
            "sources": sources,
            "bbox_annotation_schema": ImageAnnotation,
            "document_annotation_schema": DocumentAnnotation,
        }
    }
)


# Inspect the documents produced by OCR.
# They are optionally enriched with content (from the bbox annotation)
# and semantic metadata (from the document annotation).
documents = document_store.storage
# Inspect the raw Mistral API response for the unprocessed data and usage_info
raw_mistral_response = result["converter"]["raw_mistral_response"]
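
Note on the annotation schemas: they are ordinary pydantic models, so one way to see the structure they describe is to print their JSON schema. The sketch below assumes pydantic v2 (`model_json_schema`) and repeats the `DocumentAnnotation` model from the example. Whether the converter sends exactly this schema to the Mistral API is an assumption; the converter source is not part of this excerpt.

```python
import json
from typing import List

from pydantic import BaseModel, Field


# Same model as in the example file, repeated so the snippet runs on its own.
class DocumentAnnotation(BaseModel):
    language: str = Field(..., description="Primary language of the document")
    urls: List[str] = Field(..., description="URLs found in the document")
    topics: List[str] = Field(..., description="Main topics covered in the document")


# pydantic v2: derive a JSON schema from the model definition.
schema = DocumentAnnotation.model_json_schema()

# Field names, types, and descriptions as they appear in the schema.
print(json.dumps(schema["properties"], indent=2))
# Fields declared with `...` have no default and are therefore required.
print(schema["required"])
```

The `description` strings end up in the schema, so they are worth wording carefully when customizing the models.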
1 change: 1 addition & 0 deletions integrations/mistral/pydoc/config.yml
@@ -5,6 +5,7 @@ loaders:
       "haystack_integrations.components.embedders.mistral.document_embedder",
       "haystack_integrations.components.embedders.mistral.text_embedder",
       "haystack_integrations.components.generators.mistral.chat.chat_generator",
+      "haystack_integrations.components.converters.mistral.ocr_document_converter",
     ]
     ignore_when_discovered: ["__init__"]
 processors:
7 changes: 4 additions & 3 deletions integrations/mistral/pyproject.toml
@@ -23,7 +23,7 @@ classifiers = [
   "Programming Language :: Python :: Implementation :: CPython",
   "Programming Language :: Python :: Implementation :: PyPy",
 ]
-dependencies = ["haystack-ai>=2.15.1"]
+dependencies = ["haystack-ai>=2.15.1", "mistralai>=1.9.11"]

 [project.urls]
 Documentation = "https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mistral#readme"
@@ -58,7 +58,7 @@ dependencies = [
   "pytest-rerunfailures",
   "mypy",
   "pip",
-  "pytz"
+  "pytz",
 ]

 [tool.hatch.envs.test.scripts]
@@ -68,7 +68,8 @@ all = 'pytest {args:tests}'
 cov-retry = 'all --cov=haystack_integrations --reruns 3 --reruns-delay 30 -x'

 types = """mypy -p haystack_integrations.components.embedders.mistral \
-  -p haystack_integrations.components.generators.mistral {args}"""
+  -p haystack_integrations.components.generators.mistral \
+  -p haystack_integrations.components.converters {args}"""

 [tool.mypy]
 install_types = true
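
Note on the new `mistralai>=1.9.11` lower bound: version specifiers are compared component by component, not as strings. The sketch below illustrates the difference with a deliberately simplified, hypothetical helper that only handles plain dotted numeric versions. Real installers follow the full PEP 440 rules, which this does not attempt to reproduce.

```python
MINIMUM = "1.9.11"  # lower bound taken from the dependencies line in pyproject.toml


def version_tuple(version: str) -> tuple:
    """Hypothetical helper: turn '1.9.11' into (1, 9, 11). Plain dotted versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(installed: str, minimum: str = MINIMUM) -> bool:
    """True if `installed` is at least `minimum`, compared numerically per component."""
    return version_tuple(installed) >= version_tuple(minimum)


# Numeric comparison treats 1.10.0 as newer than 1.9.11 ...
print(satisfies_minimum("1.10.0"))
# ... and 1.9.2 as older than 1.9.11.
print(satisfies_minimum("1.9.2"))
# A naive string comparison gets the first case wrong.
print("1.10.0" >= "1.9.11")
```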
@@ -0,0 +1,3 @@
from .ocr_document_converter import MistralOCRDocumentConverter

__all__ = ["MistralOCRDocumentConverter"]