Commit 0f9a274

docs: add Mistral OCR usage documentation to README
1 parent b3b43fe commit 0f9a274

1 file changed

Lines changed: 22 additions & 0 deletions

README.md
@@ -34,6 +34,7 @@ RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB
 - 🔌 A built-in [Model Context Protocol](https://modelcontextprotocol.io) (MCP) server that any MCP client like [Claude desktop](https://claude.ai/download) can connect with
 - 💬 Optional customizable ChatGPT-like frontend for [web](https://docs.chainlit.io/deploy/copilot), [Slack](https://docs.chainlit.io/deploy/slack), and [Teams](https://docs.chainlit.io/deploy/teams) with [Chainlit](https://github.com/Chainlit/chainlit)
 - ✍️ Optional conversion of any input document to Markdown with [Pandoc](https://github.com/jgm/pandoc)
+- 🔎 Optional high-quality document processing with [Mistral OCR](https://docs.mistral.ai/capabilities/document/) for PDFs, images, DOCX, and PPTX with automatic image descriptions
 - ✅ Optional evaluation of retrieval and generation performance with [Ragas](https://github.com/explodinggradients/ragas)

 ## Installing
@@ -69,6 +70,12 @@ To add support for filetypes other than PDF, use the `pandoc` extra:
 pip install raglite[pandoc]
 ```

+To add support for high-quality document processing with [Mistral OCR](https://docs.mistral.ai/capabilities/document/), use the `mistral-ocr` extra:
+
+```sh
+pip install raglite[mistral-ocr]
+```
+
 To add support for evaluation, use the `ragas` extra:

 ```sh
@@ -152,6 +159,21 @@ my_config = RAGLiteConfig(
 > [!TIP]
 > ✍️ To insert documents other than PDF, install the `pandoc` extra with `pip install raglite[pandoc]`.

+> [!TIP]
+> 🔎 For higher-quality document processing with automatic image descriptions, install the `mistral-ocr` extra with `pip install raglite[mistral-ocr]` and configure it as follows:
+> ```python
+> from raglite import RAGLiteConfig, MistralOCRConfig
+>
+> my_config = RAGLiteConfig(
+>     document_processor=MistralOCRConfig(
+>         include_image_descriptions=True,  # Describe images, charts, and diagrams as text
+>         image_types=frozenset({"chart", "diagram", "photo", "table", "logo", "icon"}),  # Custom image categories
+>         exclude_image_types=frozenset({"logo", "icon"}),  # Filter out specific types from the output
+>     ),
+> )
+> ```
+> The `image_types` parameter defines the categories that Mistral classifies each image into. You can use the defaults or provide your own domain-specific types. Use `exclude_image_types` to filter out any classified types that are not useful for retrieval.
+
 Next, insert some documents into the database. RAGLite will take care of the [conversion to Markdown](src/raglite/_markdown.py), [optimal level 4 semantic chunking](src/raglite/_split_chunks.py), and [multi-vector embedding with late chunking](src/raglite/_embed.py):

 ```python
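
Reviewer note: the new TIP describes `image_types` as the set of categories each image is classified into and `exclude_image_types` as a filter over those categories. The snippet below is a minimal sketch of that described behavior in plain Python, not RAGLite's actual implementation. `ClassifiedImage` and `filter_images` are hypothetical names introduced only for illustration, and the sketch assumes classification has already happened upstream.

```python
from dataclasses import dataclass


# Hypothetical stand-in for an image that has already been classified.
# This type is NOT part of RAGLite or the Mistral API.
@dataclass(frozen=True)
class ClassifiedImage:
    image_type: str
    description: str


def filter_images(
    images: list[ClassifiedImage],
    image_types: frozenset[str],
    exclude_image_types: frozenset[str],
) -> list[ClassifiedImage]:
    """Keep images whose type is a known category and is not excluded."""
    allowed = image_types - exclude_image_types
    return [img for img in images if img.image_type in allowed]


# Same category sets as in the README example above.
image_types = frozenset({"chart", "diagram", "photo", "table", "logo", "icon"})
exclude_image_types = frozenset({"logo", "icon"})

images = [
    ClassifiedImage("chart", "Bar chart of quarterly revenue"),
    ClassifiedImage("logo", "Company logo"),
    ClassifiedImage("diagram", "System architecture diagram"),
]

kept = filter_images(images, image_types, exclude_image_types)
print([img.image_type for img in kept])  # -> ['chart', 'diagram']
```

Under these assumptions, only the descriptions of the chart and the diagram would reach the retrieval index, while the logo is dropped.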
