
[Feature]: Add vision/multimodal model type for image, video, and pdf analysis #82

@kevincolten


Feature Description

Add a new vision model type to esperanto that supports multimodal LLMs capable of processing images, videos, and PDFs. This would enable:

  1. A create_vision() factory method (or extend create_language() with vision capability detection)
  2. Image input support via base64-encoded images in message content
  3. Provider-specific handling for vision-capable models (GPT-4o, Claude 3, Gemini Pro Vision, LLaVA, etc.)
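As a rough sketch of what item 2 could look like: most providers accept base64-encoded images inside a structured message content list. The message shape below follows the OpenAI-style content-part convention; the helper name and the exact schema esperanto would adopt are assumptions, not a finalized design.

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes,
                        mime: str = "image/png") -> dict:
    """Build a multimodal user message with an inline base64 image.

    Hypothetical helper: the dict layout mirrors the OpenAI-style
    content-part format, which esperanto could normalize across
    providers.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Data URL keeps the request self-contained (no upload step)
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
        ],
    }
```

A `create_vision()` factory could then accept such messages directly, or `create_language()` could detect vision capability from the model name and allow them transparently.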

Why would this be helpful?

Vision-capable LLMs are increasingly common, but esperanto currently only supports text-based language models. Adding vision support would enable:

  • PDF analysis - Render pages as images for visual understanding (tables, charts, layouts)
  • Video processing - Extract frames and analyze visual content
  • Image-heavy documents - Process scanned documents, diagrams, screenshots
  • Multi-provider flexibility - Use local models (LLaVA via Ollama) or cloud APIs (GPT-4o, Claude 3, Gemini)
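For the video use case, the core logic is picking a bounded set of evenly spaced frames to send to the model (most vision APIs cap image count per request). A minimal, library-agnostic sketch of that sampling step, assuming the actual frame decoding would be done elsewhere (e.g. OpenCV or ffmpeg):

```python
def sample_frame_indices(total_frames: int, max_frames: int = 8) -> list:
    """Pick up to max_frames evenly spaced frame indices from a video.

    Illustrative sketch only; the PoC's utilities may use a different
    strategy (e.g. fixed time intervals instead of frame counts).
    """
    if total_frames <= max_frames:
        # Short clip: keep every frame
        return list(range(total_frames))
    step = total_frames / max_frames
    # Evenly spaced indices, always starting at frame 0
    return [int(i * step) for i in range(max_frames)]
```

The same pattern applies to PDFs: render each page to an image, then batch pages the way frames are batched here.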

This aligns with esperanto's goal of providing a unified interface across AI providers.


A proof-of-concept implementation exists in a closed PR: lfnovo/open-notebook#533

That implementation includes:

  • default_vision_model configuration field
  • get_vision_model() method with fallback to chat model
  • Utilities for image encoding, video frame extraction, and PDF page rendering
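The fallback behavior from that PoC is simple enough to sketch here. Field and function names follow the PR's description (`default_vision_model`, `get_vision_model()`), but the surrounding config class is an assumption for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Hypothetical config holder; mirrors the PoC's fields in spirit."""
    default_chat_model: str
    default_vision_model: Optional[str] = None


def get_vision_model(config: ModelConfig) -> str:
    # Prefer the dedicated vision model; fall back to the chat model
    # so existing configs keep working without changes.
    return config.default_vision_model or config.default_chat_model
```

The fallback means vision support is opt-in: users who never set `default_vision_model` see no behavior change.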

The vision model provisioning already works through the existing LangChain `.to_langchain()` interface; the main gap is esperanto's model registry and capability detection.


Contribution

  • I am a developer and would like to work on implementing this feature (pending maintainer approval)
