
[Feature]: Add vision/multimodal model type for image, video, and pdf analysis #82

@kevincolten


Feature Description

Add a new vision model type to esperanto that supports multimodal LLMs capable of processing images, videos, and PDFs. This would enable:

  1. A create_vision() factory method (or extend create_language() with vision capability detection)
  2. Image input support via base64-encoded images in message content
  3. Provider-specific handling for vision-capable models (GPT-4o, Claude 3, Gemini Pro Vision, LLaVA, etc.)
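As a rough sketch of what item 2 could look like: most providers accept base64-encoded images inside a structured message content list. The message shape below follows the OpenAI-style content-part convention; the helper name and the exact schema esperanto would adopt are assumptions, not a finalized design.

```python
import base64


def build_image_message(prompt: str, image_bytes: bytes,
                        mime: str = "image/png") -> dict:
    """Build a multimodal user message with an inline base64 image.

    Hypothetical helper: the dict layout mirrors the OpenAI-style
    content-part format, which esperanto could normalize across
    providers.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                # Data URL keeps the request self-contained (no upload step)
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
        ],
    }
```

A `create_vision()` factory could then accept such messages directly, or `create_language()` could detect vision capability from the model name and allow them transparently.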

Why would this be helpful?

Vision-capable LLMs are increasingly common, but esperanto currently only supports text-based language models. Adding vision support would enable:

  • PDF analysis - Render pages as images for visual understanding (tables, charts, layouts)
  • Video processing - Extract frames and analyze visual content
  • Image-heavy documents - Process scanned documents, diagrams, screenshots
  • Multi-provider flexibility - Use local models (LLaVA via Ollama) or cloud APIs (GPT-4o, Claude 3, Gemini)
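For the video use case, the core logic is picking a bounded set of evenly spaced frames to send to the model (most vision APIs cap image count per request). A minimal, library-agnostic sketch of that sampling step, assuming the actual frame decoding would be done elsewhere (e.g. OpenCV or ffmpeg):

```python
def sample_frame_indices(total_frames: int, max_frames: int = 8) -> list:
    """Pick up to max_frames evenly spaced frame indices from a video.

    Illustrative sketch only; the PoC's utilities may use a different
    strategy (e.g. fixed time intervals instead of frame counts).
    """
    if total_frames <= max_frames:
        # Short clip: keep every frame
        return list(range(total_frames))
    step = total_frames / max_frames
    # Evenly spaced indices, always starting at frame 0
    return [int(i * step) for i in range(max_frames)]
```

The same pattern applies to PDFs: render each page to an image, then batch pages the way frames are batched here.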

This aligns with esperanto's goal of providing a unified interface across AI providers.


A proof-of-concept implementation exists in a closed PR: lfnovo/open-notebook#533

That implementation includes:

  • default_vision_model configuration field
  • get_vision_model() method with fallback to chat model
  • Utilities for image encoding, video frame extraction, and PDF page rendering
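The fallback behavior from that PoC is simple enough to sketch here. Field and function names follow the PR's description (`default_vision_model`, `get_vision_model()`), but the surrounding config class is an assumption for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    """Hypothetical config holder; mirrors the PoC's fields in spirit."""
    default_chat_model: str
    default_vision_model: Optional[str] = None


def get_vision_model(config: ModelConfig) -> str:
    # Prefer the dedicated vision model; fall back to the chat model
    # so existing configs keep working without changes.
    return config.default_vision_model or config.default_chat_model
```

The fallback means vision support is opt-in: users who never set `default_vision_model` see no behavior change.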

The vision model provisioning already works through the existing LangChain `.to_langchain()` interface; the main gap is esperanto's model registry and capability detection.


Contribution

  • I am a developer and would like to work on implementing this feature (pending maintainer approval)
