Feature Description
Add a new vision model type to esperanto that supports multimodal LLMs capable of processing images, videos, and PDFs. This would enable:
- A `create_vision()` factory method (or extend `create_language()` with vision-capability detection)
- Image input support via base64-encoded images in message content
- Provider-specific handling for vision-capable models (GPT-4o, Claude 3, Gemini Pro Vision, LLaVA, etc.)
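As a rough sketch of what base64 image input could look like, the helpers below build an OpenAI-style multimodal message. The function names (`encode_image_bytes`, `build_vision_message`) and the message shape are illustrative assumptions, not part of esperanto's current API:

```python
import base64


def encode_image_bytes(data: bytes) -> str:
    """Return the base64 encoding of raw image bytes as ASCII text."""
    return base64.b64encode(data).decode("ascii")


def build_vision_message(text: str, image_b64: str, mime: str = "image/png") -> dict:
    """Build a user message mixing a text part and a base64 data-URL image part.

    This follows the OpenAI-style content-parts layout; other providers
    (Anthropic, Gemini) use different shapes, which is exactly the kind of
    difference a unified vision interface would hide.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{image_b64}"},
            },
        ],
    }
```

A provider adapter would translate this neutral structure into each API's native request format.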
Why would this be helpful?
Vision-capable LLMs are increasingly common, but esperanto currently only supports text-based language models. Adding vision support would enable:
- PDF analysis - Render pages as images for visual understanding (tables, charts, layouts)
- Video processing - Extract frames and analyze visual content
- Image-heavy documents - Process scanned documents, diagrams, screenshots
- Multi-provider flexibility - Use local models (LLaVA via Ollama) or cloud APIs (GPT-4o, Claude 3, Gemini)
This aligns with esperanto's goal of providing a unified interface across AI providers.
A proof-of-concept implementation exists in a closed PR: lfnovo/open-notebook#533
That implementation includes:
- `default_vision_model` configuration field
- `get_vision_model()` method with fallback to the chat model
- Utilities for image encoding, video frame extraction, and PDF page rendering
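The configuration field and fallback described above could be sketched as follows. The class and field names are hypothetical, mirroring the proof-of-concept rather than any existing esperanto code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelDefaults:
    """Hypothetical settings object holding per-capability default models."""

    default_chat_model: str
    default_vision_model: Optional[str] = None  # assumed new config field

    def get_vision_model(self) -> str:
        """Return the configured vision model, falling back to the chat model.

        The fallback keeps existing setups working: users who never set a
        vision model still get their chat model (which may itself be
        multimodal, e.g. GPT-4o).
        """
        return self.default_vision_model or self.default_chat_model
```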
The vision model provisioning already works through the existing LangChain `.to_langchain()` interface; the main gap is esperanto's model registry and capability detection.
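Capability detection could start as simply as a name-based check. The prefix list below is purely illustrative; a real implementation would source capabilities from provider metadata or esperanto's model registry:

```python
# Illustrative prefixes only -- a real registry would be driven by provider
# metadata rather than hard-coded model-name patterns.
_VISION_MODEL_PREFIXES = (
    "gpt-4o",
    "claude-3",
    "gemini-pro-vision",
    "gemini-1.5",
    "llava",
)


def supports_vision(model_name: str) -> bool:
    """Best-effort guess at whether a model name denotes a vision-capable model."""
    return model_name.lower().startswith(_VISION_MODEL_PREFIXES)
```

This would let `create_language()` (or a new `create_vision()`) raise early when a user sends image content to a text-only model.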
Contribution
- I am a developer and would like to work on implementing this feature (pending maintainer approval)