
[Feature]: Support open-weight VLM backends (vLLM, Ollama, etc.) #114

@statxc

Description


Problem or motivation


The pipeline supports five VLM providers (Gemini, OpenAI, OpenRouter, Bedrock, Anthropic) and four image generation providers, but all of them require a hosted API with API keys and usage costs. There's no documented or tested path for running open-weight models like Qwen-VL, LLaVA, or Llama locally.

The OpenAI provider accepts a custom base_url via OPENAI_BASE_URL, so in theory you could point it at a local vLLM or Ollama server. But this hasn't been tested with open-weight models, and there are known compatibility gaps:

  • JSON mode: The Retriever and Critic agents pass response_format="json", which the OpenAI provider sends as {"type": "json_object"}. Many open-weight models don't support this.
  • Vision: The Critic sends the generated image to the VLM for evaluation, and the Planner sends reference example images for in-context learning. Both agents require a vision-capable model, so text-only models won't work.
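To make these gaps concrete, here is roughly the shape of request the agents depend on (a sketch only — the function name, prompt text, and model name are illustrative, not the pipeline's actual code). A local OpenAI-compatible server has to accept both the `json_object` response format and `image_url` content parts in one request:

```python
import base64

def build_critic_request(image_bytes: bytes, model: str) -> dict:
    """Illustrative: the kind of chat request the Critic sends — vision
    input plus OpenAI-style JSON mode. A local server must accept both."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        # Many open-weight models reject or ignore this field.
        "response_format": {"type": "json_object"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Critique this image. Reply in JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

req = build_critic_request(b"\x89PNG...", "Qwen/Qwen2.5-VL-7B-Instruct")
```

A server that 400s on either `response_format` or the `image_url` content part fails one of the two gaps above.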

Proposed solution

  1. Test the existing OpenAI provider with vLLM and Ollama - Before writing new providers, figure out what already works. Point OPENAI_BASE_URL at a local vLLM or Ollama server running a vision-capable model (e.g. Qwen2.5-VL, LLaVA) and run the pipeline end-to-end. Document what breaks.
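A minimal setup sketch for this test (model names and ports are examples; adjust to your hardware):

```shell
# Option A: serve a vision-capable model with vLLM (OpenAI-compatible API on :8000)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000

# Option B: Ollama (OpenAI-compatible API under /v1 on :11434)
ollama serve          # if not already running as a service
ollama pull llava     # in another terminal

# Point the existing OpenAI provider at the local server
export OPENAI_BASE_URL=http://localhost:8000/v1   # or http://localhost:11434/v1
export OPENAI_API_KEY=dummy                       # local servers typically ignore the key
```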

  2. Handle JSON mode gracefully - Many open-weight models don't support response_format: json_object. Two options: let providers declare whether they support JSON mode so the flag is skipped when they don't, or lean on the fallback parsing that the Retriever and Critic already have for malformed JSON responses. Probably both.
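As a sketch of the fallback route (illustrative only — the Retriever and Critic's existing fallback may differ), a parser that tries strict JSON first, then a fenced code block, then the first brace-delimited span:

```python
import json
import re

def parse_json_response(text: str) -> dict:
    """Fallback parsing for models that ignore JSON mode:
    strict JSON -> ```json fenced block -> first {...} span."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    brace = re.search(r"\{.*\}", text, re.DOTALL)
    if brace:
        return json.loads(brace.group(0))
    raise ValueError("no JSON object found in model response")
```

The capability flag would sit one level up: a provider that declares `supports_json_mode = False` never gets `response_format` in the first place, and this parser handles whatever comes back.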

  3. Add a dedicated Ollama provider if needed - If the OpenAI-compatible base_url approach doesn't cover Ollama well enough, add an OllamaVLM provider.
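If it comes to that, a rough sketch of what such a provider might look like against Ollama's native /api/chat endpoint (the class and method names are hypothetical, not existing pipeline code). The native shape differs from OpenAI's: images go as base64 strings in an images field, and JSON mode is format: "json":

```python
import base64
import json
import urllib.request

class OllamaVLM:
    """Hypothetical provider sketch for Ollama's native /api/chat."""

    def __init__(self, model: str = "llava", host: str = "http://localhost:11434"):
        self.model = model
        self.host = host

    def build_payload(self, prompt: str, images: list, json_mode: bool) -> dict:
        payload = {
            "model": self.model,
            "stream": False,
            "messages": [{
                "role": "user",
                "content": prompt,
                # Ollama takes raw base64 strings, not data: URIs.
                "images": [base64.b64encode(i).decode() for i in images],
            }],
        }
        if json_mode:
            payload["format"] = "json"  # Ollama's native JSON mode
        return payload

    def chat(self, prompt: str, images: list, json_mode: bool = False) -> str:
        req = urllib.request.Request(
            f"{self.host}/api/chat",
            data=json.dumps(self.build_payload(prompt, images, json_mode)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["message"]["content"]
```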

  4. Document tested combinations - List which open-weight models have been tested, what works, and what doesn't. For example: "Qwen2.5-VL via vLLM: planning and critic vision work, JSON mode needs fallback." Users need to know what to expect before they spend time setting up a local server.

Area

New provider support (OpenAI, Anthropic, local models, etc.)

Alternatives considered

  • OpenRouter only - OpenRouter already routes to open-weight models and the provider exists, but it's still a hosted API with costs. Doesn't address the "run locally for free" use case.
  • Dedicated providers for every backend (vLLM, Ollama, llama.cpp, etc.) - Most of these expose OpenAI-compatible APIs, so separate providers are likely overkill. Better to test the base_url approach first and only add dedicated providers where compatibility actually breaks.

Willingness to contribute

  • I'd be willing to submit a PR for this feature

Metadata


Assignees

No one assigned

Labels

enhancement (New feature or request)
