Description
Hi team, first off—thank you for your incredible work. The performance and memory efficiency here are a massive step up from Ollama. It’s allowed me to run models on my GPU that were previously out of reach.
The Challenge
While the current documentation is helpful, the manual tool template configuration creates friction for generic "agentic" use cases, e.g. autonomous coding backed by efficient local GPU inference.
- Compatibility: Frameworks like SWE-Agent and OpenHands rely on the OpenAI-standardized tool-calling format via LiteLLM.
- Current State: The "hacky" integration currently required to drive TabbyAPI through LiteLLM limits its accessibility for automated agent workflows (the sketch below shows the request shape these frameworks emit).
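For illustration, here is a minimal sketch of the standardized tool-calling request such frameworks emit through LiteLLM, routed to a local OpenAI-compatible endpoint via LiteLLM's generic `openai/` provider. The model name, port, and API key are placeholders for a local TabbyAPI instance, not tested values.

```python
import litellm

# A standard OpenAI-format tool definition, as produced by agent frameworks.
tools = [
    {
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Execute a shell command in the sandbox.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }
]

response = litellm.completion(
    model="openai/my-local-model",          # placeholder model name
    api_base="http://localhost:5000/v1",    # placeholder TabbyAPI address
    api_key="dummy-key",                    # placeholder auth token
    messages=[{"role": "user", "content": "List the files in the repo."}],
    tools=tools,
    tool_choice="auto",
)

# Agent frameworks expect parsed tool calls on the returned message.
print(response.choices[0].message.tool_calls)
```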
The Proposal: Pydantic-based Structured Output
A LiteLLM integration/compatibility layer, where TabbyAPI is supported in a similar manner to vLLM, would be an extraordinary addition. Currently, LiteLLM does not support any local inference engine that is optimized for edge deployment on consumer hardware.
This is where TabbyAPI really shines, and with proper integration of tool calling and reasoning, this project could become the leading choice for highly efficient, agentic edge deployment on consumer hardware.
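For comparison, this is roughly how vLLM is addressed today through LiteLLM's hosted_vllm provider, next to what an analogous first-class TabbyAPI route might look like. The `tabbyapi/` prefix is purely hypothetical and does not exist in LiteLLM; model names and ports are placeholders.

```python
import litellm

# Existing pattern: a locally served vLLM instance, selected by provider prefix.
vllm_resp = litellm.completion(
    model="hosted_vllm/meta-llama/Llama-3.1-8B-Instruct",  # example model name
    api_base="http://localhost:8000/v1",                    # example vLLM address
    messages=[{"role": "user", "content": "Hello"}],
)

# Proposed analogue (hypothetical, not an existing LiteLLM route):
# tabby_resp = litellm.completion(
#     model="tabbyapi/my-exl2-model",
#     api_base="http://localhost:5000/v1",
#     messages=[{"role": "user", "content": "Hello"}],
# )
```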
- Goal: Enable TabbyAPI to accept a Pydantic schema and return a structured ChatCompletion object containing the tool call(s) (see the sketch after this list).
- Benefit: This removes the need for users to manually define/configure tool templates (as seen in Wiki #10).
- Integration: Similar to how vLLM integrates with LiteLLM, this would allow TabbyAPI to function as a drop-in provider for standardized agentic tools.
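To make the proposal concrete, here is a rough sketch (not an existing API) of the intended end state: a Pydantic model supplies the tool schema, and the server returns a standard ChatCompletion whose message carries parsed tool calls. The endpoint URL, model name, and auth token are placeholders for a local TabbyAPI instance.

```python
import json
from openai import OpenAI
from pydantic import BaseModel


class EditFile(BaseModel):
    """Arguments for an agent's file-edit action."""
    path: str
    new_content: str


client = OpenAI(base_url="http://localhost:5000/v1", api_key="dummy-key")

response = client.chat.completions.create(
    model="my-local-model",  # placeholder model name
    messages=[{"role": "user", "content": "Fix the typo in README.md"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "edit_file",
            "description": EditFile.__doc__,
            "parameters": EditFile.model_json_schema(),  # schema derived from Pydantic
        },
    }],
    tool_choice="auto",
)

# With first-class support, no custom tool template would be needed server-side:
for call in response.choices[0].message.tool_calls or []:
    args = EditFile.model_validate(json.loads(call.function.arguments))
    print(call.function.name, args)
```

This is the same request/response shape agent frameworks already use against OpenAI and vLLM, which is what would make TabbyAPI a drop-in provider.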
I understand this may involve significant architectural changes. I’m raising this because TabbyAPI occupies a unique and vital niche for consumer-grade GPU inference that other engines currently miss.