
Releases: muxi-ai/onellm

v0.20260130.0

30 Jan 14:53


v0.20260127.0

27 Jan 22:35


0.20251222.0

24 Dec 11:43


0.20251222.0 - Semantic Cache Improvements

Status: Development Status :: 5 - Production/Stable

Bug Fixes

Semantic Cache False Positives

Fixed critical issues with the semantic cache that caused incorrect cache matches:

  1. System Prompt Hash Matching: The semantic cache now includes a hash of the system prompt when matching cached responses. Previously, different LLM operations with similar user messages but different system prompts could incorrectly return cached responses from unrelated operations.

  2. Short Text Exclusion: Messages shorter than 128 characters are now excluded from semantic matching (configurable via min_text_length). Short questions tend to score misleadingly high on semantic similarity, which caused false cache hits. These short messages still benefit from exact hash matching.

  3. Stricter Default Threshold: Default similarity threshold increased from 0.95 to 0.98 for more reliable matching.

Changes

  • Added _extract_system_hash() method to compute SHA256 hash of system prompt content
  • Modified _semantic_search() to require both semantic similarity AND system hash match
  • Added configurable min_text_length parameter (default: 128 chars) before semantic cache operations
  • Changed default similarity_threshold from 0.95 to 0.98
  • Added caching parameter to ChatCompletion.create/acreate for per-call cache bypass
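The matching rules above can be modeled in isolation. The sketch below is illustrative only, assuming the behavior described in these notes; the function names and constants are stand-ins, not OneLLM's actual internals:

```python
import hashlib

MIN_TEXT_LENGTH = 128        # below this, only exact-hash matching applies
SIMILARITY_THRESHOLD = 0.98  # the new, stricter default

def system_hash(system_prompt: str) -> str:
    # SHA256 of the system prompt, so similar user messages under
    # different system prompts can never collide in the cache
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()

def semantic_match(user_text: str, similarity: float,
                   cached_system_hash: str, current_system_prompt: str) -> bool:
    # A cached entry matches only if ALL gates pass:
    if len(user_text) < MIN_TEXT_LENGTH:
        return False  # short texts fall back to exact hash matching only
    if similarity < SIMILARITY_THRESHOLD:
        return False  # semantic similarity must clear the threshold
    # ...AND the system prompt must be byte-identical
    return cached_system_hash == system_hash(current_system_prompt)
```

Requiring the system-prompt hash in addition to embedding similarity is what prevents cross-operation false positives: two operations can share near-identical user messages yet never share cache entries.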

Full Changelog: 0.20251218.0...0.20251222.0

0.20251218.0

22 Dec 00:01


Breaking Changes

This release contains breaking changes to exception class names. These changes improve Python compatibility and brand consistency.

Exception Renames (PR #9 - Python Builtin Shadowing Fix)

The following exceptions were renamed to avoid shadowing Python's built-in exception names:

Old Name           New Name
TimeoutError       RequestTimeoutError
PermissionError    PermissionDeniedError

Migration:

# Before
from onellm.exceptions import TimeoutError, PermissionError

try:
    response = client.chat.completions.create(...)
except TimeoutError:
    print("Request timed out")
except PermissionError:
    print("Permission denied")

# After
from onellm.exceptions import RequestTimeoutError, PermissionDeniedError

try:
    response = client.chat.completions.create(...)
except RequestTimeoutError:
    print("Request timed out")
except PermissionDeniedError:
    print("Permission denied")

Base Exception Rename (PR #10 - Brand Consistency)

The base exception class was renamed for brand consistency:

Old Name       New Name
MuxiLLMError   OneLLMError

Migration:

# Before
from onellm.exceptions import MuxiLLMError

try:
    response = client.chat.completions.create(...)
except MuxiLLMError as e:
    print(f"OneLLM error: {e}")

# After
from onellm.exceptions import OneLLMError

try:
    response = client.chat.completions.create(...)
except OneLLMError as e:
    print(f"OneLLM error: {e}")

Improvements

  • Exception Chaining: All exceptions now use proper exception chaining (raise ... from e) for better debugging and stack traces
  • Test Suite Fixes: Fixed test suite issues including state pollution between tests and improved mocking patterns
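The chaining pattern is plain Python; here is a generic illustration (the class and function names are placeholders, not OneLLM's):

```python
class ProviderError(Exception):
    """Placeholder for a library-level exception."""

def call_api():
    # Stand-in for a low-level call that fails
    raise ConnectionError("socket closed")

def wrapped_call():
    try:
        call_api()
    except ConnectionError as e:
        # `from e` records the original error as __cause__, so the full
        # traceback shows both the low-level and the library-level failure
        raise ProviderError("provider request failed") from e

try:
    wrapped_call()
except ProviderError as err:
    assert isinstance(err.__cause__, ConnectionError)
```

Without `from e`, the original exception is only loosely attached as `__context__`; explicit chaining makes the causal link unambiguous in stack traces.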

Technical Details

  • Exceptions no longer shadow Python builtins, preventing subtle bugs when catching exceptions
  • All 373 unit tests passing with improved test isolation
  • Exception hierarchy remains unchanged - only class names were updated

What's Changed

  • Fix builtin shadowing and exception handling bugs by @Copilot in #9

New Contributors

  • @Copilot made their first contribution in #9

Full Changelog: 0.20251121.0...0.20251218.0

0.20251121.0

21 Nov 10:33


MiniMax Provider Support

We're excited to announce support for MiniMax's M2 model series with advanced reasoning capabilities! This release introduces a new provider and establishes a reusable architecture for Anthropic-compatible APIs.

🆕 What's New

MiniMax Provider

Access MiniMax's powerful M2 models through OneLLM's unified interface:

  • Two model variants:

    • minimax/MiniMax-M2 - Agentic capabilities with advanced reasoning
    • minimax/MiniMax-M2-Stable - Optimized for high concurrency and commercial use
  • Key capabilities:

    • 🧠 Interleaved thinking - Enable step-by-step reasoning for complex tasks
    • 🔧 Tool calling - Function calling support for agentic workflows
    • Streaming - Real-time token-by-token responses
    • 🌍 Global availability - International and China endpoints

Quick Start

from onellm import ChatCompletion

# Basic chat completion
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=1000
)

# Advanced reasoning with interleaved thinking
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Solve: A train travels 60mph for 2hrs, then 80mph for 3hrs. Total distance?"}],
    max_tokens=500,
    thinking={"enabled": True, "budget_tokens": 20000}
)

Configuration

# Set your API key
export MINIMAX_API_KEY="your-api-key"

# For users in China, configure the China endpoint
export ONELLM_PROVIDERS__MINIMAX__API_BASE=https://api.minimaxi.com/anthropic

Get your API key at: https://platform.minimax.io/

🏗️ Architecture Improvements

This release introduces AnthropicCompatibleProvider - a new base class that makes it easy to integrate providers offering Anthropic-compatible APIs. This mirrors our successful OpenAICompatibleProvider pattern and opens the door for rapid integration of future Anthropic-compatible services.

Benefits:

  • ✅ Minimal code duplication (~65 lines vs ~850 lines)
  • ✅ Consistent behavior across compatible providers
  • ✅ Automatic inheritance of all Anthropic features
  • ✅ Zero breaking changes to existing code

📊 By The Numbers

  • Total Providers: 22 (up from 21)
  • New Tests: 15 comprehensive unit tests
  • Test Coverage: 100% for new code
  • New Files: 4 (provider, base class, tests, examples)
  • Lines Added: 872

🙏 Acknowledgments

Special thanks to the MiniMax team for providing an Anthropic-compatible API, making this integration straightforward and maintaining consistency across the LLM ecosystem.

🚀 What's Next?

With the new AnthropicCompatibleProvider architecture in place, we're ready to quickly add more Anthropic-compatible providers. Stay tuned for more updates!


Full Changelog: v0.20251013.0...v0.20251121.0

0.20251013.0

13 Oct 16:52

Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching

  • 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
  • 50-80% cost savings: Dramatically reduces API costs through intelligent caching
  • Zero ongoing API costs: Uses local multilingual embedding model (paraphrase-multilingual-MiniLM-L12-v2)
  • Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms)
  • Streaming support: Artificial streaming for cached responses preserves natural UX
  • TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
  • 50+ language support: Multilingual semantic matching out of the box
  • LRU eviction: Memory-bounded with configurable max entries (default: 1000)

Cache Configuration

import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,           # Maximum cache entries
    p=0.95,                     # Similarity threshold (0-1)
    hash_only=False,            # Enable semantic matching
    stream_chunk_strategy="words",  # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,      # Chunks per yield
    ttl=86400                   # Time-to-live in seconds (1 day)
)

# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Cache management
stats = onellm.cache_stats()    # Get hit/miss/entries stats
onellm.clear_cache()            # Clear all entries
onellm.disable_cache()          # Disable caching

Performance Benchmarks

  • Hash exact match: ~2µs (roughly 150,000-500,000x faster than a typical API call)
  • Semantic match: ~18ms (roughly 17-55x faster than a typical API call)
  • Typical API call: 300-1000ms
  • Streaming simulation: Instant cached response with natural chunked delivery
  • Model download: One-time 118MB download (~13s on first init)

Technical Details

  • Dependencies: Added sentence-transformers>=2.0.0 and faiss-cpu>=1.7.0 to core dependencies
  • Memory-only: In-memory cache for long-running processes (no persistence)
  • Thread-safe: OrderedDict-based LRU with atomic operations
  • Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
  • TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
  • Hash key filtering: Excludes non-semantic parameters (stream, timeout, metadata) from cache key
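The hash-key filtering can be sketched like this; an illustrative model assuming the excluded keys listed above, not the actual cache-key code:

```python
import hashlib
import json

# Parameters that don't affect the response content, per the notes above
NON_SEMANTIC_KEYS = {"stream", "timeout", "metadata"}

def cache_key(model: str, messages: list, **params) -> str:
    # Keep only parameters that change the semantic result
    semantic_params = {k: v for k, v in sorted(params.items())
                       if k not in NON_SEMANTIC_KEYS}
    # Canonical JSON so key order never changes the hash
    payload = json.dumps(
        {"model": model, "messages": messages, "params": semantic_params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Excluding `stream` is what lets a non-streaming call hit a cache entry created by a streaming call (and vice versa), with the artificial-streaming layer handling delivery.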

Documentation

  • New docs: Comprehensive docs/caching.md with architecture, usage, and best practices
  • Updated README: Highlighted semantic caching in Key Features and Advanced Features
  • Updated docs: Added caching to docs/README.md, docs/advanced-features.md, and docs/quickstart.md
  • Examples: Added examples/cache_example.py demonstrating all cache features

Use Cases

Ideal for:

  • High-traffic web applications with repeated queries
  • Interactive demos and chatbots
  • Development and testing environments
  • API cost optimization
  • Latency-sensitive applications

Limited for:

  • Stateless serverless functions (short-lived processes)
  • Highly unique, non-repetitive queries
  • Contexts requiring strict data freshness

What's Changed

  • feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5

Full Changelog: 0.20251008.0...0.20251013.0

0.20251008.0

08 Oct 23:32


0.20251008.0 - ScalVer Adoption

Status: Development Status :: 5 - Production/Stable

Versioning Change

  • ScalVer Adoption: OneLLM now uses ScalVer (Scalable Calendar Versioning) instead of Semantic Versioning
    • Version format: MAJOR.YYYYMMDD.PATCH (daily cadence)
    • Current version: 0.20251008.0 (October 8, 2025)
    • MAJOR = 0 indicates alpha/experimental status per ScalVer convention
    • DATE segment uses daily format (YYYYMMDD) for maximum release flexibility
    • PATCH increments for multiple releases on the same day
    • ScalVer is SemVer-compatible, so existing tooling continues to work
    • Provides clear calendar-based release tracking while maintaining compatibility guarantees
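Because each segment is a plain integer, ordinary SemVer-style tuple comparison orders ScalVer releases chronologically. A quick demonstration:

```python
def scalver_tuple(version: str) -> tuple:
    # MAJOR.YYYYMMDD.PATCH parses as three integers, so standard
    # version comparison sorts releases in calendar order
    major, date, patch = (int(part) for part in version.split("."))
    return (major, date, patch)

# Later calendar date sorts later
assert scalver_tuple("0.20251008.0") < scalver_tuple("0.20251013.0")
# A same-day hotfix sorts after the original release via PATCH
assert scalver_tuple("0.20251008.0") < scalver_tuple("0.20251008.1")
```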

Why ScalVer?

ScalVer offers the best of both worlds:

  • Time-based clarity: Know exactly when a release was made from the version number
  • SemVer compatibility: All existing package managers and tooling work unchanged
  • Flexible cadence: Daily format allows for rapid iteration and hotfixes
  • Breaking change tracking: MAJOR version still signals breaking changes
  • Tool support: Every ScalVer tag is syntactically valid SemVer 2.0

For more information about ScalVer, visit scalver.org.

Async Reliability

  • Replaced manual event loop creation with utils.run_async, letting synchronous APIs safely reuse running loops in notebooks and web frameworks.
  • Added Jupyter-aware fallbacks (nest_asyncio) and clearer guidance when sync methods are invoked from async contexts.
  • Published utils.maybe_await to normalize sync/async callables across helpers.
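The described behavior of a helper like maybe_await can be modeled as follows; this is a sketch of the pattern, not OneLLM's actual implementation:

```python
import asyncio
import inspect

async def maybe_await(value):
    # Await coroutines and other awaitables; pass plain values through,
    # so helpers can accept sync and async callables uniformly
    if inspect.isawaitable(value):
        return await value
    return value

async def async_source():
    return 42

async def main():
    a = await maybe_await(async_source())  # awaitable input gets awaited
    b = await maybe_await(42)              # plain value passes straight through
    return a, b

print(asyncio.run(main()))
```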

Input Validation Guardrails

  • Introduced onellm.validators to enforce safe ranges for temperature, token limits, penalties, stop sequences, and related parameters.
  • Added provider-aware model validation so invalid OpenAI, Anthropic, Mistral, and other model names fail fast with actionable messages.
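A fail-fast range check of this kind might look like the sketch below. The 0.0-2.0 temperature range is an assumption (it matches OpenAI's documented range); the real validators and their bounds are not shown in these notes:

```python
def validate_temperature(temperature: float) -> float:
    # Reject out-of-range values early with an actionable message,
    # instead of letting the provider return an opaque API error
    if not (0.0 <= temperature <= 2.0):
        raise ValueError(
            f"temperature must be between 0.0 and 2.0, got {temperature}"
        )
    return temperature
```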

Secure File Handling & Streaming

  • Hardened File.upload/File.aupload by sanitizing filenames, enforcing extension and MIME allowlists, and streaming-safe size limits for files, bytes, and file-like objects.
  • Propagated validated filenames through every provider while closing directory traversal, TOCTOU, and race-condition gaps surfaced in review.
  • Stabilized Amazon Bedrock streaming with aligned async usage, higher timeouts, and queue handling fixes.
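The filename-sanitizing step can be illustrated with a minimal sketch; the allowlist here is invented for the example, and the real checks (MIME types, size limits) are broader:

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".png", ".jsonl"}  # illustrative allowlist

def sanitize_filename(name: str) -> str:
    # Drop any directory components to block traversal
    # (e.g. "../../etc/passwd.txt" becomes "passwd.txt")
    base = os.path.basename(name.replace("\\", "/"))
    ext = os.path.splitext(base)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"file extension {ext!r} is not allowed")
    return base
```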

Testing & Security Evidence

  • Added dedicated unit and integration coverage for async helper behavior, file validation, and provider upload regressions.
  • Captured proactive security scan results and remediation reports documenting the hardening work.

What's Changed

  • fix: resolve event loop management issues for async/sync interoperability by @ranaroussi in #1
  • feat: add comprehensive input validation for API parameters by @ranaroussi in #3
  • security: add comprehensive file path validation to prevent attacks by @ranaroussi in #2
  • feat: Adopt ScalVer versioning scheme (0.20251008.0) by @ranaroussi in #4

Full Changelog: 0.1.4...0.20251008.0

0.1.4

02 Oct 09:46


New Providers

  • Vercel AI Gateway: Added OpenAI-compatible provider for Vercel AI Gateway
    • Access 100+ models from OpenAI, Anthropic, Google, Meta, xAI, Mistral, DeepSeek, and more
    • API Base: https://ai-gateway.vercel.sh/v1
    • Model naming: vercel/vendor/model (e.g., vercel/openai/gpt-4o-mini, vercel/anthropic/claude-sonnet-4)
    • Supports streaming, JSON mode, function calling, and vision capabilities
    • Authentication via VERCEL_AI_API_KEY environment variable
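The vercel/vendor/model naming implies a simple two-stage route: the first segment selects the OneLLM provider, and the remainder is forwarded as the gateway-side model id. A sketch of that split (illustrative, not the actual routing code):

```python
def split_model(model: str) -> tuple:
    # "vercel/openai/gpt-4o-mini" -> route to the Vercel gateway,
    # forwarding "openai/gpt-4o-mini" as the model id it understands
    provider, _, remainder = model.partition("/")
    return provider, remainder

assert split_model("vercel/openai/gpt-4o-mini") == ("vercel", "openai/gpt-4o-mini")
```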

Model Updates (2025 Releases)

  • OpenAI: Added GPT-5 family (gpt-5, gpt-5-pro, gpt-5-mini, gpt-5-nano)
  • Anthropic: Added Claude 4 family (claude-sonnet-4.5, claude-opus-4.1, claude-sonnet-4, claude-opus-4)
  • Google: Added Gemini 2.5 family (gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-image)
  • Mistral: Added specialized models (codestral, pixtral, devstral, voxtral, ministral)

Documentation

  • Updated provider count from 20 to 21 across all documentation
  • Added comprehensive provider documentation with model lists
  • Added Vercel setup guide and examples

0.1.3

17 Sep 21:54


Enhancements

  • Cache-Aware Usage Metrics: Extended UsageInfo with *_cached/*_uncached counts while keeping totals intact for billing parity.
    • OpenAI adapter now surfaces cache hits via prompt_tokens_details.cached_tokens.
    • Anthropic adapter maps cache_read_input_tokens / cache_creation_input_tokens into the unified schema.
    • All consumers continue to receive total_tokens plus new fields defaulting to 0 when providers omit cache data.
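The normalization can be sketched for the OpenAI case. Field names like prompt_tokens_cached/prompt_tokens_uncached are assumptions inferred from the *_cached/*_uncached wording above, not confirmed schema:

```python
def normalize_usage(raw: dict) -> dict:
    # Derive cached/uncached splits while leaving totals untouched;
    # providers that omit cache data default to 0 cached tokens
    prompt = raw.get("prompt_tokens", 0)
    cached = raw.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "prompt_tokens_cached": cached,
        "prompt_tokens_uncached": prompt - cached,
        "total_tokens": raw.get("total_tokens", 0),
    }
```

Keeping prompt_tokens and total_tokens untouched is what preserves billing parity for existing consumers.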

New Providers

  • GLM (Z.AI): Added OpenAI-compatible provider targeting https://api.z.ai/api/paas/v4.
    • Enables access to GLM-4 model family with streaming, JSON mode, tool calling, and vision support.
    • Reads credentials from GLM_API_KEY or the ZAI_API_KEY environment variable.

Maintenance

  • Adjusted configuration loader to accept multiple environment variable aliases per provider.
  • Added focused unit tests covering cache usage normalization and GLM provider initialization.

Full Changelog: 0.1.2...0.1.3

0.1.2

09 Sep 10:15


OpenAI Provider Parameter Updates

  • Fixed compatibility issues with newer OpenAI models
    • Automatically converts max_tokens to max_completion_tokens for all OpenAI models
    • Removes temperature parameter for GPT-5 and o-series models that only support default temperature
    • Ensures compatibility with GPT-5, o1, o3, and future OpenAI model releases
    • Backward compatible - existing code using max_tokens continues to work without changes

Technical Details

  • Models starting with gpt-5 or o now have temperature parameter automatically removed
  • All OpenAI API calls now use max_completion_tokens instead of deprecated max_tokens
  • Changes are transparent to users - no code modifications required
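The parameter adaptation described above amounts to a small rewrite step; a sketch of the behavior as documented here (the function name is invented, and the prefix check mirrors the "gpt-5 or o" rule stated above):

```python
def adapt_openai_params(model: str, params: dict) -> dict:
    adapted = dict(params)
    # Newer OpenAI endpoints expect max_completion_tokens
    if "max_tokens" in adapted:
        adapted["max_completion_tokens"] = adapted.pop("max_tokens")
    # GPT-5 and o-series models only support the default temperature
    if model.startswith(("gpt-5", "o")):
        adapted.pop("temperature", None)
    return adapted
```

Because the rewrite happens inside the provider, callers keep passing max_tokens and temperature exactly as before.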

Full Changelog: 0.1.1...0.1.2