Releases: muxi-ai/onellm
v0.20260130.0
OneLLM v0.20260130.0
v0.20260127.0
OneLLM v0.20260127.0
0.20251222.0
0.20251222.0 - Semantic Cache Improvements
Status: Development Status :: 5 - Production/Stable
Bug Fixes
Semantic Cache False Positives
Fixed critical issues with the semantic cache that caused incorrect cache matches:

- System Prompt Hash Matching: The semantic cache now includes a hash of the system prompt when matching cached responses. Previously, different LLM operations with similar user messages but different system prompts could incorrectly return cached responses from unrelated operations.
- Short Text Exclusion: Messages shorter than 128 characters are now excluded from semantic matching (configurable via `min_text_length`). Short questions have misleadingly high semantic similarity scores, which caused false cache hits. These short messages still benefit from exact hash matching.
- Stricter Default Threshold: The default similarity threshold increased from 0.95 to 0.98 for more reliable matching.
Changes
- Added `_extract_system_hash()` method to compute a SHA-256 hash of the system prompt content
- Modified `_semantic_search()` to require both semantic similarity AND a system hash match
- Added configurable `min_text_length` parameter (default: 128 chars) gating semantic cache operations
- Changed default `similarity_threshold` from 0.95 to 0.98
- Added `caching` parameter to `ChatCompletion.create`/`acreate` for per-call cache bypass
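The matching rule described above can be sketched as follows. This is a minimal illustration, not the library's internals; `extract_system_hash` and `is_semantic_hit` are hypothetical names standing in for the private helpers:

```python
import hashlib

def extract_system_hash(messages):
    # Hash the concatenated system-prompt content (the idea behind
    # _extract_system_hash), so cache entries from operations with
    # different system prompts can never collide.
    system_text = "".join(m["content"] for m in messages if m["role"] == "system")
    return hashlib.sha256(system_text.encode("utf-8")).hexdigest()

def is_semantic_hit(entry_system_hash, query_system_hash, similarity,
                    query_text, threshold=0.98, min_text_length=128):
    # A hit requires BOTH the system hash to match and the similarity to
    # clear the stricter 0.98 default; short texts skip semantic matching
    # entirely (they still get exact hash matching elsewhere).
    if len(query_text) < min_text_length:
        return False
    return entry_system_hash == query_system_hash and similarity >= threshold
```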
Full Changelog: 0.20251218.0...0.20251222.0
0.20251218.0
Breaking Changes
This release contains breaking changes to exception class names. These changes improve Python compatibility and brand consistency.
Exception Renames (PR #9 - Python Builtin Shadowing Fix)
The following exceptions were renamed to avoid shadowing Python's built-in exception names:
| Old Name | New Name |
|---|---|
| `TimeoutError` | `RequestTimeoutError` |
| `PermissionError` | `PermissionDeniedError` |
Migration:

```python
# Before
from onellm.exceptions import TimeoutError, PermissionError

try:
    response = client.chat.completions.create(...)
except TimeoutError:
    print("Request timed out")
except PermissionError:
    print("Permission denied")
```

```python
# After
from onellm.exceptions import RequestTimeoutError, PermissionDeniedError

try:
    response = client.chat.completions.create(...)
except RequestTimeoutError:
    print("Request timed out")
except PermissionDeniedError:
    print("Permission denied")
```

Base Exception Rename (PR #10 - Brand Consistency)
The base exception class was renamed for brand consistency:
| Old Name | New Name |
|---|---|
| `MuxiLLMError` | `OneLLMError` |
Migration:

```python
# Before
from onellm.exceptions import MuxiLLMError

try:
    response = client.chat.completions.create(...)
except MuxiLLMError as e:
    print(f"OneLLM error: {e}")
```

```python
# After
from onellm.exceptions import OneLLMError

try:
    response = client.chat.completions.create(...)
except OneLLMError as e:
    print(f"OneLLM error: {e}")
```

Improvements
- Exception Chaining: All exceptions now use proper exception chaining (`raise ... from e`) for better debugging and stack traces
- Test Suite Fixes: Fixed test suite issues, including state pollution between tests, and improved mocking patterns
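As a quick illustration of the chaining pattern with the renamed classes (the two-class hierarchy here is a simplified stand-in, not the library's full exception tree):

```python
class OneLLMError(Exception):
    """Simplified stand-in for the library's base exception."""

class RequestTimeoutError(OneLLMError):
    """Simplified stand-in for the renamed timeout exception."""

def call_provider():
    try:
        raise TimeoutError("upstream timed out")  # builtin, simulating a transport error
    except TimeoutError as e:
        # `raise ... from e` preserves the original traceback as __cause__,
        # so debuggers see both the transport failure and the wrapper.
        raise RequestTimeoutError("Request timed out") from e
```

Note that because `RequestTimeoutError` no longer shadows the builtin `TimeoutError`, both can be caught unambiguously in the same handler.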
Technical Details
- Exceptions no longer shadow Python builtins, preventing subtle bugs when catching exceptions
- All 373 unit tests passing with improved test isolation
- Exception hierarchy remains unchanged - only class names were updated
What's Changed
- Fix builtin shadowing and exception handling bugs by @Copilot in #9
New Contributors
- @Copilot made their first contribution in #9
Full Changelog: 0.20251121.0...0.20251218.0
0.20251121.0
MiniMax Provider Support
We're excited to announce support for MiniMax's M2 model series with advanced reasoning capabilities! This release introduces a new provider and establishes a reusable architecture for Anthropic-compatible APIs.
🆕 What's New
MiniMax Provider
Access MiniMax's powerful M2 models through OneLLM's unified interface:
- Two model variants:
  - `minimax/MiniMax-M2` - Agentic capabilities with advanced reasoning
  - `minimax/MiniMax-M2-Stable` - Optimized for high concurrency and commercial use
- Key capabilities:
  - 🧠 Interleaved thinking - Enable step-by-step reasoning for complex tasks
  - 🔧 Tool calling - Function calling support for agentic workflows
  - ⚡ Streaming - Real-time token-by-token responses
  - 🌍 Global availability - International and China endpoints
Quick Start
```python
from onellm import ChatCompletion

# Basic chat completion
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=1000
)

# Advanced reasoning with interleaved thinking
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Solve: A train travels 60mph for 2hrs, then 80mph for 3hrs. Total distance?"}],
    max_tokens=500,
    thinking={"enabled": True, "budget_tokens": 20000}
)
```

Configuration
```bash
# Set your API key
export MINIMAX_API_KEY="your-api-key"

# For users in China, configure the China endpoint
export ONELLM_PROVIDERS__MINIMAX__API_BASE=https://api.minimaxi.com/anthropic
```

Get your API key at: https://platform.minimax.io/
🏗️ Architecture Improvements
This release introduces AnthropicCompatibleProvider - a new base class that makes it easy to integrate providers offering Anthropic-compatible APIs. This mirrors our successful OpenAICompatibleProvider pattern and opens the door for rapid integration of future Anthropic-compatible services.
Benefits:
- ✅ Minimal code duplication (~65 lines vs ~850 lines)
- ✅ Consistent behavior across compatible providers
- ✅ Automatic inheritance of all Anthropic features
- ✅ Zero breaking changes to existing code
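The pattern behind those numbers can be sketched roughly like this. The class shapes and the endpoint URL are hypothetical; the real base class also handles requests, streaming, and error mapping:

```python
class AnthropicCompatibleProvider:
    # Hypothetical sketch: the base class owns all Anthropic-protocol
    # behavior, so compatible providers only override identity and config.
    api_base = "https://api.anthropic.com"
    provider_name = "anthropic"

    def endpoint(self, path):
        # Shared request-routing logic lives once, in the base class.
        return f"{self.api_base}{path}"

class MiniMaxProvider(AnthropicCompatibleProvider):
    # A few lines instead of a full client: just the endpoint and name
    # change, everything else is inherited.
    api_base = "https://api.minimax.io/anthropic"  # illustrative URL
    provider_name = "minimax"
```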
📊 By The Numbers
- Total Providers: 22 (up from 21)
- New Tests: 15 comprehensive unit tests
- Test Coverage: 100% for new code
- New Files: 4 (provider, base class, tests, examples)
- Lines Added: 872
📚 Documentation
- Usage Examples: See `examples/providers/minimax_example.py`
- Provider Documentation: Updated `examples/providers/README.md`
- Full Changelog: See `CHANGELOG.md`
🔗 Resources
- MiniMax Documentation: https://platform.minimax.io/docs/
- API Reference: https://platform.minimax.io/docs/api-reference/text-anthropic-api
- Get API Key: https://platform.minimax.io/
🙏 Acknowledgments
Special thanks to the MiniMax team for providing an Anthropic-compatible API, making this integration straightforward and maintaining consistency across the LLM ecosystem.
🚀 What's Next?
With the new AnthropicCompatibleProvider architecture in place, we're ready to quickly add more Anthropic-compatible providers. Stay tuned for more updates!
Full Changelog: v0.20251013.0...v0.20251121.0
0.20251013.0
Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching
- 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
- 50-80% cost savings: Dramatically reduces API costs through intelligent caching
- Zero ongoing API costs: Uses a local multilingual embedding model (`paraphrase-multilingual-MiniLM-L12-v2`)
- Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms)
- Streaming support: Artificial streaming for cached responses preserves natural UX
- TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
- 50+ language support: Multilingual semantic matching out of the box
- LRU eviction: Memory-bounded with configurable max entries (default: 1000)
Cache Configuration
```python
import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,               # Maximum cache entries
    p=0.95,                         # Similarity threshold (0-1)
    hash_only=False,                # Enable semantic matching
    stream_chunk_strategy="words",  # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,          # Chunks per yield
    ttl=86400                       # Time-to-live in seconds (1 day)
)

# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Cache management
stats = onellm.cache_stats()  # Get hit/miss/entries stats
onellm.clear_cache()          # Clear all entries
onellm.disable_cache()        # Disable caching
```

Performance Benchmarks
- Hash exact match: ~2µs (2,000,000% faster than API)
- Semantic match: ~18ms (1,500-5,000% faster than API)
- Typical API call: 300-1000ms
- Streaming simulation: Instant cached response with natural chunked delivery
- Model download: One-time 118MB download (~13s on first init)
Technical Details
- Dependencies: Added `sentence-transformers>=2.0.0` and `faiss-cpu>=1.7.0` to core dependencies
- Memory-only: In-memory cache for long-running processes (no persistence)
- Thread-safe: OrderedDict-based LRU with atomic operations
- Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
- TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
- Hash key filtering: Excludes non-semantic parameters (`stream`, `timeout`, `metadata`) from the cache key
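An OrderedDict-based, thread-safe LRU with TTL refresh-on-access can be sketched like this (illustrative only, not the library's implementation):

```python
import time
from collections import OrderedDict
from threading import Lock

class LRUTTLCache:
    def __init__(self, max_entries=1000, ttl=86400):
        self._data = OrderedDict()  # key -> (value, expires_at)
        self._max = max_entries
        self._ttl = ttl
        self._lock = Lock()  # atomic get/put, as the notes describe

    def get(self, key):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            value, expires_at = item
            if time.monotonic() >= expires_at:
                del self._data[key]  # expired entry
                return None
            # Refresh-on-access: bump recency AND extend the TTL,
            # keeping frequently-used entries alive.
            self._data.move_to_end(key)
            self._data[key] = (value, time.monotonic() + self._ttl)
            return value

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic() + self._ttl)
            self._data.move_to_end(key)
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used
```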
Documentation
- New docs: Comprehensive `docs/caching.md` with architecture, usage, and best practices
- Updated README: Highlighted semantic caching in Key Features and Advanced Features
- Updated docs: Added caching to `docs/README.md`, `docs/advanced-features.md`, and `docs/quickstart.md`
- Examples: Added `examples/cache_example.py` demonstrating all cache features
Use Cases
Ideal for:
- High-traffic web applications with repeated queries
- Interactive demos and chatbots
- Development and testing environments
- API cost optimization
- Latency-sensitive applications
Limited for:
- Stateless serverless functions (short-lived processes)
- Highly unique, non-repetitive queries
- Contexts requiring strict data freshness
What's Changed
- feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5
Full Changelog: 0.20251008.0...0.20251013.0
0.20251008.0
0.20251008.0 - ScalVer Adoption
Status: Development Status :: 5 - Production/Stable
Versioning Change
- ScalVer Adoption: OneLLM now uses ScalVer (Scalable Calendar Versioning) instead of Semantic Versioning
  - Version format: `MAJOR.YYYYMMDD.PATCH` (daily cadence)
  - Current version: `0.20251008.0` (October 8, 2025)
  - MAJOR = 0 indicates alpha/experimental status per ScalVer convention
  - DATE segment uses daily format (YYYYMMDD) for maximum release flexibility
  - PATCH increments for multiple releases on the same day
  - ScalVer is SemVer-compatible, so existing tooling continues to work
  - Provides clear calendar-based release tracking while maintaining compatibility guarantees
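Because every ScalVer tag is syntactically valid SemVer, a plain tuple comparison orders releases correctly. A small sketch (`parse_scalver` is a hypothetical helper, not part of OneLLM):

```python
import re

def parse_scalver(version):
    # ScalVer: MAJOR.YYYYMMDD.PATCH. Parsing into an integer tuple gives
    # the same ordering SemVer tooling would compute.
    m = re.fullmatch(r"(\d+)\.(\d{8})\.(\d+)", version)
    if not m:
        raise ValueError(f"not a ScalVer version: {version!r}")
    return tuple(int(g) for g in m.groups())
```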
Why ScalVer?
ScalVer offers the best of both worlds:
- Time-based clarity: Know exactly when a release was made from the version number
- SemVer compatibility: All existing package managers and tooling work unchanged
- Flexible cadence: Daily format allows for rapid iteration and hotfixes
- Breaking change tracking: MAJOR version still signals breaking changes
- Tool support: Every ScalVer tag is syntactically valid SemVer 2.0
For more information about ScalVer, visit scalver.org.
Async Reliability
- Replaced manual event loop creation with `utils.run_async`, letting synchronous APIs safely reuse running loops in notebooks and web frameworks.
- Added Jupyter-aware fallbacks (`nest_asyncio`) and clearer guidance when sync methods are invoked from async contexts.
- Published `utils.maybe_await` to normalize sync/async callables across helpers.
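The core of this pattern can be sketched as follows. This is a simplified stand-in for `utils.run_async`, omitting the `nest_asyncio` fallback the release mentions:

```python
import asyncio

def run_async(coro):
    # If no loop is running, creating one with asyncio.run() is safe.
    # Inside a running loop (notebook cell, async web handler), a blocking
    # sync call would deadlock, so raise clear guidance instead.
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)
    raise RuntimeError(
        "Sync API called from an async context; await the async variant instead."
    )

async def sample():
    return 42
```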
Input Validation Guardrails
- Introduced `onellm.validators` to enforce safe ranges for temperature, token limits, penalties, stop sequences, and related parameters.
- Added provider-aware model validation so invalid OpenAI, Anthropic, Mistral, and other model names fail fast with actionable messages.
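A range guardrail of this kind might look like the following (illustrative; the bounds and function name are assumptions, not the library's exact rules):

```python
def validate_temperature(value, lo=0.0, hi=2.0):
    # Fail fast with an actionable message instead of a provider-side 400.
    if (not isinstance(value, (int, float)) or isinstance(value, bool)
            or not lo <= value <= hi):
        raise ValueError(
            f"temperature must be a number in [{lo}, {hi}], got {value!r}"
        )
    return float(value)
```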
Secure File Handling & Streaming
- Hardened `File.upload`/`File.aupload` by sanitizing filenames, enforcing extension and MIME allowlists, and applying streaming-safe size limits for files, bytes, and file-like objects.
- Propagated validated filenames through every provider while closing directory traversal, TOCTOU, and race-condition gaps surfaced in review.
- Stabilized Amazon Bedrock streaming with aligned async usage, higher timeouts, and queue-handling fixes.
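The filename sanitization described above can be sketched like this (a minimal illustration; the allowlist contents and helper name are assumptions):

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".jsonl"}  # illustrative allowlist

def sanitize_filename(name):
    # Strip directory components (blocking ../ traversal on both path
    # separators), then enforce an extension allowlist.
    base = os.path.basename(name.replace("\\", "/"))
    if base in ("", ".", ".."):
        raise ValueError("invalid filename")
    ext = os.path.splitext(base)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension {ext!r} not allowed")
    return base
```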
Testing & Security Evidence
- Added dedicated unit and integration coverage for async helper behavior, file validation, and provider upload regressions.
- Captured proactive security scan results and remediation reports documenting the hardening work.
What's Changed
- fix: resolve event loop management issues for async/sync interoperability by @ranaroussi in #1
- feat: add comprehensive input validation for API parameters by @ranaroussi in #3
- security: add comprehensive file path validation to prevent attacks by @ranaroussi in #2
- feat: Adopt ScalVer versioning scheme (0.20251008.0) by @ranaroussi in #4
New Contributors
- @ranaroussi made their first contribution in #1
Full Changelog: 0.1.4...0.20251008.0
0.1.4
New Providers
- Vercel AI Gateway: Added OpenAI-compatible provider for Vercel AI Gateway
  - Access 100+ models from OpenAI, Anthropic, Google, Meta, xAI, Mistral, DeepSeek, and more
  - API Base: `https://ai-gateway.vercel.sh/v1`
  - Model naming: `vercel/vendor/model` (e.g., `vercel/openai/gpt-4o-mini`, `vercel/anthropic/claude-sonnet-4`)
  - Supports streaming, JSON mode, function calling, and vision capabilities
  - Authentication via the `VERCEL_AI_API_KEY` environment variable
Model Updates (2025 Releases)
- OpenAI: Added GPT-5 family (gpt-5, gpt-5-pro, gpt-5-mini, gpt-5-nano)
- Anthropic: Added Claude 4 family (claude-sonnet-4.5, claude-opus-4.1, claude-sonnet-4, claude-opus-4)
- Google: Added Gemini 2.5 family (gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-image)
- Mistral: Added specialized models (codestral, pixtral, devstral, voxtral, ministral)
Documentation
- Updated provider count from 20 to 21 across all documentation
- Added comprehensive provider documentation with model lists
- Added Vercel setup guide and examples
0.1.3
Enhancements
- Cache-Aware Usage Metrics: Extended `UsageInfo` with `*_cached`/`*_uncached` counts while keeping totals intact for billing parity.
  - OpenAI adapter now surfaces cache hits via `prompt_tokens_details.cached_tokens`.
  - Anthropic adapter maps `cache_read_input_tokens`/`cache_creation_input_tokens` into the unified schema.
  - All consumers continue to receive `total_tokens` plus the new fields, which default to 0 when providers omit cache data.
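The per-provider mapping above amounts to something like the following sketch. The raw field names come from the release notes; the unified output keys and the `normalize_usage` helper are hypothetical simplifications of `UsageInfo`:

```python
def normalize_usage(provider, raw):
    # Illustrative-only: surface cached-token counts alongside unchanged
    # totals, defaulting to 0 when the provider omits cache data.
    cached = 0
    if provider == "openai":
        cached = raw.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    elif provider == "anthropic":
        cached = raw.get("cache_read_input_tokens", 0)
    return {
        "total_tokens": raw.get("total_tokens", 0),  # kept intact for billing parity
        "prompt_tokens_cached": cached,
    }
```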
New Providers
- GLM (Z.AI): Added OpenAI-compatible provider targeting `https://api.z.ai/api/paas/v4`.
  - Enables access to the GLM-4 model family with streaming, JSON mode, tool calling, and vision support.
  - Reads credentials from the `GLM_API_KEY` or `ZAI_API_KEY` environment variable.
Maintenance
- Adjusted configuration loader to accept multiple environment variable aliases per provider.
- Added focused unit tests covering cache usage normalization and GLM provider initialization.
Full Changelog: 0.1.2...0.1.3
0.1.2
OpenAI Provider Parameter Updates
- Fixed compatibility issues with newer OpenAI models
  - Automatically converts `max_tokens` to `max_completion_tokens` for all OpenAI models
  - Removes the `temperature` parameter for GPT-5 and o-series models that only support the default temperature
  - Ensures compatibility with GPT-5, o1, o3, and future OpenAI model releases
  - Backward compatible - existing code using `max_tokens` continues to work without changes
Technical Details
- Models starting with `gpt-5` or `o` now have the temperature parameter automatically removed
- All OpenAI API calls now use `max_completion_tokens` instead of the deprecated `max_tokens`
- Changes are transparent to users - no code modifications required
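The transparent rewrite described above amounts to something like this sketch (`adapt_openai_params` is a hypothetical name, not the provider's actual code):

```python
def adapt_openai_params(model, params):
    # Rewrite a user-supplied parameter dict before sending it to OpenAI:
    # rename max_tokens, and drop temperature for GPT-5/o-series models.
    out = dict(params)
    if "max_tokens" in out:
        out["max_completion_tokens"] = out.pop("max_tokens")
    if model.startswith(("gpt-5", "o")):
        out.pop("temperature", None)  # these models only accept the default
    return out
```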
Full Changelog: 0.1.1...0.1.2