Releases: muxi-ai/onellm
v0.20260130.0
OneLLM v0.20260130.0
v0.20260127.0
OneLLM v0.20260127.0
0.20251222.0
0.20251222.0 - Semantic Cache Improvements
Status: Development Status :: 5 - Production/Stable
Bug Fixes
Semantic Cache False Positives
Fixed critical issues with the semantic cache that caused incorrect cache matches:

- System Prompt Hash Matching: The semantic cache now includes a hash of the system prompt when matching cached responses. Previously, different LLM operations with similar user messages but different system prompts could incorrectly return cached responses from unrelated operations.
- Short Text Exclusion: Messages shorter than 128 characters are now excluded from semantic matching (configurable via `min_text_length`). Short questions have misleadingly high semantic similarity scores, which caused false cache hits. These short messages still benefit from exact hash matching.
- Stricter Default Threshold: The default similarity threshold increased from 0.95 to 0.98 for more reliable matching.
Changes
- Added `_extract_system_hash()` method to compute a SHA-256 hash of the system prompt content
- Modified `_semantic_search()` to require both semantic similarity AND a system hash match
- Added configurable `min_text_length` parameter (default: 128 chars) gating semantic cache operations
- Changed default `similarity_threshold` from 0.95 to 0.98
- Added `caching` parameter to `ChatCompletion.create`/`acreate` for per-call cache bypass
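The matching rule described above can be sketched as follows. This is a minimal illustration, not the library's internals; `extract_system_hash` and `is_semantic_hit` are hypothetical names standing in for the private helpers:

```python
import hashlib

def extract_system_hash(messages):
    # Hash the concatenated system-prompt content (the idea behind
    # _extract_system_hash), so cache entries from operations with
    # different system prompts can never collide.
    system_text = "".join(m["content"] for m in messages if m["role"] == "system")
    return hashlib.sha256(system_text.encode("utf-8")).hexdigest()

def is_semantic_hit(entry_system_hash, query_system_hash, similarity,
                    query_text, threshold=0.98, min_text_length=128):
    # A hit requires BOTH the system hash to match and the similarity to
    # clear the stricter 0.98 default; short texts skip semantic matching
    # entirely (they still get exact hash matching elsewhere).
    if len(query_text) < min_text_length:
        return False
    return entry_system_hash == query_system_hash and similarity >= threshold
```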
Full Changelog: 0.20251218.0...0.20251222.0
0.20251218.0
Breaking Changes
This release contains breaking changes to exception class names. These changes improve Python compatibility and brand consistency.
Exception Renames (PR #9 - Python Builtin Shadowing Fix)
The following exceptions were renamed to avoid shadowing Python's built-in exception names:
| Old Name | New Name |
|---|---|
| `TimeoutError` | `RequestTimeoutError` |
| `PermissionError` | `PermissionDeniedError` |
Migration:

```python
# Before
from onellm.exceptions import TimeoutError, PermissionError

try:
    response = client.chat.completions.create(...)
except TimeoutError:
    print("Request timed out")
except PermissionError:
    print("Permission denied")
```

```python
# After
from onellm.exceptions import RequestTimeoutError, PermissionDeniedError

try:
    response = client.chat.completions.create(...)
except RequestTimeoutError:
    print("Request timed out")
except PermissionDeniedError:
    print("Permission denied")
```

Base Exception Rename (PR #10 - Brand Consistency)
The base exception class was renamed for brand consistency:
| Old Name | New Name |
|---|---|
| `MuxiLLMError` | `OneLLMError` |
Migration:

```python
# Before
from onellm.exceptions import MuxiLLMError

try:
    response = client.chat.completions.create(...)
except MuxiLLMError as e:
    print(f"OneLLM error: {e}")
```

```python
# After
from onellm.exceptions import OneLLMError

try:
    response = client.chat.completions.create(...)
except OneLLMError as e:
    print(f"OneLLM error: {e}")
```

Improvements
- Exception Chaining: All exceptions now use proper exception chaining (`raise ... from e`) for better debugging and stack traces
- Test Suite Fixes: Fixed test suite issues, including state pollution between tests, and improved mocking patterns
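As a quick illustration of the chaining pattern with the renamed classes (the two-class hierarchy here is a simplified stand-in, not the library's full exception tree):

```python
class OneLLMError(Exception):
    """Simplified stand-in for the library's base exception."""

class RequestTimeoutError(OneLLMError):
    """Simplified stand-in for the renamed timeout exception."""

def call_provider():
    try:
        raise TimeoutError("upstream timed out")  # builtin, simulating a transport error
    except TimeoutError as e:
        # `raise ... from e` preserves the original traceback as __cause__,
        # so debuggers see both the transport failure and the wrapper.
        raise RequestTimeoutError("Request timed out") from e
```

Note that because `RequestTimeoutError` no longer shadows the builtin `TimeoutError`, both can be caught unambiguously in the same handler.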
Technical Details
- Exceptions no longer shadow Python builtins, preventing subtle bugs when catching exceptions
- All 373 unit tests passing with improved test isolation
- Exception hierarchy remains unchanged - only class names were updated
What's Changed
- Fix builtin shadowing and exception handling bugs by @Copilot in #9
New Contributors
- @Copilot made their first contribution in #9
Full Changelog: 0.20251121.0...0.20251218.0
0.20251121.0
MiniMax Provider Support
We're excited to announce support for MiniMax's M2 model series with advanced reasoning capabilities! This release introduces a new provider and establishes a reusable architecture for Anthropic-compatible APIs.
🆕 What's New
MiniMax Provider
Access MiniMax's powerful M2 models through OneLLM's unified interface:
- Two model variants:
  - `minimax/MiniMax-M2` - Agentic capabilities with advanced reasoning
  - `minimax/MiniMax-M2-Stable` - Optimized for high concurrency and commercial use
- Key capabilities:
  - 🧠 Interleaved thinking - Enable step-by-step reasoning for complex tasks
  - 🔧 Tool calling - Function calling support for agentic workflows
  - ⚡ Streaming - Real-time token-by-token responses
  - 🌍 Global availability - International and China endpoints
Quick Start
```python
from onellm import ChatCompletion

# Basic chat completion
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=1000
)

# Advanced reasoning with interleaved thinking
response = ChatCompletion.create(
    model="minimax/MiniMax-M2",
    messages=[{"role": "user", "content": "Solve: A train travels 60mph for 2hrs, then 80mph for 3hrs. Total distance?"}],
    max_tokens=500,
    thinking={"enabled": True, "budget_tokens": 20000}
)
```

Configuration
```bash
# Set your API key
export MINIMAX_API_KEY="your-api-key"

# For users in China, configure the China endpoint
export ONELLM_PROVIDERS__MINIMAX__API_BASE=https://api.minimaxi.com/anthropic
```

Get your API key at: https://platform.minimax.io/
🏗️ Architecture Improvements
This release introduces AnthropicCompatibleProvider - a new base class that makes it easy to integrate providers offering Anthropic-compatible APIs. This mirrors our successful OpenAICompatibleProvider pattern and opens the door for rapid integration of future Anthropic-compatible services.
Benefits:
- ✅ Minimal code duplication (~65 lines vs ~850 lines)
- ✅ Consistent behavior across compatible providers
- ✅ Automatic inheritance of all Anthropic features
- ✅ Zero breaking changes to existing code
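The pattern behind those numbers can be sketched roughly like this. The class shapes and the endpoint URL are hypothetical; the real base class also handles requests, streaming, and error mapping:

```python
class AnthropicCompatibleProvider:
    # Hypothetical sketch: the base class owns all Anthropic-protocol
    # behavior, so compatible providers only override identity and config.
    api_base = "https://api.anthropic.com"
    provider_name = "anthropic"

    def endpoint(self, path):
        # Shared request-routing logic lives once, in the base class.
        return f"{self.api_base}{path}"

class MiniMaxProvider(AnthropicCompatibleProvider):
    # A few lines instead of a full client: just the endpoint and name
    # change, everything else is inherited.
    api_base = "https://api.minimax.io/anthropic"  # illustrative URL
    provider_name = "minimax"
```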
📊 By The Numbers
- Total Providers: 22 (up from 21)
- New Tests: 15 comprehensive unit tests
- Test Coverage: 100% for new code
- New Files: 4 (provider, base class, tests, examples)
- Lines Added: 872
📚 Documentation
- Usage Examples: See `examples/providers/minimax_example.py`
- Provider Documentation: Updated `examples/providers/README.md`
- Full Changelog: See `CHANGELOG.md`
🔗 Resources
- MiniMax Documentation: https://platform.minimax.io/docs/
- API Reference: https://platform.minimax.io/docs/api-reference/text-anthropic-api
- Get API Key: https://platform.minimax.io/
🙏 Acknowledgments
Special thanks to the MiniMax team for providing an Anthropic-compatible API, making this integration straightforward and maintaining consistency across the LLM ecosystem.
🚀 What's Next?
With the new AnthropicCompatibleProvider architecture in place, we're ready to quickly add more Anthropic-compatible providers. Stay tuned for more updates!
Full Changelog: v0.20251013.0...v0.20251121.0
0.20251013.0
Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching
- 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
- 50-80% cost savings: Dramatically reduces API costs through intelligent caching
- Zero ongoing API costs: Uses a local multilingual embedding model (`paraphrase-multilingual-MiniLM-L12-v2`)
- Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms)
- Streaming support: Artificial streaming for cached responses preserves natural UX
- TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
- 50+ language support: Multilingual semantic matching out of the box
- LRU eviction: Memory-bounded with configurable max entries (default: 1000)
Cache Configuration
```python
import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,               # Maximum cache entries
    p=0.95,                         # Similarity threshold (0-1)
    hash_only=False,                # Enable semantic matching
    stream_chunk_strategy="words",  # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,          # Chunks per yield
    ttl=86400                       # Time-to-live in seconds (1 day)
)

# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Cache management
stats = onellm.cache_stats()  # Get hit/miss/entries stats
onellm.clear_cache()          # Clear all entries
onellm.disable_cache()        # Disable caching
```

Performance Benchmarks
- Hash exact match: ~2µs (2,000,000% faster than API)
- Semantic match: ~18ms (1,500-5,000% faster than API)
- Typical API call: 300-1000ms
- Streaming simulation: Instant cached response with natural chunked delivery
- Model download: One-time 118MB download (~13s on first init)
Technical Details
- Dependencies: Added `sentence-transformers>=2.0.0` and `faiss-cpu>=1.7.0` to core dependencies
- Memory-only: In-memory cache for long-running processes (no persistence)
- Thread-safe: OrderedDict-based LRU with atomic operations
- Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
- TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
- Hash key filtering: Excludes non-semantic parameters (`stream`, `timeout`, `metadata`) from the cache key
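An OrderedDict-based, thread-safe LRU with TTL refresh-on-access can be sketched like this (illustrative only, not the library's implementation):

```python
import time
from collections import OrderedDict
from threading import Lock

class LRUTTLCache:
    def __init__(self, max_entries=1000, ttl=86400):
        self._data = OrderedDict()  # key -> (value, expires_at)
        self._max = max_entries
        self._ttl = ttl
        self._lock = Lock()  # atomic get/put, as the notes describe

    def get(self, key):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            value, expires_at = item
            if time.monotonic() >= expires_at:
                del self._data[key]  # expired entry
                return None
            # Refresh-on-access: bump recency AND extend the TTL,
            # keeping frequently-used entries alive.
            self._data.move_to_end(key)
            self._data[key] = (value, time.monotonic() + self._ttl)
            return value

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.monotonic() + self._ttl)
            self._data.move_to_end(key)
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used
```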
Documentation
- New docs: Comprehensive `docs/caching.md` with architecture, usage, and best practices
- Updated README: Highlighted semantic caching in Key Features and Advanced Features
- Updated docs: Added caching to `docs/README.md`, `docs/advanced-features.md`, and `docs/quickstart.md`
- Examples: Added `examples/cache_example.py` demonstrating all cache features
Use Cases
Ideal for:
- High-traffic web applications with repeated queries
- Interactive demos and chatbots
- Development and testing environments
- API cost optimization
- Latency-sensitive applications
Limited for:
- Stateless serverless functions (short-lived processes)
- Highly unique, non-repetitive queries
- Contexts requiring strict data freshness
What's Changed
- feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5
Full Changelog: 0.20251008.0...0.20251013.0
0.20251008.0
0.20251008.0 - ScalVer Adoption
Status: Development Status :: 5 - Production/Stable
Versioning Change
- ScalVer Adoption: OneLLM now uses ScalVer (Scalable Calendar Versioning) instead of Semantic Versioning
  - Version format: `MAJOR.YYYYMMDD.PATCH` (daily cadence)
  - Current version: `0.20251008.0` (October 8, 2025)
  - MAJOR = 0 indicates alpha/experimental status per ScalVer convention
  - DATE segment uses daily format (YYYYMMDD) for maximum release flexibility
  - PATCH increments for multiple releases on the same day
  - ScalVer is SemVer-compatible, so existing tooling continues to work
  - Provides clear calendar-based release tracking while maintaining compatibility guarantees
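Because every ScalVer tag is syntactically valid SemVer, a plain tuple comparison orders releases correctly. A small sketch (`parse_scalver` is a hypothetical helper, not part of OneLLM):

```python
import re

def parse_scalver(version):
    # ScalVer: MAJOR.YYYYMMDD.PATCH. Parsing into an integer tuple gives
    # the same ordering SemVer tooling would compute.
    m = re.fullmatch(r"(\d+)\.(\d{8})\.(\d+)", version)
    if not m:
        raise ValueError(f"not a ScalVer version: {version!r}")
    return tuple(int(g) for g in m.groups())
```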
Why ScalVer?
ScalVer offers the best of both worlds:
- Time-based clarity: Know exactly when a release was made from the version number
- SemVer compatibility: All existing package managers and tooling work unchanged
- Flexible cadence: Daily format allows for rapid iteration and hotfixes
- Breaking change tracking: MAJOR version still signals breaking changes
- Tool support: Every ScalVer tag is syntactically valid SemVer 2.0
For more information about ScalVer, visit scalver.org.
Async Reliability
- Replaced manual event loop creation with `utils.run_async`, letting synchronous APIs safely reuse running loops in notebooks and web frameworks.
- Added Jupyter-aware fallbacks (`nest_asyncio`) and clearer guidance when sync methods are invoked from async contexts.
- Published `utils.maybe_await` to normalize sync/async callables across helpers.
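The core of this pattern can be sketched as follows. This is a simplified stand-in for `utils.run_async`, omitting the `nest_asyncio` fallback the release mentions:

```python
import asyncio

def run_async(coro):
    # If no loop is running, creating one with asyncio.run() is safe.
    # Inside a running loop (notebook cell, async web handler), a blocking
    # sync call would deadlock, so raise clear guidance instead.
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)
    raise RuntimeError(
        "Sync API called from an async context; await the async variant instead."
    )

async def sample():
    return 42
```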
Input Validation Guardrails
- Introduced `onellm.validators` to enforce safe ranges for temperature, token limits, penalties, stop sequences, and related parameters.
- Added provider-aware model validation so invalid OpenAI, Anthropic, Mistral, and other model names fail fast with actionable messages.
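A range guardrail of this kind might look like the following (illustrative; the bounds and function name are assumptions, not the library's exact rules):

```python
def validate_temperature(value, lo=0.0, hi=2.0):
    # Fail fast with an actionable message instead of a provider-side 400.
    if (not isinstance(value, (int, float)) or isinstance(value, bool)
            or not lo <= value <= hi):
        raise ValueError(
            f"temperature must be a number in [{lo}, {hi}], got {value!r}"
        )
    return float(value)
```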
Secure File Handling & Streaming
- Hardened `File.upload`/`File.aupload` by sanitizing filenames, enforcing extension and MIME allowlists, and applying streaming-safe size limits for files, bytes, and file-like objects.
- Propagated validated filenames through every provider while closing directory traversal, TOCTOU, and race-condition gaps surfaced in review.
- Stabilized Amazon Bedrock streaming with aligned async usage, higher timeouts, and queue-handling fixes.
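The filename sanitization described above can be sketched like this (a minimal illustration; the allowlist contents and helper name are assumptions):

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".jsonl"}  # illustrative allowlist

def sanitize_filename(name):
    # Strip directory components (blocking ../ traversal on both path
    # separators), then enforce an extension allowlist.
    base = os.path.basename(name.replace("\\", "/"))
    if base in ("", ".", ".."):
        raise ValueError("invalid filename")
    ext = os.path.splitext(base)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension {ext!r} not allowed")
    return base
```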
Testing & Security Evidence
- Added dedicated unit and integration coverage for async helper behavior, file validation, and provider upload regressions.
- Captured proactive security scan results and remediation reports documenting the hardening work.
What's Changed
- fix: resolve event loop management issues for async/sync interoperability by @ranaroussi in #1
- feat: add comprehensive input validation for API parameters by @ranaroussi in #3
- security: add comprehensive file path validation to prevent attacks by @ranaroussi in #2
- feat: Adopt ScalVer versioning scheme (0.20251008.0) by @ranaroussi in #4
New Contributors
- @ranaroussi made their first contribution in #1
Full Changelog: 0.1.4...0.20251008.0
0.1.4
New Providers
- Vercel AI Gateway: Added OpenAI-compatible provider for Vercel AI Gateway
  - Access 100+ models from OpenAI, Anthropic, Google, Meta, xAI, Mistral, DeepSeek, and more
  - API Base: `https://ai-gateway.vercel.sh/v1`
  - Model naming: `vercel/vendor/model` (e.g., `vercel/openai/gpt-4o-mini`, `vercel/anthropic/claude-sonnet-4`)
  - Supports streaming, JSON mode, function calling, and vision capabilities
  - Authentication via the `VERCEL_AI_API_KEY` environment variable
Model Updates (2025 Releases)
- OpenAI: Added GPT-5 family (gpt-5, gpt-5-pro, gpt-5-mini, gpt-5-nano)
- Anthropic: Added Claude 4 family (claude-sonnet-4.5, claude-opus-4.1, claude-sonnet-4, claude-opus-4)
- Google: Added Gemini 2.5 family (gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite, gemini-2.5-flash-image)
- Mistral: Added specialized models (codestral, pixtral, devstral, voxtral, ministral)
Documentation
- Updated provider count from 20 to 21 across all documentation
- Added comprehensive provider documentation with model lists
- Added Vercel setup guide and examples
0.1.3
Enhancements
- Cache-Aware Usage Metrics: Extended `UsageInfo` with `*_cached`/`*_uncached` counts while keeping totals intact for billing parity.
  - OpenAI adapter now surfaces cache hits via `prompt_tokens_details.cached_tokens`.
  - Anthropic adapter maps `cache_read_input_tokens`/`cache_creation_input_tokens` into the unified schema.
  - All consumers continue to receive `total_tokens` plus the new fields, which default to 0 when providers omit cache data.
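The per-provider mapping above amounts to something like the following sketch. The raw field names come from the release notes; the unified output keys and the `normalize_usage` helper are hypothetical simplifications of `UsageInfo`:

```python
def normalize_usage(provider, raw):
    # Illustrative-only: surface cached-token counts alongside unchanged
    # totals, defaulting to 0 when the provider omits cache data.
    cached = 0
    if provider == "openai":
        cached = raw.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    elif provider == "anthropic":
        cached = raw.get("cache_read_input_tokens", 0)
    return {
        "total_tokens": raw.get("total_tokens", 0),  # kept intact for billing parity
        "prompt_tokens_cached": cached,
    }
```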
New Providers
- GLM (Z.AI): Added OpenAI-compatible provider targeting `https://api.z.ai/api/paas/v4`.
  - Enables access to the GLM-4 model family with streaming, JSON mode, tool calling, and vision support.
  - Reads credentials from the `GLM_API_KEY` or `ZAI_API_KEY` environment variable.
Maintenance
- Adjusted configuration loader to accept multiple environment variable aliases per provider.
- Added focused unit tests covering cache usage normalization and GLM provider initialization.
Full Changelog: 0.1.2...0.1.3
0.1.2
OpenAI Provider Parameter Updates
- Fixed compatibility issues with newer OpenAI models
  - Automatically converts `max_tokens` to `max_completion_tokens` for all OpenAI models
  - Removes the `temperature` parameter for GPT-5 and o-series models that only support the default temperature
  - Ensures compatibility with GPT-5, o1, o3, and future OpenAI model releases
  - Backward compatible - existing code using `max_tokens` continues to work without changes
Technical Details
- Models starting with `gpt-5` or `o` now have the temperature parameter automatically removed
- All OpenAI API calls now use `max_completion_tokens` instead of the deprecated `max_tokens`
- Changes are transparent to users - no code modifications required
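The transparent rewrite described above amounts to something like this sketch (`adapt_openai_params` is a hypothetical name, not the provider's actual code):

```python
def adapt_openai_params(model, params):
    # Rewrite a user-supplied parameter dict before sending it to OpenAI:
    # rename max_tokens, and drop temperature for GPT-5/o-series models.
    out = dict(params)
    if "max_tokens" in out:
        out["max_completion_tokens"] = out.pop("max_tokens")
    if model.startswith(("gpt-5", "o")):
        out.pop("temperature", None)  # these models only accept the default
    return out
```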
Full Changelog: 0.1.1...0.1.2