|
1 | | -# optillm |
| 1 | +# OptiLLM |
2 | 2 |
|
3 | | -optillm is an OpenAI API compatible optimizing inference proxy which implements several state-of-the-art techniques that can improve the accuracy and performance of LLMs. The current focus is on implementing techniques that improve reasoning over coding, logical and mathematical queries. |
| 3 | +<p align="center"> |
| 4 | + <img src="optillm-logo.png" alt="OptiLLM Logo" width="200" /> |
| 5 | +</p> |
| 6 | + |
| 7 | +<p align="center"> |
| 8 | + <strong>🚀 2-10x accuracy improvements on reasoning tasks with zero training</strong> |
| 9 | +</p> |
| 10 | + |
| 11 | +<p align="center"> |
| 12 | + <a href="https://github.com/codelion/optillm/stargazers"><img src="https://img.shields.io/github/stars/codelion/optillm?style=social" alt="GitHub stars"></a> |
| 13 | + <a href="https://pypi.org/project/optillm/"><img src="https://img.shields.io/pypi/v/optillm" alt="PyPI version"></a> |
| 14 | + <a href="https://pypi.org/project/optillm/"><img src="https://img.shields.io/pypi/dm/optillm" alt="PyPI downloads"></a> |
| 15 | + <a href="https://github.com/codelion/optillm/blob/main/LICENSE"><img src="https://img.shields.io/github/license/codelion/optillm" alt="License"></a> |
| 16 | +</p> |
| 17 | + |
| 18 | +<p align="center"> |
| 19 | + <a href="https://huggingface.co/spaces/codelion/optillm">🤗 HuggingFace Space</a> • |
| 20 | + <a href="https://colab.research.google.com/drive/1SpuUb8d9xAoTh32M-9wJsB50AOH54EaH?usp=sharing">📓 Colab Demo</a> • |
| 21 | + <a href="https://github.com/codelion/optillm/discussions">💬 Discussions</a> |
| 22 | +</p> |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +**OptiLLM** is an OpenAI API-compatible optimizing inference proxy that implements 20+ state-of-the-art techniques to dramatically improve LLM accuracy and performance on reasoning tasks - without requiring any model training or fine-tuning. |
4 | 27 |
|
5 | 28 | It is possible to beat frontier models across diverse tasks by using these techniques to spend additional compute at inference time. A good example of how to combine such techniques is the [CePO approach](optillm/cepo) from Cerebras.
6 | 29 |
|
7 | | -[](https://huggingface.co/spaces/codelion/optillm) |
8 | | -[](https://colab.research.google.com/drive/1SpuUb8d9xAoTh32M-9wJsB50AOH54EaH?usp=sharing) |
9 | | -[](https://github.com/codelion/optillm/discussions) |
| 30 | +## ✨ Key Features |
| 31 | + |
| 32 | +- **🎯 Instant Improvements**: 2-10x better accuracy on math, coding, and logical reasoning |
| 33 | +- **🔌 Drop-in Replacement**: Works with any OpenAI-compatible API endpoint |
| 34 | +- **🧠 20+ Optimization Techniques**: From simple best-of-N to advanced MCTS and planning |
| 35 | +- **📦 Zero Training Required**: Just proxy your existing API calls through OptiLLM |
| 36 | +- **⚡ Production Ready**: Used in production by companies and researchers worldwide |
| 37 | +- **🌍 Multi-Provider**: Supports OpenAI, Anthropic, Google, Cerebras, and 100+ models via LiteLLM |
| 38 | + |
| 39 | +## 🚀 Quick Start |
| 40 | + |
| 41 | +Get powerful reasoning improvements in 3 simple steps: |
| 42 | + |
| 43 | +```bash |
| 44 | +# 1. Install OptiLLM |
| 45 | +pip install optillm |
| 46 | + |
| 47 | +# 2. Start the server |
| 48 | +export OPENAI_API_KEY="your-key-here" |
| 49 | +optillm |
| 50 | + |
| 51 | +# 3. Use with any OpenAI client - just change the model name! |
| 52 | +``` |
| 53 | + |
| 54 | +```python |
| 55 | +from openai import OpenAI |
| 56 | + |
| 57 | +client = OpenAI(base_url="http://localhost:8000/v1") |
| 58 | + |
| 59 | +# Add 'moa-' prefix for Mixture of Agents optimization |
| 60 | +response = client.chat.completions.create( |
| 61 | + model="moa-gpt-4o-mini", # This gives you GPT-4o performance from GPT-4o-mini! |
| 62 | + messages=[{"role": "user", "content": "Solve: If 2x + 3 = 7, what is x?"}] |
| 63 | +) |
| 64 | +``` |
| 65 | + |
| 66 | +**Before OptiLLM**: "x = 2" (answer only, no reasoning shown)
| 67 | +**After OptiLLM**: "Let me work through this step by step: 2x + 3 = 7, so 2x = 4, therefore x = 2" ✅
10 | 68 |
|
11 | | -## Installation |
| 69 | +## 📊 Proven Results |
| 70 | + |
| 71 | +OptiLLM delivers measurable improvements across diverse benchmarks: |
| 72 | + |
| 73 | +| Technique | Base Model | Improvement | Benchmark | |
| 74 | +|-----------|------------|-------------|-----------| |
| 75 | +| **CePO** | Llama 3.3 70B | **+18.6 points** | Math-L5 (51.0→69.6) | |
| 76 | +| **AutoThink** | DeepSeek-R1-1.5B | **+9.34 points** | GPQA-Diamond (21.72→31.06) | |
| 77 | +| **LongCePO** | Llama 3.3 70B | **+13.6 points** | InfiniteBench (58.0→71.6) | |
| 78 | +| **MOA** | GPT-4o-mini | **Matches GPT-4** | Arena-Hard-Auto | |
| 79 | +| **PlanSearch** | GPT-4o-mini | **+20% pass@5** | LiveCodeBench | |
| 80 | + |
| 81 | +*Full benchmark results below* ⬇️ |
| 82 | + |
| 83 | +## 🏗️ Installation |
12 | 84 |
|
13 | 85 | ### Using pip |
14 | 86 |
|
@@ -48,6 +120,48 @@ source .venv/bin/activate |
48 | 120 | pip install -r requirements.txt |
49 | 121 | ``` |
50 | 122 |
|
| 123 | +## Implemented techniques |
| 124 | + |
| 125 | +| Approach | Slug | Description | |
| 126 | +| ------------------------------------ | ------------------ | ---------------------------------------------------------------------------------------------- | |
| 127 | +| [Cerebras Planning and Optimization](optillm/cepo) | `cepo` | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques | |
| 128 | +| CoT with Reflection | `cot_reflection` | Implements chain-of-thought reasoning with \<thinking\>, \<reflection\>, and \<output\> sections |
| 129 | +| PlanSearch | `plansearch` | Implements a search algorithm over candidate plans for solving a problem in natural language | |
| 130 | +| ReRead | `re2` | Implements rereading to improve reasoning by processing queries twice | |
| 131 | +| Self-Consistency | `self_consistency` | Implements an advanced self-consistency method | |
| 132 | +| Z3 Solver | `z3` | Utilizes the Z3 theorem prover for logical reasoning | |
| 133 | +| R* Algorithm | `rstar` | Implements the R* algorithm for problem-solving | |
| 134 | +| LEAP | `leap` | Learns task-specific principles from few shot examples | |
| 135 | +| Round Trip Optimization | `rto` | Optimizes responses through a round-trip process | |
| 136 | +| Best of N Sampling | `bon` | Generates multiple responses and selects the best one | |
| 137 | +| Mixture of Agents | `moa` | Combines responses from multiple critiques | |
| 138 | +| Monte Carlo Tree Search | `mcts` | Uses MCTS for decision-making in chat responses | |
| 139 | +| PV Game | `pvg` | Applies a prover-verifier game approach at inference time | |
| 140 | +| CoT Decoding | N/A for proxy | Implements chain-of-thought decoding to elicit reasoning without explicit prompting | |
| 141 | +| Entropy Decoding | N/A for proxy | Implements adaptive sampling based on the uncertainty of tokens during generation | |
| 142 | +| Thinkdeeper | N/A for proxy | Implements the `reasoning_effort` param from OpenAI for reasoning models like DeepSeek R1 | |
| 143 | +| [AutoThink](optillm/autothink) | N/A for proxy | Combines query complexity classification with steering vectors to enhance reasoning | |
| 144 | + |
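| | +Approaches with a slug are selected the same way as the `moa-` example in the Quick Start: prepend the slug to the model name. Below is a minimal sketch, assuming the proxy from the Quick Start is running at `http://localhost:8000/v1` with your upstream API key exported; the base model and prompt are only illustrative:
| | +
| | +```python
| | +from openai import OpenAI
| | +
| | +# Talk to the local OptiLLM proxy instead of the provider directly
| | +client = OpenAI(base_url="http://localhost:8000/v1")
| | +
| | +# Prepend an approach slug from the table above, e.g. `bon-` for Best of N
| | +# sampling or `mcts-` for Monte Carlo Tree Search.
| | +response = client.chat.completions.create(
| | +    model="bon-gpt-4o-mini",
| | +    messages=[{"role": "user", "content": "How many r's are in the word strawberry?"}],
| | +)
| | +print(response.choices[0].message.content)
| | +```
| | +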
| 145 | +## Implemented plugins |
| 146 | + |
| 147 | +| Plugin | Slug | Description | |
| 148 | +| ----------------------- | ------------------ | ---------------------------------------------------------------------------------------------- | |
| 149 | +| [System Prompt Learning](optillm/plugins/spl) | `spl` | Implements what [Andrej Karpathy called the third paradigm](https://x.com/karpathy/status/1921368644069765486) for LLM learning, enabling the model to acquire problem-solving knowledge and strategies |
| 150 | +| [Deep Think](optillm/plugins/deepthink) | `deepthink` | Implements a Gemini-like Deep Think approach using inference time scaling for reasoning LLMs | |
| 151 | +| [Long-Context Cerebras Planning and Optimization](optillm/plugins/longcepo) | `longcepo` | Combines planning and divide-and-conquer processing of long documents to enable infinite context | |
| 152 | +| Majority Voting | `majority_voting` | Generates k candidate solutions and selects the most frequent answer through majority voting (default k=6) | |
| 153 | +| MCP Client | `mcp` | Implements the model context protocol (MCP) client, enabling you to use any LLM with any MCP Server | |
| 154 | +| Router | `router` | Uses the [optillm-modernbert-large](https://huggingface.co/codelion/optillm-modernbert-large) model to route requests to different approaches based on the user prompt | |
| 155 | +| Chain-of-Code | `coc` | Implements a chain-of-code approach that combines CoT with code execution and LLM-based code simulation |
| 156 | +| Memory | `memory` | Implements a short-term memory layer that enables unbounded context length with any LLM |
| 157 | +| Privacy | `privacy` | Anonymizes PII data in requests and de-anonymizes it back to the original values in responses |
| 158 | +| Read URLs | `readurls` | Reads all URLs found in the request, fetches the content at the URL and adds it to the context | |
| 159 | +| Execute Code | `executecode` | Enables use of code interpreter to execute python code in requests and LLM generated responses | |
| 160 | +| JSON | `json` | Enables structured outputs using the outlines library, supports pydantic types and JSON schema | |
| 161 | +| GenSelect | `genselect` | Generative Solution Selection - generates multiple candidates and selects the best based on quality criteria | |
| 162 | +| Web Search | `web_search` | Performs Google searches using Chrome automation (Selenium) to gather search results and URLs | |
| 163 | +| [Deep Research](optillm/plugins/deep_research) | `deep_research` | Implements Test-Time Diffusion Deep Researcher (TTD-DR) for comprehensive research reports using iterative refinement | |
| 164 | + |
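| | +Plugins use the same slug convention and can be chained in front of a base model. The sketch below assumes the `&` pipeline syntax for combining slugs (treat that operator, the base model, and the prompt as illustrative assumptions; see the available parameters section for the exact combination syntax your version supports):
| | +
| | +```python
| | +from openai import OpenAI
| | +
| | +client = OpenAI(base_url="http://localhost:8000/v1")
| | +
| | +# Chain `readurls` (fetches page content for URLs found in the prompt) with
| | +# `memory` (keeps the fetched content usable within the model's context window).
| | +response = client.chat.completions.create(
| | +    model="readurls&memory-gpt-4o-mini",
| | +    messages=[{
| | +        "role": "user",
| | +        "content": "Summarize the key features described at https://github.com/codelion/optillm",
| | +    }],
| | +)
| | +print(response.choices[0].message.content)
| | +```
| | +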
51 | 165 | We support all major LLM providers and models for inference. You need to set the correct environment variable and the proxy will pick the corresponding client. |
52 | 166 |
|
53 | 167 | | Provider | Required Environment Variables | Additional Notes | |
@@ -339,48 +453,6 @@ Check this log file for connection issues, tool execution errors, and other diag |
339 | 453 |
|
340 | 454 | 4. **Access denied**: For filesystem operations, ensure the paths specified in the configuration are accessible to the process. |
341 | 455 |
|
342 | | -## Implemented techniques |
343 | | - |
344 | | -| Approach | Slug | Description | |
345 | | -| ------------------------------------ | ------------------ | ---------------------------------------------------------------------------------------------- | |
346 | | -| [Cerebras Planning and Optimization](optillm/cepo) | `cepo` | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques | |
347 | | -| CoT with Reflection | `cot_reflection` | Implements chain-of-thought reasoning with \<thinking\>, \<reflection> and \<output\> sections | |
348 | | -| PlanSearch | `plansearch` | Implements a search algorithm over candidate plans for solving a problem in natural language | |
349 | | -| ReRead | `re2` | Implements rereading to improve reasoning by processing queries twice | |
350 | | -| Self-Consistency | `self_consistency` | Implements an advanced self-consistency method | |
351 | | -| Z3 Solver | `z3` | Utilizes the Z3 theorem prover for logical reasoning | |
352 | | -| R* Algorithm | `rstar` | Implements the R* algorithm for problem-solving | |
353 | | -| LEAP | `leap` | Learns task-specific principles from few shot examples | |
354 | | -| Round Trip Optimization | `rto` | Optimizes responses through a round-trip process | |
355 | | -| Best of N Sampling | `bon` | Generates multiple responses and selects the best one | |
356 | | -| Mixture of Agents | `moa` | Combines responses from multiple critiques | |
357 | | -| Monte Carlo Tree Search | `mcts` | Uses MCTS for decision-making in chat responses | |
358 | | -| PV Game | `pvg` | Applies a prover-verifier game approach at inference time | |
359 | | -| CoT Decoding | N/A for proxy | Implements chain-of-thought decoding to elicit reasoning without explicit prompting | |
360 | | -| Entropy Decoding | N/A for proxy | Implements adaptive sampling based on the uncertainty of tokens during generation | |
361 | | -| Thinkdeeper | N/A for proxy | Implements the `reasoning_effort` param from OpenAI for reasoning models like DeepSeek R1 | |
362 | | -| [AutoThink](optillm/autothink) | N/A for proxy | Combines query complexity classification with steering vectors to enhance reasoning | |
363 | | - |
364 | | -## Implemented plugins |
365 | | - |
366 | | -| Plugin | Slug | Description | |
367 | | -| ----------------------- | ------------------ | ---------------------------------------------------------------------------------------------- | |
368 | | -| [System Prompt Learning](optillm/plugins/spl) | `spl` | Implements what [Andrej Karpathy called the third paradigm](https://x.com/karpathy/status/1921368644069765486) for LLM learning, this enables the model to acquire program solving knowledge and strategies | |
369 | | -| [Deep Think](optillm/plugins/deepthink) | `deepthink` | Implements a Gemini-like Deep Think approach using inference time scaling for reasoning LLMs | |
370 | | -| [Long-Context Cerebras Planning and Optimization](optillm/plugins/longcepo) | `longcepo` | Combines planning and divide-and-conquer processing of long documents to enable infinite context | |
371 | | -| Majority Voting | `majority_voting` | Generates k candidate solutions and selects the most frequent answer through majority voting (default k=6) | |
372 | | -| MCP Client | `mcp` | Implements the model context protocol (MCP) client, enabling you to use any LLM with any MCP Server | |
373 | | -| Router | `router` | Uses the [optillm-modernbert-large](https://huggingface.co/codelion/optillm-modernbert-large) model to route requests to different approaches based on the user prompt | |
374 | | -| Chain-of-Code | `coc` | Implements a chain of code approach that combines CoT with code execution and LLM based code simulation | |
375 | | -| Memory | `memory` | Implements a short term memory layer, enables you to use unbounded context length with any LLM | |
376 | | -| Privacy | `privacy` | Anonymize PII data in request and deanonymize it back to original value in response | |
377 | | -| Read URLs | `readurls` | Reads all URLs found in the request, fetches the content at the URL and adds it to the context | |
378 | | -| Execute Code | `executecode` | Enables use of code interpreter to execute python code in requests and LLM generated responses | |
379 | | -| JSON | `json` | Enables structured outputs using the outlines library, supports pydantic types and JSON schema | |
380 | | -| GenSelect | `genselect` | Generative Solution Selection - generates multiple candidates and selects the best based on quality criteria | |
381 | | -| Web Search | `web_search` | Performs Google searches using Chrome automation (Selenium) to gather search results and URLs | |
382 | | -| [Deep Research](optillm/plugins/deep_research) | `deep_research` | Implements Test-Time Diffusion Deep Researcher (TTD-DR) for comprehensive research reports using iterative refinement | |
383 | | - |
384 | 456 | ## Available parameters |
385 | 457 |
|
386 | 458 | optillm supports various command-line arguments for configuration. When using Docker, these can also be set as environment variables prefixed with `OPTILLM_`. |
@@ -607,6 +679,33 @@ All tests are automatically run on pull requests via GitHub Actions. The workflo |
607 | 679 |
|
608 | 680 | See `tests/README.md` for more details on the test structure and how to write new tests. |
609 | 681 |
|
| 682 | +## 🤝 Contributing |
| 683 | + |
| 684 | +We ❤️ contributions! OptiLLM is built by the community, for the community. |
| 685 | + |
| 686 | +- 🐛 **Found a bug?** [Open an issue](https://github.com/codelion/optillm/issues/new) |
| 687 | +- 💡 **Have an idea?** [Start a discussion](https://github.com/codelion/optillm/discussions) |
| 688 | +- 🔧 **Want to code?** Check out [good first issues](https://github.com/codelion/optillm/labels/good%20first%20issue) |
| 689 | + |
| 690 | +### Development Setup |
| 691 | +```bash |
| 692 | +git clone https://github.com/codelion/optillm.git |
| 693 | +cd optillm |
| 694 | +python -m venv .venv |
| 695 | +source .venv/bin/activate # or `.venv\Scripts\activate` on Windows |
| 696 | +pip install -r requirements.txt |
| 697 | +pip install -r tests/requirements.txt |
| 698 | + |
| 699 | +# Run tests |
| 700 | +python -m pytest tests/ |
| 701 | +``` |
| 702 | + |
| 703 | +## 🌟 Community & Support |
| 704 | + |
| 705 | +- **🚀 Companies using OptiLLM**: [Cerebras](https://cerebras.ai), [Patched](https://patched.codes), and [50+ others](https://github.com/codelion/optillm/discussions/categories/show-and-tell) |
| 706 | +- **💬 Community**: Join our [GitHub Discussions](https://github.com/codelion/optillm/discussions) |
| 707 | +- **📧 Enterprise**: For enterprise support, contact [[email protected]](mailto:[email protected])
| 708 | + |
610 | 709 | ## References |
611 | 710 | - [Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques](https://arxiv.org/abs/2506.08060) |
612 | 711 | - [AutoThink: efficient inference for reasoning LLMs](https://dx.doi.org/10.2139/ssrn.5253327) - [Implementation](optillm/autothink) |
@@ -639,10 +738,20 @@ If you use this library in your research, please cite: |
639 | 738 |
|
640 | 739 | ```bibtex |
641 | 740 | @software{optillm, |
642 | | - title = {Optillm: Optimizing inference proxy for LLMs}, |
| 741 | + title = {OptiLLM: Optimizing inference proxy for LLMs}, |
643 | 742 | author = {Asankhaya Sharma}, |
644 | 743 | year = {2024}, |
645 | 744 | publisher = {GitHub}, |
646 | 745 | url = {https://github.com/codelion/optillm} |
647 | 746 | } |
648 | 747 | ``` |
| 748 | + |
| 749 | +--- |
| 750 | + |
| 751 | +<p align="center"> |
| 752 | + <strong>Ready to optimize your LLMs? Install OptiLLM and see the difference! 🚀</strong> |
| 753 | +</p> |
| 754 | + |
| 755 | +<p align="center"> |
| 756 | + ⭐ <a href="https://github.com/codelion/optillm">Star us on GitHub</a> if you find OptiLLM useful! |
| 757 | +</p> |