Commit 0fd98e6

Add MARS results and bump version to 0.3.3
Updated the README with new benchmark results and documentation for the MARS (Multi-Agent Reasoning System) approach, including performance on AIME 2025, IMO 2025, and LiveCodeBench. Incremented the package version to 0.3.3 in both __init__.py and pyproject.toml to reflect these updates.
1 parent 3513d60 commit 0fd98e6

3 files changed: 18 additions, 2 deletions

README.md

Lines changed: 16 additions & 0 deletions
@@ -72,6 +72,7 @@ OptiLLM delivers measurable improvements across diverse benchmarks:
 
 | Technique | Base Model | Improvement | Benchmark |
 |-----------|------------|-------------|-----------|
+| **MARS** | Gemini 2.5 Flash Lite | **+30.0 points** | AIME 2025 (43.3→73.3) |
 | **CePO** | Llama 3.3 70B | **+18.6 points** | Math-L5 (51.0→69.6) |
 | **AutoThink** | DeepSeek-R1-1.5B | **+9.34 points** | GPQA-Diamond (21.72→31.06) |
 | **LongCePO** | Llama 3.3 70B | **+13.6 points** | InfiniteBench (58.0→71.6) |
@@ -158,6 +159,7 @@ optillm
 
 | Approach | Slug | Description |
 | ------------------------------------ | ------------------ | ---------------------------------------------------------------------------------------------- |
+| [MARS (Multi-Agent Reasoning System)](optillm/mars) | `mars` | Multi-agent reasoning with diverse temperature exploration, cross-verification, and iterative improvement |
 | [Cerebras Planning and Optimization](optillm/cepo) | `cepo` | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques |
 | CoT with Reflection | `cot_reflection` | Implements chain-of-thought reasoning with \<thinking\>, \<reflection> and \<output> sections |
 | PlanSearch | `plansearch` | Implements a search algorithm over candidate plans for solving a problem in natural language |
@@ -747,6 +749,20 @@ Authorization: Bearer your_secret_api_key
 
 ## SOTA results on benchmarks with optillm
 
+### MARS on AIME 2025, IMO 2025, and LiveCodeBench (Oct 2025)
+
+| Benchmark | Approach | Problems | Correct | Accuracy | Improvement |
+|-----------|----------|----------|---------|----------|-------------|
+| **AIME 2025** | Baseline | 30 | 13 | 43.3% | - |
+| **AIME 2025** | **MARS** | 30 | **22** | **73.3%** | **+30.0pp (+69.2%)** |
+| **IMO 2025** | Baseline | 6 | 1 | 16.7% | - |
+| **IMO 2025** | **MARS** | 6 | **2** | **33.3%** | **+16.7pp (+100%)** |
+| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | - |
+| **LiveCodeBench v5/v6** | **MARS** | 105 | **53** | **50.48%** | **+11.43pp (+29.3%)** |
+
+Model: google/gemini-2.5-flash-lite-preview-09-2025 via OpenRouter
+Configuration: 3 agents, 2-pass verification, thinking tags disabled for proofs
+
 ### AutoThink on GPQA-Diamond & MMLU-Pro (May 2025)
 
 | **Model** | **GPQA-Diamond** | | **MMLU-Pro** | |
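
The Improvement column in the MARS table added above reports both an absolute gain in percentage points (pp) and a relative gain over the baseline. As a quick sanity check, both figures can be recomputed from the Problems and Correct columns. This is a minimal sketch; the helper name `improvement` is hypothetical and not part of optillm.

```python
def improvement(problems: int, baseline_correct: int, new_correct: int) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    baseline_acc = 100.0 * baseline_correct / problems
    new_acc = 100.0 * new_correct / problems
    absolute_pp = new_acc - baseline_acc
    # Relative gain is measured against the baseline's number of correct answers.
    relative_pct = 100.0 * (new_correct - baseline_correct) / baseline_correct
    return absolute_pp, relative_pct


# Counts copied from the table in this commit:
# (problems, baseline correct, MARS correct)
results = {
    "AIME 2025": (30, 13, 22),
    "IMO 2025": (6, 1, 2),
    "LiveCodeBench v5/v6": (105, 41, 53),
}

for name, counts in results.items():
    pp, rel = improvement(*counts)
    print(f"{name}: +{pp:.2f}pp (+{rel:.1f}%)")
```

The recomputed values agree with the table after rounding, for example +30.0pp and +69.2% for AIME 2025.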

optillm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # Version information
-__version__ = "0.3.2"
+__version__ = "0.3.3"
 
 # Import from server module
 from .server import (

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "optillm"
-version = "0.3.2"
+version = "0.3.3"
 description = "An optimizing inference proxy for LLMs."
 readme = "README.md"
 license = "Apache-2.0"
