
Commit 82b6c24

Merge pull request #259 from codelion/feat-mars
Feat mars
2 parents 0cd322a + 0fd98e6 commit 82b6c24

24 files changed: +5643 −65 lines changed

README.md

Lines changed: 16 additions & 0 deletions
@@ -72,6 +72,7 @@ OptiLLM delivers measurable improvements across diverse benchmarks:

  | Technique | Base Model | Improvement | Benchmark |
  |-----------|------------|-------------|-----------|
+ | **MARS** | Gemini 2.5 Flash Lite | **+30.0 points** | AIME 2025 (43.3→73.3) |
  | **CePO** | Llama 3.3 70B | **+18.6 points** | Math-L5 (51.0→69.6) |
  | **AutoThink** | DeepSeek-R1-1.5B | **+9.34 points** | GPQA-Diamond (21.72→31.06) |
  | **LongCePO** | Llama 3.3 70B | **+13.6 points** | InfiniteBench (58.0→71.6) |
@@ -158,6 +159,7 @@ optillm

  | Approach | Slug | Description |
  | ------------------------------------ | ------------------ | ---------------------------------------------------------------------------------------------- |
+ | [MARS (Multi-Agent Reasoning System)](optillm/mars) | `mars` | Multi-agent reasoning with diverse temperature exploration, cross-verification, and iterative improvement |
  | [Cerebras Planning and Optimization](optillm/cepo) | `cepo` | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques |
  | CoT with Reflection | `cot_reflection` | Implements chain-of-thought reasoning with \<thinking\>, \<reflection\> and \<output\> sections |
  | PlanSearch | `plansearch` | Implements a search algorithm over candidate plans for solving a problem in natural language |
@@ -747,6 +749,20 @@ Authorization: Bearer your_secret_api_key

  ## SOTA results on benchmarks with optillm

+ ### MARS on AIME 2025, IMO 2025, and LiveCodeBench (Oct 2025)
+
+ | Benchmark | Approach | Problems | Correct | Accuracy | Improvement |
+ |-----------|----------|----------|---------|----------|-------------|
+ | **AIME 2025** | Baseline | 30 | 13 | 43.3% | - |
+ | **AIME 2025** | **MARS** | 30 | **22** | **73.3%** | **+30.0pp (+69.2%)** |
+ | **IMO 2025** | Baseline | 6 | 1 | 16.7% | - |
+ | **IMO 2025** | **MARS** | 6 | **2** | **33.3%** | **+16.7pp (+100%)** |
+ | **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | - |
+ | **LiveCodeBench v5/v6** | **MARS** | 105 | **53** | **50.48%** | **+11.43pp (+29.3%)** |
+
+ Model: google/gemini-2.5-flash-lite-preview-09-2025 via OpenRouter
+ Configuration: 3 agents, 2-pass verification, thinking tags disabled for proofs

  ### AutoThink on GPQA-Diamond & MMLU-Pro (May 2025)

  | **Model** | **GPQA-Diamond** | | **MMLU-Pro** | |

optillm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
  # Version information
- __version__ = "0.3.2"
+ __version__ = "0.3.3"

  # Import from server module
  from .server import (

optillm/mars/README.md

Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@

# MARS: Multi-Agent Reasoning System

A sophisticated multi-agent reasoning system designed for challenging mathematical problems, inspired by systems like Gemini 2.5 Pro Deep Think and the successful IMO25 solver.

## Overview

MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
- **Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable)
- **Rigorous verification** using a 2-pass consensus threshold (configurable)
- **Iterative improvement** based on verification feedback
- **OpenRouter reasoning API** for deep mathematical thinking
- **RSA-inspired aggregation** for solution refinement
- **Strategy network** for cross-agent insight sharing
- **Shared workspace** for agent collaboration

## Key Features

### 1. Multi-Agent Architecture
- **3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced)
- **Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies
- **Independent reasoning** followed by collaborative verification

### 2. OpenRouter Reasoning API Integration
- **Effort-based reasoning**: "low", "medium", "high" effort levels via OpenRouter API
- **Adaptive allocation**: Low effort (temp ≤ 0.4), Medium (0.4-0.8), High (> 0.8)
- **Configurable token budgets**: 4K for lightweight coding, 64K for complex reasoning

### 3. Verification System
- **2-pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced)
- **Cross-agent verification**: Agents verify each other's solutions
- **Mathematical rigor**: Focus on complete proofs, not just correct answers
- **Consensus building**: Multiple verified solutions required (see the sketch below)
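
A minimal sketch of this consecutive-pass rule (illustrative names only; the actual verifier lives in `optillm/mars/verifier.py`):

```python
# Sketch of the 2-consecutive-pass rule; `verifiers` stand in for cross-agent verification calls.
from typing import Callable, List

def is_verified(solution: str,
                verifiers: List[Callable[[str], str]],
                passes_required: int = 2) -> bool:
    """Return True once `passes_required` consecutive verifiers answer 'CORRECT'."""
    consecutive = 0
    for verify in verifiers:
        verdict = verify(solution)      # e.g. "CORRECT" or "INCORRECT: <feedback>"
        if verdict.startswith("CORRECT"):
            consecutive += 1
            if consecutive >= passes_required:
                return True
        else:
            consecutive = 0             # a failed check resets the streak
    return False
```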

### 4. Iterative Improvement
- **Feedback-driven**: Solutions improved based on verification feedback
- **Error correction**: Automatic identification and fixing of mathematical errors
- **Logical gap filling**: Strengthening incomplete reasoning steps

## Architecture Components

```
optillm/mars/
├── __init__.py             # Package exports
├── mars.py                 # Main orchestration with parallel execution
├── agent.py                # Individual agent implementation
├── workspace.py            # Shared collaboration workspace
├── verifier.py             # Multi-pass verification system
├── aggregator.py           # RSA-inspired solution aggregation
├── strategy_network.py     # Cross-agent insight sharing
├── answer_extraction.py    # Clean answer extraction with thinking tags
├── prompts.py              # Mathematical reasoning prompts
└── README.md               # This documentation
```

## Configuration

### Default Configuration (Mathematical Reasoning)
```python
DEFAULT_CONFIG = {
    'num_agents': 3,                        # Number of parallel agents
    'max_iterations': 5,                    # Maximum improvement iterations
    'verification_passes_required': 2,      # Consecutive passes needed
    'consensus_threshold': 2,               # Verified solutions for consensus
    'min_verified_solutions': 1,            # Minimum to proceed
    'max_tokens': 64000,                    # Token budget for complex reasoning
    'max_verification_attempts': 3,         # Max verification retries
    'early_termination': True,              # Stop on consensus
    'use_reasoning_api': True,              # Enable OpenRouter reasoning
    # RSA-inspired aggregation
    'enable_aggregation': True,             # Enable solution aggregation
    'population_size': 6,                   # Population for diversity
    'aggregation_size': 3,                  # Solutions per aggregation
    'aggregation_loops': 3,                 # Aggregation iterations
    # Strategy Network
    'enable_strategy_network': True,        # Cross-agent insight sharing
    'strategy_extraction_enabled': True,    # Extract reasoning strategies
    'cross_agent_enhancement': True,        # Enhanced solutions via peer strategies
    # Thinking tags and answer extraction
    'use_thinking_tags': True,              # Wrap reasoning in <think> tags
    'answer_extraction_mode': 'auto',       # 'auto', 'code', 'math', or 'none'
}
```

### Lightweight Configuration (Coding Benchmarks)
```python
LIGHTWEIGHT_CONFIG = {
    'num_agents': 2,                        # Reduced agent count
    'max_iterations': 2,                    # Faster iteration limit
    'verification_passes_required': 1,      # Single-pass verification
    'consensus_threshold': 1,               # Lower threshold for 2 agents
    'min_verified_solutions': 1,
    'max_tokens': 4000,                     # Smaller token budget
    'max_verification_attempts': 2,
    'early_termination': True,
    'use_reasoning_api': True,
    # Disable expensive features for speed
    'enable_aggregation': False,            # Skip RSA aggregation
    'enable_strategy_network': False,       # Skip strategy network
    'strategy_extraction_enabled': False,
    'cross_agent_enhancement': False,
    # Thinking tags still enabled
    'use_thinking_tags': True,
    'answer_extraction_mode': 'auto',
}
```

**Note**: MARS automatically uses lightweight config when `max_tokens ≤ 4000` in the request.
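
For example, a request with a small completion budget is routed to the lightweight configuration automatically (hypothetical local server; this mirrors the usage examples below):

```python
# Illustrative request: max_tokens at or below 4000 makes MARS use LIGHTWEIGHT_CONFIG.
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="anything")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[{"role": "user", "content": "Coding problem here"}],
    max_tokens=4000,                           # triggers the lightweight config
    extra_body={"optillm_approach": "mars"},
)
```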

## Usage

### Via OptiLLM Server
```bash
# Start OptiLLM with MARS support
python optillm.py --model google/gemini-2.5-flash-lite --approach mars

# Make API call
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mars-google/gemini-2.5-flash-lite",
    "messages": [
      {
        "role": "user",
        "content": "Solve this IMO problem: Find all positive integers n such that..."
      }
    ]
  }'
```

### Via extra_body Parameter
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="anything")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[
        {"role": "user", "content": "Mathematical problem here"}
    ],
    extra_body={"optillm_approach": "mars"}
)
```

### Via Prompt Tags
```python
response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[
        {"role": "system", "content": "<optillm_approach>mars</optillm_approach>"},
        {"role": "user", "content": "Mathematical problem here"}
    ]
)
```

## Process Flow

### Phase 1: Multi-Agent Exploration (Parallel)
1. Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
2. Each agent independently analyzes the problem
3. Generate initial solutions using OpenRouter reasoning API with effort levels
4. All agent API calls executed in parallel via ThreadPoolExecutor
5. Solutions stored in shared workspace (see the sketch after this list)
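
A simplified sketch of this phase (helper names and the workspace list are illustrative; the real orchestration is in `optillm/mars/mars.py`):

```python
# Sketch: run the agents' initial solution attempts in parallel and collect them in a shared list.
from concurrent.futures import ThreadPoolExecutor

def call_model(problem: str, temperature: float) -> str:
    # Placeholder for a single agent's OpenRouter reasoning call.
    return f"solution drafted at temperature {temperature} for: {problem}"

def explore(agent_id: int, temperature: float, problem: str) -> dict:
    return {"agent": agent_id, "temperature": temperature,
            "solution": call_model(problem, temperature)}

problem = "Find all positive integers n such that ..."
temperatures = [0.3, 0.6, 1.0]     # default per-agent temperatures
workspace = []                     # stand-in for the shared workspace

with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
    futures = [pool.submit(explore, i, t, problem) for i, t in enumerate(temperatures)]
    workspace.extend(f.result() for f in futures)
```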

### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel)
1. Maintain population of N=6 solutions for diversity
2. Select K=3 solutions for aggregation
3. Run T=3 aggregation loops to refine solutions
4. Parallel execution of aggregation API calls
5. Enhanced solutions added back to workspace (loop sketched below)
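
A rough sketch of the aggregation loop under the defaults above (N=6, K=3, T=3); `aggregate` stands in for an LLM call that merges candidate solutions, and how the population is maintained is an assumption:

```python
# Sketch of RSA-inspired aggregation: repeatedly merge a random subset of the population.
import random

def aggregate(candidates):
    # Placeholder for an LLM call that combines several solutions into one refined solution.
    return " | ".join(candidates)

population = [f"candidate solution {i}" for i in range(6)]   # N = 6 for diversity

for _ in range(3):                                           # T = 3 aggregation loops
    selected = random.sample(population, k=3)                # K = 3 solutions per aggregation
    refined = aggregate(selected)
    population.append(refined)                               # refined solution rejoins the pool
    population = population[-6:]                             # keep the population bounded at N
```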

### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel)
1. Extract reasoning strategies from agent solutions
2. Identify successful patterns and techniques
3. Share strategies across agents
4. Generate enhanced solutions using peer insights
5. Parallel execution of strategy extraction and enhancement (see the sketch below)
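
In spirit, the enhancement step exposes each agent to its peers' strategies through a prompt like the one built in this sketch (illustrative names; the real logic is in `strategy_network.py`):

```python
def extract_strategy(solution: str) -> str:
    # Placeholder for an LLM call that summarizes the reasoning strategy behind a solution.
    return f"strategy summary of: {solution[:30]}..."

solutions = {0: "induction on n ...", 1: "generating functions ...", 2: "direct construction ..."}
strategies = {agent: extract_strategy(text) for agent, text in solutions.items()}

def enhancement_prompt(agent_id: int, problem: str) -> str:
    # Build a revision prompt that includes the strategies of all other agents.
    peer_strategies = [s for a, s in strategies.items() if a != agent_id]
    return (f"Problem: {problem}\n"
            f"Your previous solution: {solutions[agent_id]}\n"
            "Strategies used by other agents:\n- " + "\n- ".join(peer_strategies) + "\n"
            "Revise your solution using any peer insight that strengthens it.")
```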

### Phase 3: Verification System (Parallel)
1. Cross-agent verification of all solutions
2. Each solution requires 2 consecutive "CORRECT" assessments (configurable)
3. Verification feedback captured for improvement
4. Solutions marked as verified/unverified
5. Parallel execution of verification calls

### Phase 4: Iterative Improvement (Parallel)
1. Unverified solutions improved based on feedback
2. Agents address specific issues identified in verification
3. Re-verification of improved solutions
4. Process continues until consensus or max iterations (5 by default)
5. Parallel execution of improvement and verification (loop sketched below)
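
Phases 3 and 4 together form the outer loop, sketched here with placeholder `verify` and `improve` calls (the defaults `max_iterations=5` and `consensus_threshold=2` come from the configuration above):

```python
def verify(solution: str) -> tuple:
    # Placeholder verification call: returns (passed, feedback).
    return False, "justify the bound used in step 3"

def improve(solution: str, feedback: str) -> str:
    # Placeholder improvement call that addresses the verifier's feedback.
    return solution + f" [revised to address: {feedback}]"

solutions = ["draft solution A", "draft solution B", "draft solution C"]
max_iterations, consensus_threshold = 5, 2

for _ in range(max_iterations):
    results = [verify(s) for s in solutions]
    verified = [s for s, (ok, _) in zip(solutions, results) if ok]
    if len(verified) >= consensus_threshold:       # early termination on consensus
        break
    solutions = [s if ok else improve(s, fb)       # only unverified solutions are revised
                 for s, (ok, fb) in zip(solutions, results)]
```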

### Phase 5: Final Synthesis
1. **Numerical voting**: If 2+ agents agree on the same numerical answer, use that solution
2. **Best verified solution**: Otherwise, select the highest-scoring verified solution
3. **Synthesis**: If no verified solution, synthesize from the top 3 solutions
4. **Answer extraction**: Apply thinking tags and extract a clean answer (if enabled)
5. Complete solution with mathematical rigor (voting step sketched below)
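
The numerical-voting step can be pictured with a small sketch (the regex extraction here is a toy; the real logic lives in `answer_extraction.py`):

```python
import re
from collections import Counter

def extract_numeric_answer(solution: str):
    # Toy extraction: take the last integer mentioned in the solution text.
    numbers = re.findall(r"-?\d+", solution)
    return numbers[-1] if numbers else None

solutions = ["... therefore the answer is 204",
             "... hence n = 204",
             "... giving a final count of 210"]

votes = Counter(a for a in map(extract_numeric_answer, solutions) if a is not None)
answer, count = votes.most_common(1)[0]
if count >= 2:                      # 2+ agents agree, so use the shared numerical answer
    final_answer = answer           # otherwise fall back to the best verified solution
```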

## Evaluation

MARS is designed to excel on challenging mathematical benchmarks:

- **IMO (International Mathematical Olympiad)**: Complex proof-based problems
- **AIME (American Invitational Mathematics Examination)**: Numerical competition problems
- **LiveCodeBench**: Competitive programming challenges
- **Mathematical reasoning tasks**: General problem-solving capabilities

### Performance Metrics
- **Accuracy**: Percentage of correctly solved problems
- **Verification Rate**: Percentage of solutions passing the consecutive-pass verification threshold (2 passes by default)
- **Reasoning Efficiency**: Tokens used per correct solution
- **Consensus Quality**: Agreement between verified solutions

## Benchmark Results

### Gemini 2.5 Flash Lite Preview Model

Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via OpenRouter:

| Benchmark | Approach | Problems | Correct | Accuracy | Notes |
|-----------|----------|----------|---------|----------|-------|
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** |
| **IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
| **IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | **+1 problem (+16.7pp)** |
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+29.3%)** |
230+
### Key Findings
231+
232+
#### AIME 2025: Significant Accuracy Improvement
233+
- **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
234+
- **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points
235+
- **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems
236+
- **Approach**: 3 agents with diverse temperatures, iterative verification and refinement
237+
238+
#### IMO 2025: Proof-Based Competition Problems
239+
240+
- **Results**: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
241+
- **Improvement**: +1 problem (+100% relative improvement), +16.6 percentage points
242+
- **Problems Solved**: Problem 2 (geometry proof) + Problem 4 (number theory)
243+
- **Runtime**: ~10 minutes per problem (vs ~40 seconds baseline)
244+
- **Key Success Factor**: Multi-agent exploration with disabled thinking tags allows full proof visibility
245+
- **Configuration**: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems
246+
247+
#### LiveCodeBench: Strong Performance with Thinking Tags
248+
- **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
249+
- **Improvement**: +12 problems (+29.3% relative improvement), +11.43 percentage points
250+
- **Code Extraction**: 87/105 (82.9%) vs baseline 54/105 (51.4%) - **+61.1% improvement**
251+
- **Key Success Factor**: Thinking tags beneficial for code generation - allows agents to reason through logic before writing code
252+
- **Multi-agent benefit**: Different temperature agents explore varied solution approaches
253+
254+
#### Lessons Learned
255+
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
256+
2. **MARS improves proof-based problems**: +100% relative improvement on IMO 2025 (16.7% → 33.3%)
257+
3. **Thinking tags are problem-type dependent**:
258+
-**Enable for code generation**: +29.3% improvement on LiveCodeBench
259+
-**Enable for numerical problems**: Multi-agent reasoning effective on AIME
260+
-**Disable for proof problems**: IMO proofs need full visibility to evaluators
261+
4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
262+
5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
263+
264+
### Completed Evaluations
265+
266+
-**AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
267+
-**IMO 2025**: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) **+16.6pp improvement**
268+
-**LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**
269+
270+
*All evaluations use gemini-2.5-flash-lite-preview-09-2025 model via OpenRouter.*

### Configuration for IMO Proof Problems
For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
```python
extra_body = {
    "optillm_approach": "mars",
    "mars_config": {
        "use_thinking_tags": False,        # Full proof visibility
        "answer_extraction_mode": "none"   # Proofs are the answer
    }
}
```

*All evaluations use pass@1 accuracy metric.*

## Implementation Details

### Temperature Diversity Strategy (3-Agent Default)
- **Agent 0**: Temperature 0.3 (Conservative, rigorous, low effort)
- **Agent 1**: Temperature 0.6 (Balanced approach, medium effort)
- **Agent 2**: Temperature 1.0 (Maximum exploration, high effort)

**Note**: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)
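
The cycling rule amounts to indexing into the base temperature list modulo its length (a small illustration):

```python
BASE_TEMPERATURES = [0.3, 0.6, 1.0]

def agent_temperature(agent_id: int) -> float:
    # Agents beyond the first three reuse the base temperatures in order.
    return BASE_TEMPERATURES[agent_id % len(BASE_TEMPERATURES)]

print([agent_temperature(i) for i in range(5)])   # [0.3, 0.6, 1.0, 0.3, 0.6]
```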

### Reasoning Effort Allocation (OpenRouter API)
- **Low effort** (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning
- **Medium effort** (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning
- **High effort** (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth

**Note**: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
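
Mapping a temperature to the OpenRouter reasoning payload described above (a sketch; the exact request construction inside MARS may differ):

```python
def reasoning_payload(temperature: float) -> dict:
    # Temperature bands from the list above map onto OpenRouter effort levels.
    if temperature <= 0.4:
        effort = "low"
    elif temperature <= 0.8:
        effort = "medium"
    else:
        effort = "high"
    return {"reasoning": {"effort": effort}}

print(reasoning_payload(0.6))   # {'reasoning': {'effort': 'medium'}}
```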

### Verification Criteria
Solutions are verified based on:
- **Mathematical correctness**: Accurate calculations and logic
- **Completeness**: All problem aspects addressed
- **Rigor**: Proper justification for each step
- **Clarity**: Clear mathematical communication
- **Format compliance**: Proper answer formatting

## Inspired By

- **IMO25 Solver**: 5/6 problems solved with 5-consecutive-pass verification
- **Gemini 2.5 Pro Deep Think**: Native reasoning tokens and thinking budgets
- **OpenRouter Reasoning API**: Standardized interface for deep thinking
- **CEPO Architecture**: Multi-file approach pattern in OptiLLM

## Future Enhancements

- **Multi-model support**: Different models for different agent roles
- **Dynamic temperature adjustment**: Adaptive exploration strategies
- **Specialized agent roles**: Proof-focused, computation-focused, verification-focused
- **Knowledge base integration**: Access to mathematical theorems and techniques
- **Interactive verification**: Human-in-the-loop verification for critical problems

optillm/mars/__init__.py

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@

"""
MARS: Multi-Agent Reasoning System

A multi-agent reasoning system for enhanced mathematical problem solving,
inspired by systems like Gemini 2.5 Pro Deep Think and the IMO25 solver.
Leverages OpenRouter's reasoning API for deep mathematical thinking.
"""

from .mars import multi_agent_reasoning_system

__all__ = ['multi_agent_reasoning_system']
