Merged
33 commits
- 2488c2c — init (codelion, Sep 21, 2025)
- b4146f7 — Update eval_aime_benchmark.py (codelion, Sep 21, 2025)
- c87470d — fxies (codelion, Sep 23, 2025)
- 4912a2f — a (codelion, Sep 23, 2025)
- e2b5a43 — Update mars.py (codelion, Sep 23, 2025)
- 2db5be0 — hg (codelion, Sep 23, 2025)
- 5f3bf69 — d (codelion, Sep 23, 2025)
- 757a806 — Update mars.py (codelion, Sep 23, 2025)
- 945c9d5 — Update mars.py (codelion, Sep 24, 2025)
- 7b6f652 — Update agent.py (codelion, Sep 24, 2025)
- 129f098 — fix (codelion, Sep 24, 2025)
- 4c9794b — fix (codelion, Sep 24, 2025)
- 13a4b91 — add tests (codelion, Sep 24, 2025)
- 757a620 — f (codelion, Sep 24, 2025)
- fd0326a — Update eval_imo25_benchmark.py (codelion, Sep 25, 2025)
- 6d7f57c — Update eval_imo25_benchmark.py (codelion, Sep 25, 2025)
- fece917 — f (codelion, Sep 25, 2025)
- 615b0b6 — Update eval_imo25_benchmark.py (codelion, Sep 26, 2025)
- 0ed0b40 — Update eval_aime_benchmark.py (codelion, Sep 26, 2025)
- 1d26236 — Merge branch 'main' into feat-mars (codelion, Sep 27, 2025)
- e982d42 — Update requirements_proxy_only.txt (codelion, Sep 29, 2025)
- 138af6d — Merge branch 'main' into feat-mars (codelion, Sep 29, 2025)
- 5cd17d6 — update (codelion, Sep 29, 2025)
- 5b48bdf — fixes (codelion, Sep 30, 2025)
- d5b0928 — Merge branch 'main' into feat-mars (codelion, Sep 30, 2025)
- bf6c6ec — Merge branch 'main' into feat-mars (codelion, Sep 30, 2025)
- 1d80ca0 — fix using math verify (codelion, Sep 30, 2025)
- 72561e5 — Update mars.py (codelion, Sep 30, 2025)
- d9c04b5 — fixes (codelion, Oct 1, 2025)
- 2e35fbb — wip (codelion, Oct 1, 2025)
- 780f9d3 — f (codelion, Oct 1, 2025)
- 3513d60 — Update README.md (codelion, Oct 3, 2025)
- 0fd98e6 — Add MARS results and bump version to 0.3.3 (codelion, Oct 3, 2025)
16 changes: 16 additions & 0 deletions README.md
@@ -72,6 +72,7 @@ OptiLLM delivers measurable improvements across diverse benchmarks:

| Technique | Base Model | Improvement | Benchmark |
|-----------|------------|-------------|-----------|
| **MARS** | Gemini 2.5 Flash Lite | **+30.0 points** | AIME 2025 (43.3→73.3) |
| **CePO** | Llama 3.3 70B | **+18.6 points** | Math-L5 (51.0→69.6) |
| **AutoThink** | DeepSeek-R1-1.5B | **+9.34 points** | GPQA-Diamond (21.72→31.06) |
| **LongCePO** | Llama 3.3 70B | **+13.6 points** | InfiniteBench (58.0→71.6) |
@@ -158,6 +159,7 @@ optillm

| Approach | Slug | Description |
| ------------------------------------ | ------------------ | ---------------------------------------------------------------------------------------------- |
| [MARS (Multi-Agent Reasoning System)](optillm/mars) | `mars` | Multi-agent reasoning with diverse temperature exploration, cross-verification, and iterative improvement |
| [Cerebras Planning and Optimization](optillm/cepo) | `cepo` | Combines Best of N, Chain-of-Thought, Self-Reflection, Self-Improvement, and various prompting techniques |
| CoT with Reflection | `cot_reflection` | Implements chain-of-thought reasoning with \<thinking\>, \<reflection> and \<output> sections |
| PlanSearch | `plansearch` | Implements a search algorithm over candidate plans for solving a problem in natural language |
@@ -747,6 +749,20 @@ Authorization: Bearer your_secret_api_key

## SOTA results on benchmarks with optillm

### MARS on AIME 2025, IMO 2025, and LiveCodeBench (Oct 2025)

| Benchmark | Approach | Problems | Correct | Accuracy | Improvement |
|-----------|----------|----------|---------|----------|-------------|
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | - |
| **AIME 2025** | **MARS** | 30 | **22** | **73.3%** | **+30.0pp (+69.2%)** |
| **IMO 2025** | Baseline | 6 | 1 | 16.7% | - |
| **IMO 2025** | **MARS** | 6 | **2** | **33.3%** | **+16.7pp (+100%)** |
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | - |
| **LiveCodeBench v5/v6** | **MARS** | 105 | **53** | **50.48%** | **+11.43pp (+29.3%)** |

Model: google/gemini-2.5-flash-lite-preview-09-2025 via OpenRouter
Configuration: 3 agents, 2-pass verification, thinking tags disabled for proofs

### AutoThink on GPQA-Diamond & MMLU-Pro (May 2025)

| **Model** | **GPQA-Diamond** | | **MMLU-Pro** | |
2 changes: 1 addition & 1 deletion optillm/__init__.py
@@ -1,5 +1,5 @@
# Version information
__version__ = "0.3.2"
__version__ = "0.3.3"

# Import from server module
from .server import (
323 changes: 323 additions & 0 deletions optillm/mars/README.md
@@ -0,0 +1,323 @@
# MARS: Multi-Agent Reasoning System

A sophisticated multi-agent reasoning system designed for challenging mathematical problems, inspired by systems like Gemini 2.5 Pro Deep Think and the successful IMO25 solver.

## Overview

MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
- **Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable)
- **Rigorous verification** using a 2-pass consensus threshold (configurable)
- **Iterative improvement** based on verification feedback
- **OpenRouter reasoning API** for deep mathematical thinking
- **RSA-inspired aggregation** for solution refinement
- **Strategy network** for cross-agent insight sharing
- **Shared workspace** for agent collaboration

## Key Features

### 1. Multi-Agent Architecture
- **3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced)
- **Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies
- **Independent reasoning** followed by collaborative verification

### 2. OpenRouter Reasoning API Integration
- **Effort-based reasoning**: "low", "medium", "high" effort levels via OpenRouter API
- **Adaptive allocation**: Low effort (temp ≤ 0.4), Medium (0.4-0.8), High (> 0.8)
- **Configurable token budgets**: 4K for lightweight coding, 64K for complex reasoning

### 3. Verification System
- **2-pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced)
- **Cross-agent verification**: Agents verify each other's solutions
- **Mathematical rigor**: Focus on complete proofs, not just correct answers
- **Consensus building**: Multiple verified solutions required

### 4. Iterative Improvement
- **Feedback-driven**: Solutions improved based on verification feedback
- **Error correction**: Automatic identification and fixing of mathematical errors
- **Logical gap filling**: Strengthening incomplete reasoning steps

## Architecture Components

```
optillm/mars/
├── __init__.py # Package exports
├── mars.py # Main orchestration with parallel execution
├── agent.py # Individual agent implementation
├── workspace.py # Shared collaboration workspace
├── verifier.py # Multi-pass verification system
├── aggregator.py # RSA-inspired solution aggregation
├── strategy_network.py # Cross-agent insight sharing
├── answer_extraction.py # Clean answer extraction with thinking tags
├── prompts.py # Mathematical reasoning prompts
└── README.md # This documentation
```

## Configuration

### Default Configuration (Mathematical Reasoning)
```python
DEFAULT_CONFIG = {
'num_agents': 3, # Number of parallel agents
'max_iterations': 5, # Maximum improvement iterations
'verification_passes_required': 2, # Consecutive passes needed
'consensus_threshold': 2, # Verified solutions for consensus
'min_verified_solutions': 1, # Minimum to proceed
'max_tokens': 64000, # Token budget for complex reasoning
'max_verification_attempts': 3, # Max verification retries
'early_termination': True, # Stop on consensus
'use_reasoning_api': True, # Enable OpenRouter reasoning
# RSA-inspired aggregation
'enable_aggregation': True, # Enable solution aggregation
'population_size': 6, # Population for diversity
'aggregation_size': 3, # Solutions per aggregation
'aggregation_loops': 3, # Aggregation iterations
# Strategy Network
'enable_strategy_network': True, # Cross-agent insight sharing
'strategy_extraction_enabled': True, # Extract reasoning strategies
'cross_agent_enhancement': True, # Enhanced solutions via peer strategies
# Thinking tags and answer extraction
'use_thinking_tags': True, # Wrap reasoning in <think> tags
'answer_extraction_mode': 'auto', # 'auto', 'code', 'math', or 'none'
}
```

### Lightweight Configuration (Coding Benchmarks)
```python
LIGHTWEIGHT_CONFIG = {
'num_agents': 2, # Reduced agent count
'max_iterations': 2, # Faster iteration limit
'verification_passes_required': 1, # Single-pass verification
'consensus_threshold': 1, # Lower threshold for 2 agents
'min_verified_solutions': 1,
'max_tokens': 4000, # Smaller token budget
'max_verification_attempts': 2,
'early_termination': True,
'use_reasoning_api': True,
# Disable expensive features for speed
'enable_aggregation': False, # Skip RSA aggregation
'enable_strategy_network': False, # Skip strategy network
'strategy_extraction_enabled': False,
'cross_agent_enhancement': False,
# Thinking tags still enabled
'use_thinking_tags': True,
'answer_extraction_mode': 'auto',
}
```

**Note**: MARS automatically uses lightweight config when `max_tokens ≤ 4000` in the request.
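The selection rule above can be sketched as a small helper. This is an illustrative sketch of the documented behavior (`max_tokens ≤ 4000` triggers the lightweight profile), not the actual MARS internals; the function name is hypothetical.

```python
def select_config(request_max_tokens, default_config, lightweight_config):
    """Pick a MARS profile from the request's token budget.

    Requests with a small budget (<= 4000 tokens) fall back to the
    lightweight profile; everything else uses the default profile.
    """
    if request_max_tokens is not None and request_max_tokens <= 4000:
        return lightweight_config
    return default_config

# A coding-benchmark request with max_tokens=4000 gets the lightweight
# profile; an unconstrained request keeps the default.
cfg = select_config(4000, {"num_agents": 3}, {"num_agents": 2})
```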

## Usage

### Via OptiLLM Server
```bash
# Start OptiLLM with MARS support
python optillm.py --model google/gemini-2.5-flash-lite --approach mars

# Make API call
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mars-google/gemini-2.5-flash-lite",
"messages": [
{
"role": "user",
"content": "Solve this IMO problem: Find all positive integers n such that..."
}
]
}'
```

### Via extra_body Parameter
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="anything")

response = client.chat.completions.create(
model="google/gemini-2.5-flash-lite",
messages=[
{"role": "user", "content": "Mathematical problem here"}
],
extra_body={"optillm_approach": "mars"}
)
```

### Via Prompt Tags
```python
response = client.chat.completions.create(
model="google/gemini-2.5-flash-lite",
messages=[
{"role": "system", "content": "<optillm_approach>mars</optillm_approach>"},
{"role": "user", "content": "Mathematical problem here"}
]
)
```

## Process Flow

### Phase 1: Multi-Agent Exploration (Parallel)
1. Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
2. Each agent independently analyzes the problem
3. Generate initial solutions using OpenRouter reasoning API with effort levels
4. All agent API calls executed in parallel via ThreadPoolExecutor
5. Solutions stored in shared workspace
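The steps above can be sketched as follows. `solve` is a placeholder for the real agent/OpenRouter reasoning call; only the parallel-fan-out structure and the temperature assignment reflect the documented design.

```python
from concurrent.futures import ThreadPoolExecutor

AGENT_TEMPERATURES = [0.3, 0.6, 1.0]

def solve(problem, temperature):
    # Placeholder for the real reasoning-API call made by each agent.
    return {"temperature": temperature, "solution": f"draft for {problem!r}"}

def explore(problem, num_agents=3):
    """Phase 1: run all agents' initial attempts in parallel."""
    temps = [AGENT_TEMPERATURES[i % len(AGENT_TEMPERATURES)]
             for i in range(num_agents)]
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        futures = [pool.submit(solve, problem, t) for t in temps]
        return [f.result() for f in futures]  # preserves submit order
```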

### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel)
1. Maintain population of N=6 solutions for diversity
2. Select K=3 solutions for aggregation
3. Run T=3 aggregation loops to refine solutions
4. Parallel execution of aggregation API calls
5. Enhanced solutions added back to workspace
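A hedged sketch of the loop above (population N=6, sample K=3, T=3 loops); `aggregate` stands in for the LLM call that merges K candidate solutions into one refined solution, and the population-trimming policy shown here is an assumption.

```python
import random

def aggregate(candidates):
    # Placeholder: the real system prompts the model with all K candidates.
    return "refined(" + " + ".join(candidates) + ")"

def rsa_refine(population, k=3, loops=3, population_size=6, rng=None):
    """Run T aggregation loops over a bounded solution population."""
    rng = rng or random.Random(0)
    pop = list(population)
    for _ in range(loops):
        picked = rng.sample(pop, k)    # select K solutions for aggregation
        pop.append(aggregate(picked))  # refined solution rejoins the pool
        pop = pop[-population_size:]   # keep the population bounded at N
    return pop
```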

### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel)
1. Extract reasoning strategies from agent solutions
2. Identify successful patterns and techniques
3. Share strategies across agents
4. Generate enhanced solutions using peer insights
5. Parallel execution of strategy extraction and enhancement

### Phase 3: Verification System (Parallel)
1. Cross-agent verification of all solutions
2. Each solution requires 2 consecutive "CORRECT" assessments (configurable)
3. Verification feedback captured for improvement
4. Solutions marked as verified/unverified
5. Parallel execution of verification calls
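The consecutive-pass rule above reduces to a streak counter: a solution counts as verified only after `passes_required` consecutive CORRECT assessments, and any INCORRECT verdict resets the streak. A minimal sketch (illustrative names, not the actual verifier code):

```python
def is_verified(verdicts, passes_required=2):
    """verdicts: ordered 'CORRECT' / 'INCORRECT' assessments for one solution."""
    streak = 0
    for v in verdicts:
        streak = streak + 1 if v == "CORRECT" else 0  # reset on failure
        if streak >= passes_required:
            return True
    return False
```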

### Phase 4: Iterative Improvement (Parallel)
1. Unverified solutions improved based on feedback
2. Agents address specific issues identified in verification
3. Re-verification of improved solutions
4. Process continues until consensus or max iterations (5 default)
5. Parallel execution of improvement and verification
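The improvement loop above can be sketched like this, with hypothetical `improve()` and `verify()` stand-ins for the real API calls: unverified solutions are revised against their feedback until consensus is reached or the iteration cap (5 by default) is hit.

```python
def improvement_loop(solutions, verify, improve,
                     max_iterations=5, consensus_threshold=2):
    """Revise unverified solutions until consensus or the iteration cap."""
    for _ in range(max_iterations):
        verified = [s for s in solutions if s["verified"]]
        if len(verified) >= consensus_threshold:
            break  # early termination on consensus
        for s in solutions:
            if not s["verified"]:
                s["text"] = improve(s["text"], s.get("feedback", ""))
                s["verified"], s["feedback"] = verify(s["text"])  # re-verify
    return solutions
```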

### Phase 5: Final Synthesis
1. **Numerical voting**: If 2+ agents agree on same numerical answer, use that solution
2. **Best verified solution**: Otherwise, select highest-scoring verified solution
3. **Synthesis**: If no verified solution, synthesize from top 3 solutions
4. **Answer extraction**: Apply thinking tags and extract clean answer (if enabled)
5. Complete solution with mathematical rigor
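The fallback order above can be sketched as one function. The `answer`/`verified`/`score` fields are an assumed data model for illustration, not the actual MARS structures, and step 3 is simplified to picking the top-scoring candidate rather than prompting the model to synthesize.

```python
from collections import Counter

def synthesize(candidates, vote_threshold=2):
    # 1. Numerical voting: 2+ agents agreeing on an answer wins outright.
    counts = Counter(c["answer"] for c in candidates if c["answer"] is not None)
    if counts:
        answer, votes = counts.most_common(1)[0]
        if votes >= vote_threshold:
            return answer
    # 2. Otherwise, take the best verified solution by score.
    verified = [c for c in candidates if c["verified"]]
    if verified:
        return max(verified, key=lambda c: c["score"])["answer"]
    # 3. No verified solution: fall back to the top-scoring candidates.
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:3]
    return top[0]["answer"] if top else None
```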

## Evaluation

MARS is designed to excel on challenging mathematical benchmarks:

- **IMO (International Mathematical Olympiad)**: Complex proof-based problems
- **AIME (American Invitational Mathematics Examination)**: Numerical competition problems
- **LiveCodeBench**: Competitive programming challenges
- **Mathematical reasoning tasks**: General problem-solving capabilities

### Performance Metrics
- **Accuracy**: Percentage of correctly solved problems
- **Verification Rate**: Percentage of solutions passing the consecutive-pass verification threshold (2 by default)
- **Reasoning Efficiency**: Tokens used per correct solution
- **Consensus Quality**: Agreement between verified solutions

## Benchmark Results

### Gemini 2.5 Flash Lite Preview Model

Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via OpenRouter:

| Benchmark | Approach | Problems | Correct | Accuracy | Notes |
|-----------|----------|----------|---------|----------|-------|
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** |
| **IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
| **IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | **+1 problem (+16.7pp)** |
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+29.3%)** |

### Key Findings

#### AIME 2025: Significant Accuracy Improvement
- **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
- **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points
- **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems
- **Approach**: 3 agents with diverse temperatures, iterative verification and refinement

#### IMO 2025: Proof-Based Competition Problems

- **Results**: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
- **Improvement**: +1 problem (+100% relative improvement), +16.7 percentage points
- **Problems Solved**: Problem 2 (geometry proof) + Problem 4 (number theory)
- **Runtime**: ~10 minutes per problem (vs ~40 seconds baseline)
- **Key Success Factor**: Multi-agent exploration with disabled thinking tags allows full proof visibility
- **Configuration**: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems

#### LiveCodeBench: Strong Performance with Thinking Tags
- **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
- **Improvement**: +12 problems (+29.3% relative improvement), +11.43 percentage points
- **Code Extraction**: 87/105 (82.9%) vs baseline 54/105 (51.4%) - **+61.1% improvement**
- **Key Success Factor**: Thinking tags beneficial for code generation - allows agents to reason through logic before writing code
- **Multi-agent benefit**: Different temperature agents explore varied solution approaches

#### Lessons Learned
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
2. **MARS improves proof-based problems**: +100% relative improvement on IMO 2025 (16.7% → 33.3%)
3. **Thinking tags are problem-type dependent**:
- ✅ **Enable for code generation**: +29.3% improvement on LiveCodeBench
- ✅ **Enable for numerical problems**: Multi-agent reasoning effective on AIME
- ❌ **Disable for proof problems**: IMO proofs need full visibility to evaluators
4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)

### Completed Evaluations

- ✅ **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
- ✅ **IMO 2025**: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) **+16.7pp improvement**
- ✅ **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**

*All evaluations use gemini-2.5-flash-lite-preview-09-2025 model via OpenRouter.*

### Configuration for IMO Proof Problems
For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
```python
extra_body = {
"optillm_approach": "mars",
"mars_config": {
"use_thinking_tags": False, # Full proof visibility
"answer_extraction_mode": "none" # Proofs are the answer
}
}
```

*All evaluations use pass@1 accuracy metric.*

## Implementation Details

### Temperature Diversity Strategy (3-Agent Default)
- **Agent 0**: Temperature 0.3 (Conservative, rigorous, low effort)
- **Agent 1**: Temperature 0.6 (Balanced approach, medium effort)
- **Agent 2**: Temperature 1.0 (Maximum exploration, high effort)

**Note**: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)

### Reasoning Effort Allocation (OpenRouter API)
- **Low effort** (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning
- **Medium effort** (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning
- **High effort** (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth

**Note**: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
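The thresholds above as a small helper (a sketch of the mapping, not the actual MARS source): each agent's temperature determines the reasoning-effort payload sent with its OpenRouter request.

```python
def reasoning_effort(temperature):
    """Map an agent's sampling temperature to an OpenRouter effort level."""
    if temperature <= 0.4:
        return {"reasoning": {"effort": "low"}}
    if temperature <= 0.8:
        return {"reasoning": {"effort": "medium"}}
    return {"reasoning": {"effort": "high"}}

# The default agents (0.3, 0.6, 1.0) map to low, medium, and high effort.
```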

### Verification Criteria
Solutions are verified based on:
- **Mathematical correctness**: Accurate calculations and logic
- **Completeness**: All problem aspects addressed
- **Rigor**: Proper justification for each step
- **Clarity**: Clear mathematical communication
- **Format compliance**: Proper answer formatting

## Inspired By

- **IMO25 Solver**: 5/6 problems solved with 5-consecutive-pass verification
- **Gemini 2.5 Pro Deep Think**: Native reasoning tokens and thinking budgets
- **OpenRouter Reasoning API**: Standardized interface for deep thinking
- **CEPO Architecture**: Multi-file approach pattern in OptiLLM

## Future Enhancements

- **Multi-model support**: Different models for different agent roles
- **Dynamic temperature adjustment**: Adaptive exploration strategies
- **Specialized agent roles**: Proof-focused, computation-focused, verification-focused
- **Knowledge base integration**: Access to mathematical theorems and techniques
- **Interactive verification**: Human-in-the-loop verification for critical problems
11 changes: 11 additions & 0 deletions optillm/mars/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
"""
MARS: Multi-Agent Reasoning System

A multi-agent reasoning system for enhanced mathematical problem solving,
inspired by systems like Gemini 2.5 Pro Deep Think and the IMO25 solver.
Leverages OpenRouter's reasoning API for deep mathematical thinking.
"""

from .mars import multi_agent_reasoning_system

__all__ = ['multi_agent_reasoning_system']