Commit 3513d60

Update README.md
1 parent 780f9d3 commit 3513d60

1 file changed

+142
-79
lines changed

optillm/mars/README.md

Lines changed: 142 additions & 79 deletions
@@ -5,26 +5,28 @@ A sophisticated multi-agent reasoning system designed for challenging mathematic
55
## Overview
66

77
MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
8-
- **Multi-agent exploration** with diverse reasoning approaches
9-
- **Rigorous verification** using a 5-pass consensus threshold
8+
- **Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable)
9+
- **Rigorous verification** using a 2-pass consensus threshold (configurable)
1010
- **Iterative improvement** based on verification feedback
1111
- **OpenRouter reasoning API** for deep mathematical thinking
12+
- **RSA-inspired aggregation** for solution refinement
13+
- **Strategy network** for cross-agent insight sharing
1214
- **Shared workspace** for agent collaboration
1315

1416
## Key Features
1517

1618
### 1. Multi-Agent Architecture
17-
- **5 parallel agents** with different temperature settings (0.3-1.0)
18-
- **Temperature diversity** ensures varied exploration strategies
19+
- **3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced)
20+
- **Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies
1921
- **Independent reasoning** followed by collaborative verification
2022

2123
### 2. OpenRouter Reasoning API Integration
22-
- **Thinking tokens**: Up to 32,768 tokens for deep reasoning
23-
- **Effort levels**: Low (20%), Medium (50%), High (80%) reasoning budgets
24-
- **Adaptive allocation** based on agent temperature and problem complexity
24+
- **Effort-based reasoning**: "low", "medium", "high" effort levels via OpenRouter API
25+
- **Adaptive allocation** by agent temperature: low effort (temp ≤ 0.4), medium (0.4 < temp ≤ 0.8), high (temp > 0.8)
26+
- **Configurable token budgets**: 4K for lightweight coding, 64K for complex reasoning
2527

2628
### 3. Verification System
27-
- **5-pass threshold**: Solutions must pass 5 consecutive verifications
29+
- **2-pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced)
2830
- **Cross-agent verification**: Agents verify each other's solutions
2931
- **Mathematical rigor**: Focus on complete proofs, not just correct answers
3032
- **Consensus building**: Multiple verified solutions required
@@ -38,33 +40,72 @@ MARS leverages multiple AI agents working collaboratively to solve complex mathe
3840

3941
```
4042
optillm/mars/
41-
├── __init__.py # Package exports
42-
├── mars.py # Main orchestration logic
43-
├── agent.py # Individual agent implementation
44-
├── workspace.py # Shared collaboration workspace
45-
├── verifier.py # 5-pass verification system
46-
├── prompts.py # Mathematical reasoning prompts
47-
└── README.md # This documentation
43+
├── __init__.py # Package exports
44+
├── mars.py # Main orchestration with parallel execution
45+
├── agent.py # Individual agent implementation
46+
├── workspace.py # Shared collaboration workspace
47+
├── verifier.py # Multi-pass verification system
48+
├── aggregator.py # RSA-inspired solution aggregation
49+
├── strategy_network.py # Cross-agent insight sharing
50+
├── answer_extraction.py # Clean answer extraction with thinking tags
51+
├── prompts.py # Mathematical reasoning prompts
52+
└── README.md # This documentation
4853
```
4954

5055
## Configuration
5156

52-
### Default Configuration
57+
### Default Configuration (Mathematical Reasoning)
5358
```python
5459
DEFAULT_CONFIG = {
55-
'num_agents': 5, # Number of parallel agents
56-
'max_iterations': 30, # Maximum improvement iterations
57-
'verification_passes_required': 5, # Consecutive passes needed
58-
'consensus_threshold': 2, # Verified solutions for consensus
59-
'min_verified_solutions': 1, # Minimum to proceed
60-
'thinking_budget_initial': 10000, # Initial reasoning tokens
61-
'thinking_budget_max': 32000, # Maximum reasoning tokens
62-
'max_response_tokens': 4096, # Maximum response length
63-
'early_termination': True, # Stop on consensus
64-
'use_reasoning_api': True # Enable OpenRouter reasoning
60+
'num_agents': 3, # Number of parallel agents
61+
'max_iterations': 5, # Maximum improvement iterations
62+
'verification_passes_required': 2, # Consecutive passes needed
63+
'consensus_threshold': 2, # Verified solutions for consensus
64+
'min_verified_solutions': 1, # Minimum to proceed
65+
'max_tokens': 64000, # Token budget for complex reasoning
66+
'max_verification_attempts': 3, # Max verification retries
67+
'early_termination': True, # Stop on consensus
68+
'use_reasoning_api': True, # Enable OpenRouter reasoning
69+
# RSA-inspired aggregation
70+
'enable_aggregation': True, # Enable solution aggregation
71+
'population_size': 6, # Population for diversity
72+
'aggregation_size': 3, # Solutions per aggregation
73+
'aggregation_loops': 3, # Aggregation iterations
74+
# Strategy Network
75+
'enable_strategy_network': True, # Cross-agent insight sharing
76+
'strategy_extraction_enabled': True, # Extract reasoning strategies
77+
'cross_agent_enhancement': True, # Enhanced solutions via peer strategies
78+
# Thinking tags and answer extraction
79+
'use_thinking_tags': True, # Wrap reasoning in <think> tags
80+
'answer_extraction_mode': 'auto', # 'auto', 'code', 'math', or 'none'
6581
}
6682
```
6783

84+
### Lightweight Configuration (Coding Benchmarks)
85+
```python
86+
LIGHTWEIGHT_CONFIG = {
87+
'num_agents': 2, # Reduced agent count
88+
'max_iterations': 2, # Faster iteration limit
89+
'verification_passes_required': 1, # Single-pass verification
90+
'consensus_threshold': 1, # Lower threshold for 2 agents
91+
'min_verified_solutions': 1,
92+
'max_tokens': 4000, # Smaller token budget
93+
'max_verification_attempts': 2,
94+
'early_termination': True,
95+
'use_reasoning_api': True,
96+
# Disable expensive features for speed
97+
'enable_aggregation': False, # Skip RSA aggregation
98+
'enable_strategy_network': False, # Skip strategy network
99+
'strategy_extraction_enabled': False,
100+
'cross_agent_enhancement': False,
101+
# Thinking tags still enabled
102+
'use_thinking_tags': True,
103+
'answer_extraction_mode': 'auto',
104+
}
105+
```
106+
107+
**Note**: MARS automatically switches to the lightweight configuration when the request sets `max_tokens` to 4000 or less.
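
The switch described in the note can be sketched as follows. This is a minimal illustration, not the code in `mars.py`: the helper name `select_config` and the constant `LIGHTWEIGHT_THRESHOLD` are hypothetical, and the fallback to the default configuration when the request carries no `max_tokens` is an assumption.

```python
LIGHTWEIGHT_THRESHOLD = 4000  # from the note above: max_tokens <= 4000


def select_config(request_max_tokens, default_config, lightweight_config):
    """Pick a configuration from the request's max_tokens (hypothetical helper).

    Returns a copy so callers can apply overrides without mutating
    the shared configuration dictionaries.
    """
    if request_max_tokens is not None and request_max_tokens <= LIGHTWEIGHT_THRESHOLD:
        return dict(lightweight_config)
    # Assumption: a request without max_tokens uses the default configuration.
    return dict(default_config)
```

Returning a copy keeps per-request overrides from leaking into later requests.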
108+
68109
## Usage
69110

70111
### Via OptiLLM Server
@@ -114,29 +155,47 @@ response = client.chat.completions.create(
114155

115156
## Process Flow
116157

117-
### Phase 1: Multi-Agent Exploration
118-
1. Initialize 5 agents with diverse temperatures
158+
### Phase 1: Multi-Agent Exploration (Parallel)
159+
1. Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
119160
2. Each agent independently analyzes the problem
120-
3. Generate initial solutions using OpenRouter reasoning API
121-
4. Solutions stored in shared workspace
122-
123-
### Phase 2: Verification System
161+
3. Generate initial solutions using OpenRouter reasoning API with effort levels
162+
4. All agent API calls executed in parallel via ThreadPoolExecutor
163+
5. Solutions stored in shared workspace
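
The parallel fan-out in Phase 1 can be sketched with the standard library's `ThreadPoolExecutor`. The function name `run_agents_in_parallel` and the `solve_fn` callback are hypothetical stand-ins for the per-agent API call; the real orchestration may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agents_in_parallel(problem, temperatures, solve_fn):
    """Run one solve_fn call per temperature concurrently (hypothetical helper).

    solve_fn(problem, temperature) stands in for a single agent's API call.
    Results come back in the same order as `temperatures`.
    """
    with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
        futures = [pool.submit(solve_fn, problem, t) for t in temperatures]
        # .result() blocks until each call finishes and re-raises its exception.
        return [future.result() for future in futures]
```

Threads suit this workload because each agent call spends most of its time waiting on network I/O.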
164+
165+
### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel)
166+
1. Maintain population of N=6 solutions for diversity
167+
2. Select K=3 solutions for aggregation
168+
3. Run T=3 aggregation loops to refine solutions
169+
4. Parallel execution of aggregation API calls
170+
5. Enhanced solutions added back to workspace
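
A rough sketch of the aggregation loop, using the N, K, and T parameters named above. It assumes each loop rebuilds the whole population by aggregating K randomly sampled candidates per slot; whether `aggregator.py` works exactly this way is not stated here. `aggregate_population` and `aggregate_fn` are hypothetical names, with `aggregate_fn` standing in for the model call that merges several solutions into one.

```python
import random


def aggregate_population(population, aggregate_fn, k=3, loops=3, seed=0):
    """Refine a population by repeated K-way aggregation (illustrative sketch).

    population: list of N candidate solutions (N=6 by default in the config).
    aggregate_fn: takes a list of K candidates and returns one refined candidate.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    current = list(population)
    for _ in range(loops):
        # Each slot in the next population aggregates K sampled candidates.
        current = [aggregate_fn(rng.sample(current, k)) for _ in range(len(current))]
    return current
```

The population size stays constant across loops, so diversity is limited only by what `aggregate_fn` produces.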
171+
172+
### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel)
173+
1. Extract reasoning strategies from agent solutions
174+
2. Identify successful patterns and techniques
175+
3. Share strategies across agents
176+
4. Generate enhanced solutions using peer insights
177+
5. Parallel execution of strategy extraction and enhancement
178+
179+
### Phase 3: Verification System (Parallel)
124180
1. Cross-agent verification of all solutions
125-
2. Each solution requires 5 consecutive "CORRECT" assessments
181+
2. Each solution requires 2 consecutive "CORRECT" assessments (configurable)
126182
3. Verification feedback captured for improvement
127183
4. Solutions marked as verified/unverified
184+
5. Parallel execution of verification calls
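
The consecutive-pass rule in step 2 can be expressed as a small check. This sketch assumes a non-"CORRECT" assessment resets the streak; the function name `is_verified` is hypothetical and is not taken from `verifier.py`.

```python
def is_verified(assessments, passes_required=2):
    """Return True once `passes_required` consecutive "CORRECT" verdicts occur.

    assessments: verdict strings in the order the verifiers produced them.
    Assumption: any other verdict resets the streak to zero.
    """
    streak = 0
    for verdict in assessments:
        streak = streak + 1 if verdict == "CORRECT" else 0
        if streak >= passes_required:
            return True
    return False
```

With `passes_required=1`, as in the lightweight configuration, a single "CORRECT" verdict is enough.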
128185

129-
### Phase 3: Iterative Improvement
186+
### Phase 4: Iterative Improvement (Parallel)
130187
1. Unverified solutions improved based on feedback
131188
2. Agents address specific issues identified in verification
132189
3. Re-verification of improved solutions
133-
4. Process continues until consensus or max iterations
190+
4. Process continues until consensus or max iterations (5 default)
191+
5. Parallel execution of improvement and verification
134192

135-
### Phase 4: Final Synthesis
136-
1. Best verified solution selected as final answer
137-
2. If no verified solutions, synthesis from all attempts
138-
3. High-effort reasoning applied to synthesis
139-
4. Complete solution with mathematical rigor
193+
### Phase 5: Final Synthesis
194+
1. **Numerical voting**: If 2+ agents agree on same numerical answer, use that solution
195+
2. **Best verified solution**: Otherwise, select highest-scoring verified solution
196+
3. **Synthesis**: If no verified solution, synthesize from top 3 solutions
197+
4. **Answer extraction**: Apply thinking tags and extract clean answer (if enabled)
198+
5. Complete solution with mathematical rigor
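
The selection order in steps 1 to 3 can be sketched as a single function. The dictionary keys (`answer`, `verified`, `score`) and the name `select_final` are assumptions made for this illustration, as is picking the highest-scoring solution among those that agree.

```python
from collections import Counter


def select_final(solutions):
    """Choose the final solution following the Phase 5 order (illustrative).

    solutions: list of dicts with keys 'answer' (number or None),
    'verified' (bool), and 'score' (float).
    Returns (strategy, payload), where payload is one solution, or the
    top three solutions when synthesis is needed.
    """
    answers = [s["answer"] for s in solutions if s["answer"] is not None]
    if answers:
        answer, votes = Counter(answers).most_common(1)[0]
        if votes >= 2:  # numerical voting: 2+ agents agree
            agreeing = [s for s in solutions if s["answer"] == answer]
            return "vote", max(agreeing, key=lambda s: s["score"])
    verified = [s for s in solutions if s["verified"]]
    if verified:  # best verified solution
        return "verified", max(verified, key=lambda s: s["score"])
    # No verified solution: hand the top three to a synthesis step.
    top = sorted(solutions, key=lambda s: s["score"], reverse=True)[:3]
    return "synthesize", top
```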
140199

141200
## Evaluation
142201

@@ -163,9 +222,8 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
163222
|-----------|----------|----------|---------|----------|-------|
164223
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
165224
| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** |
166-
| **IMO 2025** | Baseline | 6 | 3 | 50.0% | Problems 2, 4 & 5 correct |
167-
| **IMO 2025** | MARS (w/ thinking) | 6 | 0 | 0.0% | Thinking tags hid proofs |
168-
| **IMO 2025** | MARS (fixed) | 6 | TBD | TBD% | Proof visibility fixes needed |
225+
| **IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
226+
| **IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | **+1 problem (+16.6pp)** |
169227
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
170228
| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+29.3%)** |
171229

@@ -175,7 +233,16 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
175233
- **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
176234
- **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points
177235
- **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems
178-
- **Approach**: 5 agents with diverse temperatures, iterative verification and refinement
236+
- **Approach**: 3 agents with diverse temperatures, iterative verification and refinement
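
The two improvement figures measure different things: percentage points are the difference between the two accuracies, while relative improvement is the gain divided by the baseline count. A small helper (the name `improvement` is hypothetical) shows the arithmetic behind the AIME numbers:

```python
def improvement(baseline_correct, mars_correct, total):
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    baseline_pct = 100 * baseline_correct / total
    mars_pct = 100 * mars_correct / total
    points = round(mars_pct - baseline_pct, 1)
    relative = round(100 * (mars_correct - baseline_correct) / baseline_correct, 1)
    return points, relative


# AIME 2025: 13/30 baseline vs 22/30 with MARS
print(improvement(13, 22, 30))  # (30.0, 69.2)
```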
237+
238+
#### IMO 2025: Proof-Based Competition Problems
239+
240+
- **Results**: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
241+
- **Improvement**: +1 problem (+100% relative improvement), +16.6 percentage points
242+
- **Problems Solved**: Problem 2 (geometry proof) + Problem 4 (number theory)
243+
- **Runtime**: ~10 minutes per problem (vs ~40 seconds baseline)
244+
- **Key Success Factor**: Multi-agent exploration with thinking tags disabled, so the full proof is visible to the evaluator
245+
- **Configuration**: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems
179246

180247
#### LiveCodeBench: Strong Performance with Thinking Tags
181248
- **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
@@ -184,14 +251,26 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
184251
- **Key Success Factor**: Thinking tags beneficial for code generation - allows agents to reason through logic before writing code
185252
- **Multi-agent benefit**: Different temperature agents explore varied solution approaches
186253

187-
#### IMO 2025 Proof-Based Problems
188-
- **Initial Challenge**: MARS scored lower than baseline (0/6 vs 3/6, baseline solved problems 2, 4, 5)
189-
- **Root Cause**: Thinking tags hid 80-85% of proof content from evaluator - proofs inside `<think>` tags not visible
190-
- **Solution**: Disable thinking tags for proof-based problems via `mars_config`
191-
- **Status**: Re-evaluation needed with proof visibility fixes
192-
- **Key Lesson**: Thinking tags are **problem-type dependent** - helpful for code/numerical, harmful for proofs
254+
#### Lessons Learned
255+
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
256+
2. **MARS improves proof-based problems**: +100% relative improvement on IMO 2025 (16.7% → 33.3%)
257+
3. **Thinking tags are problem-type dependent**:
258+
   - **Enable for code generation**: +29.3% improvement on LiveCodeBench
259+
   - **Enable for numerical problems**: Multi-agent reasoning effective on AIME
260+
   - **Disable for proof problems**: IMO proofs need full visibility to evaluators
261+
4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
262+
5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
193263

194-
#### Configuration for IMO Problems
264+
### Completed Evaluations
265+
266+
- **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
267+
- **IMO 2025**: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) **+16.6pp improvement**
268+
- **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**
269+
270+
*All evaluations use the `google/gemini-2.5-flash-lite-preview-09-2025` model via OpenRouter.*
271+
272+
### Configuration for IMO Proof Problems
273+
For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
195274
```python
196275
extra_body = {
197276
"optillm_approach": "mars",
@@ -202,39 +281,23 @@ extra_body = {
202281
}
203282
```
204283

205-
#### Lessons Learned
206-
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
207-
2. **Thinking tags are problem-type dependent**:
208-
   - **Enable for code generation**: +29.3% improvement on LiveCodeBench
209-
   - **Enable for numerical problems**: Multi-agent reasoning effective on AIME
210-
   - **Disable for mathematical proofs**: Hides critical reasoning from evaluators
211-
3. **Answer extraction** must be disabled for proof-based problems - the proof IS the answer
212-
4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
213-
5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
284+
*All evaluations use pass@1 accuracy metric.*
214285

215-
### Completed Evaluations (google/gemini-2.5-flash-lite-preview-09-2025)
216-
- **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
217-
- **IMO 2025**: Baseline 3/6 (50.0%), MARS with thinking tags 0/6 (0.0% - proofs hidden)
218-
- **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**
286+
## Implementation Details
219287

220-
### Ongoing Work
221-
- 🔄 IMO 2025 MARS re-evaluation with proof visibility fixes (disable thinking tags)
288+
### Temperature Diversity Strategy (3-Agent Default)
289+
- **Agent 0**: Temperature 0.3 (Conservative, rigorous, low effort)
290+
- **Agent 1**: Temperature 0.6 (Balanced approach, medium effort)
291+
- **Agent 2**: Temperature 1.0 (Maximum exploration, high effort)
222292

223-
*All evaluations use pass@1 accuracy metric.*
293+
**Note**: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)
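
The cycling rule in the note can be sketched with `itertools`. The names `BASE_TEMPERATURES` and `assign_temperatures` are hypothetical and only illustrate the pattern.

```python
from itertools import cycle, islice

BASE_TEMPERATURES = (0.3, 0.6, 1.0)  # the 3-agent default listed above


def assign_temperatures(num_agents):
    """Assign temperatures to agents by cycling through the base values."""
    return list(islice(cycle(BASE_TEMPERATURES), num_agents))


print(assign_temperatures(5))  # [0.3, 0.6, 1.0, 0.3, 0.6]
```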
224294

225-
## Implementation Details
295+
### Reasoning Effort Allocation (OpenRouter API)
296+
- **Low effort** (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning
297+
- **Medium effort** (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning
298+
- **High effort** (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth
226299

227-
### Temperature Diversity Strategy
228-
- **Agent 0**: Temperature 0.3 (Conservative, rigorous)
229-
- **Agent 1**: Temperature 0.5 (Balanced approach)
230-
- **Agent 2**: Temperature 0.7 (Creative exploration)
231-
- **Agent 3**: Temperature 0.9 (High creativity)
232-
- **Agent 4**: Temperature 1.0 (Maximum exploration)
233-
234-
### Reasoning Budget Allocation
235-
- **Low effort (temp ≤ 0.4)**: 20% of reasoning budget
236-
- **Medium effort (0.4 < temp ≤ 0.7)**: 50% of reasoning budget
237-
- **High effort (temp > 0.7)**: 80% of reasoning budget
300+
**Note**: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
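
The temperature thresholds above map directly onto the `reasoning` payload. A minimal sketch, with hypothetical function names, assuming the boundaries are inclusive exactly as listed (0.4 is low, 0.8 is medium):

```python
def effort_for_temperature(temperature):
    """Map an agent temperature to an effort level using the listed thresholds."""
    if temperature <= 0.4:
        return "low"
    if temperature <= 0.8:
        return "medium"
    return "high"


def reasoning_params(temperature):
    """Build the reasoning fragment shown in the bullets above."""
    return {"reasoning": {"effort": effort_for_temperature(temperature)}}


print(reasoning_params(0.6))  # {'reasoning': {'effort': 'medium'}}
```

With the default temperatures (0.3, 0.6, 1.0), the three agents get low, medium, and high effort respectively.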
238301

239302
### Verification Criteria
240303
Solutions are verified based on:
