| 1 | +# MARS: Multi-Agent Reasoning System |
| 2 | + |
| 3 | +A sophisticated multi-agent reasoning system designed for challenging mathematical problems, inspired by systems like Gemini 2.5 Pro Deep Think and the successful IMO25 solver. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through: |
| 8 | +- **Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable) |
| 9 | +- **Rigorous verification** using a 2-pass consensus threshold (configurable) |
| 10 | +- **Iterative improvement** based on verification feedback |
| 11 | +- **OpenRouter reasoning API** for deep mathematical thinking |
| 12 | +- **RSA-inspired aggregation** for solution refinement |
| 13 | +- **Strategy network** for cross-agent insight sharing |
| 14 | +- **Shared workspace** for agent collaboration |
| 15 | + |
| 16 | +## Key Features |
| 17 | + |
| 18 | +### 1. Multi-Agent Architecture |
| 19 | +- **3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced) |
| 20 | +- **Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies |
| 21 | +- **Independent reasoning** followed by collaborative verification |
| 22 | + |
| 23 | +### 2. OpenRouter Reasoning API Integration |
| 24 | +- **Effort-based reasoning**: "low", "medium", "high" effort levels via OpenRouter API |
| 25 | +- **Adaptive allocation**: Low effort (temp ≤ 0.4), Medium (0.4-0.8), High (> 0.8) |
| 26 | +- **Configurable token budgets**: 4K for lightweight coding, 64K for complex reasoning |
| 27 | + |
| 28 | +### 3. Verification System |
| 29 | +- **2-pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced) |
| 30 | +- **Cross-agent verification**: Agents verify each other's solutions |
| 31 | +- **Mathematical rigor**: Focus on complete proofs, not just correct answers |
| 32 | +- **Consensus building**: Multiple verified solutions required |
| 33 | + |
| 34 | +### 4. Iterative Improvement |
| 35 | +- **Feedback-driven**: Solutions improved based on verification feedback |
| 36 | +- **Error correction**: Automatic identification and fixing of mathematical errors |
| 37 | +- **Logical gap filling**: Strengthening incomplete reasoning steps |
| 38 | + |
| 39 | +## Architecture Components |
| 40 | + |
| 41 | +``` |
| 42 | +optillm/mars/ |
| 43 | +├── __init__.py # Package exports |
| 44 | +├── mars.py # Main orchestration with parallel execution |
| 45 | +├── agent.py # Individual agent implementation |
| 46 | +├── workspace.py # Shared collaboration workspace |
| 47 | +├── verifier.py # Multi-pass verification system |
| 48 | +├── aggregator.py # RSA-inspired solution aggregation |
| 49 | +├── strategy_network.py # Cross-agent insight sharing |
| 50 | +├── answer_extraction.py # Clean answer extraction with thinking tags |
| 51 | +├── prompts.py # Mathematical reasoning prompts |
| 52 | +└── README.md # This documentation |
| 53 | +``` |
| 54 | + |
| 55 | +## Configuration |
| 56 | + |
| 57 | +### Default Configuration (Mathematical Reasoning) |
| 58 | +```python |
| 59 | +DEFAULT_CONFIG = { |
| 60 | + 'num_agents': 3, # Number of parallel agents |
| 61 | + 'max_iterations': 5, # Maximum improvement iterations |
| 62 | + 'verification_passes_required': 2, # Consecutive passes needed |
| 63 | + 'consensus_threshold': 2, # Verified solutions for consensus |
| 64 | + 'min_verified_solutions': 1, # Minimum to proceed |
| 65 | + 'max_tokens': 64000, # Token budget for complex reasoning |
| 66 | + 'max_verification_attempts': 3, # Max verification retries |
| 67 | + 'early_termination': True, # Stop on consensus |
| 68 | + 'use_reasoning_api': True, # Enable OpenRouter reasoning |
| 69 | + # RSA-inspired aggregation |
| 70 | + 'enable_aggregation': True, # Enable solution aggregation |
| 71 | + 'population_size': 6, # Population for diversity |
| 72 | + 'aggregation_size': 3, # Solutions per aggregation |
| 73 | + 'aggregation_loops': 3, # Aggregation iterations |
| 74 | + # Strategy Network |
| 75 | + 'enable_strategy_network': True, # Cross-agent insight sharing |
| 76 | + 'strategy_extraction_enabled': True, # Extract reasoning strategies |
| 77 | + 'cross_agent_enhancement': True, # Enhanced solutions via peer strategies |
| 78 | + # Thinking tags and answer extraction |
| 79 | + 'use_thinking_tags': True, # Wrap reasoning in <think> tags |
| 80 | + 'answer_extraction_mode': 'auto', # 'auto', 'code', 'math', or 'none' |
| 81 | +} |
| 82 | +``` |
| 83 | + |
| 84 | +### Lightweight Configuration (Coding Benchmarks) |
| 85 | +```python |
| 86 | +LIGHTWEIGHT_CONFIG = { |
| 87 | + 'num_agents': 2, # Reduced agent count |
| 88 | + 'max_iterations': 2, # Faster iteration limit |
| 89 | + 'verification_passes_required': 1, # Single-pass verification |
| 90 | + 'consensus_threshold': 1, # Lower threshold for 2 agents |
| 91 | + 'min_verified_solutions': 1, |
| 92 | + 'max_tokens': 4000, # Smaller token budget |
| 93 | + 'max_verification_attempts': 2, |
| 94 | + 'early_termination': True, |
| 95 | + 'use_reasoning_api': True, |
| 96 | + # Disable expensive features for speed |
| 97 | + 'enable_aggregation': False, # Skip RSA aggregation |
| 98 | + 'enable_strategy_network': False, # Skip strategy network |
| 99 | + 'strategy_extraction_enabled': False, |
| 100 | + 'cross_agent_enhancement': False, |
| 101 | + # Thinking tags still enabled |
| 102 | + 'use_thinking_tags': True, |
| 103 | + 'answer_extraction_mode': 'auto', |
| 104 | +} |
| 105 | +``` |
| 106 | + |
| 107 | +**Note**: MARS automatically uses lightweight config when `max_tokens ≤ 4000` in the request. |
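The automatic switch described in the note can be pictured with a minimal sketch. The function name `select_config` and the trimmed-down stand-in dictionaries are hypothetical and used only for illustration; the actual selection logic lives in `mars.py` and may differ.

```python
# Hypothetical sketch of the config switch; not the actual MARS code.
LIGHTWEIGHT_TOKEN_LIMIT = 4000  # threshold stated in the note above

# Trimmed stand-ins for the two configurations documented in this README.
DEFAULT_CONFIG = {"num_agents": 3, "max_tokens": 64000, "enable_aggregation": True}
LIGHTWEIGHT_CONFIG = {"num_agents": 2, "max_tokens": 4000, "enable_aggregation": False}


def select_config(request_max_tokens=None):
    """Return a copy of the config that applies to a request.

    Assumption: a request that does not set max_tokens gets the default config.
    """
    if request_max_tokens is not None and request_max_tokens <= LIGHTWEIGHT_TOKEN_LIMIT:
        return dict(LIGHTWEIGHT_CONFIG)
    return dict(DEFAULT_CONFIG)


print(select_config(4000)["num_agents"])   # lightweight: 2 agents
print(select_config(16000)["num_agents"])  # default: 3 agents
```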
| 108 | + |
| 109 | +## Usage |
| 110 | + |
| 111 | +### Via OptiLLM Server |
| 112 | +```bash |
| 113 | +# Start OptiLLM with MARS support |
| 114 | +python optillm.py --model google/gemini-2.5-flash-lite --approach mars |
| 115 | + |
| 116 | +# Make API call |
| 117 | +curl -X POST http://localhost:8000/v1/chat/completions \ |
| 118 | + -H "Content-Type: application/json" \ |
| 119 | + -d '{ |
| 120 | + "model": "mars-google/gemini-2.5-flash-lite", |
| 121 | + "messages": [ |
| 122 | + { |
| 123 | + "role": "user", |
| 124 | + "content": "Solve this IMO problem: Find all positive integers n such that..." |
| 125 | + } |
| 126 | + ] |
| 127 | + }' |
| 128 | +``` |
| 129 | + |
| 130 | +### Via extra_body Parameter |
| 131 | +```python |
| 132 | +import openai |
| 133 | + |
| 134 | +client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="anything") |
| 135 | + |
| 136 | +response = client.chat.completions.create( |
| 137 | + model="google/gemini-2.5-flash-lite", |
| 138 | + messages=[ |
| 139 | + {"role": "user", "content": "Mathematical problem here"} |
| 140 | + ], |
| 141 | + extra_body={"optillm_approach": "mars"} |
| 142 | +) |
| 143 | +``` |
| 144 | + |
| 145 | +### Via Prompt Tags |
| 146 | +```python |
| 147 | +response = client.chat.completions.create( |
| 148 | + model="google/gemini-2.5-flash-lite", |
| 149 | + messages=[ |
| 150 | + {"role": "system", "content": "<optillm_approach>mars</optillm_approach>"}, |
| 151 | + {"role": "user", "content": "Mathematical problem here"} |
| 152 | + ] |
| 153 | +) |
| 154 | +``` |
| 155 | + |
| 156 | +## Process Flow |
| 157 | + |
| 158 | +### Phase 1: Multi-Agent Exploration (Parallel) |
| 159 | +1. Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0) |
| 160 | +2. Each agent independently analyzes the problem |
| 161 | +3. Generate initial solutions using OpenRouter reasoning API with effort levels |
| 162 | +4. All agent API calls executed in parallel via ThreadPoolExecutor |
| 163 | +5. Solutions stored in shared workspace |
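The parallel exploration step can be sketched as follows. This is an illustration only: `explore_in_parallel` and `fake_generate` are hypothetical names, and the stand-in generator replaces the real OpenRouter call so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def explore_in_parallel(problem, temperatures, generate):
    """Run one generate(problem, temperature) call per agent in parallel."""
    with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
        futures = [pool.submit(generate, problem, t) for t in temperatures]
        # Collect results in submission order so entry i belongs to agent i.
        return [future.result() for future in futures]


def fake_generate(problem, temperature):
    # Stand-in for an LLM call; a real agent would query the model here.
    return {"temperature": temperature, "solution": f"draft for {problem!r}"}


workspace = explore_in_parallel("x^2 = 4", [0.3, 0.6, 1.0], fake_generate)
for entry in workspace:
    print(entry)
```

Threads suit this step because each agent spends nearly all of its time waiting on a network response rather than computing.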
| 164 | + |
| 165 | +### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel) |
| 166 | +1. Maintain population of N=6 solutions for diversity |
| 167 | +2. Select K=3 solutions for aggregation |
| 168 | +3. Run T=3 aggregation loops to refine solutions |
| 169 | +4. Parallel execution of aggregation API calls |
| 170 | +5. Enhanced solutions added back to workspace |
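A rough sketch of an aggregation loop with these parameters follows. It assumes that each loop rebuilds the population by aggregating randomly sampled groups of K solutions; the helper `aggregate_population` and the toy `aggregate` callable are hypothetical, and the real logic in `aggregator.py` may select and merge solutions differently.

```python
import random


def aggregate_population(population, aggregate, k=3, loops=3, seed=0):
    """Refine a population by repeatedly aggregating random groups of k.

    Assumes len(population) >= k. `aggregate` takes a list of k solutions
    and returns one combined solution (an LLM call in a real system).
    """
    rng = random.Random(seed)
    size = len(population)
    for _ in range(loops):
        # Each new member is built from k members of the previous population.
        population = [aggregate(rng.sample(population, k)) for _ in range(size)]
    return population


# Toy example: "solutions" are scores and aggregation keeps the best of each group.
initial = [1, 5, 3, 2, 4, 6]
refined = aggregate_population(initial, aggregate=max)
print(refined)
```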
| 171 | + |
| 172 | +### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel) |
| 173 | +1. Extract reasoning strategies from agent solutions |
| 174 | +2. Identify successful patterns and techniques |
| 175 | +3. Share strategies across agents |
| 176 | +4. Generate enhanced solutions using peer insights |
| 177 | +5. Parallel execution of strategy extraction and enhancement |
| 178 | + |
| 179 | +### Phase 3: Verification System (Parallel) |
| 180 | +1. Cross-agent verification of all solutions |
| 181 | +2. Each solution requires 2 consecutive "CORRECT" assessments (configurable) |
| 182 | +3. Verification feedback captured for improvement |
| 183 | +4. Solutions marked as verified/unverified |
| 184 | +5. Parallel execution of verification calls |
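The consecutive-pass rule in step 2 can be sketched as below. The function `verify_solution` and the `(verdict, note)` return shape are assumptions made for illustration, as is the way `max_attempts` bounds the loop; `verifier.py` is the authoritative implementation.

```python
def verify_solution(solution, verify_once, passes_required=2, max_attempts=3):
    """Return (verified, feedback) under a consecutive-pass rule.

    `verify_once(solution)` is assumed to return a (verdict, note) pair,
    where verdict is "CORRECT" or anything else for a failed check.
    """
    consecutive = 0
    feedback = []
    for _ in range(max_attempts):
        verdict, note = verify_once(solution)
        if verdict == "CORRECT":
            consecutive += 1
            if consecutive >= passes_required:
                return True, feedback
        else:
            consecutive = 0        # a failure resets the streak
            feedback.append(note)  # kept for the improvement phase
    return False, feedback


# Scripted verifier: pass, fail, pass. The streak never reaches 2.
verdicts = iter([("CORRECT", ""), ("INCORRECT", "gap in step 2"), ("CORRECT", "")])
print(verify_solution("draft proof", lambda s: next(verdicts)))
```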
| 185 | + |
| 186 | +### Phase 4: Iterative Improvement (Parallel) |
| 187 | +1. Unverified solutions improved based on feedback |
| 188 | +2. Agents address specific issues identified in verification |
| 189 | +3. Re-verification of improved solutions |
| 190 | +4. Process continues until consensus or max iterations (5 default) |
| 191 | +5. Parallel execution of improvement and verification |
| 192 | + |
| 193 | +### Phase 5: Final Synthesis |
| 194 | +1. **Numerical voting**: If 2+ agents agree on the same numerical answer, use that solution |
| 195 | +2. **Best verified solution**: Otherwise, select highest-scoring verified solution |
| 196 | +3. **Synthesis**: If no verified solution, synthesize from top 3 solutions |
| 197 | +4. **Answer extraction**: Apply thinking tags and extract clean answer (if enabled) |
| 198 | +5. Complete solution with mathematical rigor |
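The selection order in steps 1 to 3 can be sketched as a simple fallback chain. The dictionary keys `answer`, `verified`, and `score`, and the function `select_final`, are hypothetical; the sketch also assumes that when several agents agree, the highest-scoring agreeing solution is the one returned.

```python
from collections import Counter


def select_final(solutions, min_votes=2):
    """Return (strategy, candidates) following the fallback order above."""
    votes = Counter(s["answer"] for s in solutions if s["answer"] is not None)
    if votes:
        answer, count = votes.most_common(1)[0]
        if count >= min_votes:
            agreeing = [s for s in solutions if s["answer"] == answer]
            return "vote", [max(agreeing, key=lambda s: s["score"])]
    verified = [s for s in solutions if s["verified"]]
    if verified:
        return "best_verified", [max(verified, key=lambda s: s["score"])]
    # No verified solution: hand the top 3 to a synthesis step.
    top = sorted(solutions, key=lambda s: s["score"], reverse=True)[:3]
    return "synthesize", top


solutions = [
    {"answer": "42", "verified": False, "score": 0.5},
    {"answer": "42", "verified": True, "score": 0.9},
    {"answer": "7", "verified": True, "score": 0.95},
]
strategy, chosen = select_final(solutions)
print(strategy, chosen[0]["answer"])
```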
| 199 | + |
| 200 | +## Evaluation |
| 201 | + |
| 202 | +MARS is designed to excel on challenging mathematical benchmarks: |
| 203 | + |
| 204 | +- **IMO (International Mathematical Olympiad)**: Complex proof-based problems |
| 205 | +- **AIME (American Invitational Mathematics Examination)**: Numerical competition problems |
| 206 | +- **LiveCodeBench**: Competitive programming challenges |
| 207 | +- **Mathematical reasoning tasks**: General problem-solving capabilities |
| 208 | + |
| 209 | +### Performance Metrics |
| 210 | +- **Accuracy**: Percentage of correctly solved problems |
| 211 | +- **Verification Rate**: Percentage of solutions passing the configured verification threshold (2 passes by default) |
| 212 | +- **Reasoning Efficiency**: Tokens used per correct solution |
| 213 | +- **Consensus Quality**: Agreement between verified solutions |
| 214 | + |
| 215 | +## Benchmark Results |
| 216 | + |
| 217 | +### Gemini 2.5 Flash Lite Preview Model |
| 218 | + |
| 219 | +Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via OpenRouter: |
| 220 | + |
| 221 | +| Benchmark | Approach | Problems | Correct | Accuracy | Notes | |
| 222 | +|-----------|----------|----------|---------|----------|-------| |
| 223 | +| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 | |
| 224 | +| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** | |
| 225 | +| **IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct | |
| 226 | +| **IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | **+1 problem (+16.7pp)** | |
| 227 | +| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 | |
| 228 | +| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+11.43pp, +29.3% relative)** | |
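The table and the findings that follow quote two different measures: percentage-point (pp) gain and relative improvement. The short helper below, whose name `improvement` is chosen here for illustration, recomputes both from the raw counts in the table.

```python
def improvement(baseline_correct, mars_correct, total):
    """Return the percentage-point gain and the relative improvement."""
    gained = mars_correct - baseline_correct
    return {
        "pp": round(gained / total * 100, 2),
        "relative_pct": round(gained / baseline_correct * 100, 1),
    }


print("AIME 2025:", improvement(13, 22, 30))
print("IMO 2025:", improvement(1, 2, 6))
print("LiveCodeBench:", improvement(41, 53, 105))
```

The percentage-point gain depends only on the number of additional problems solved, while the relative figure grows as the baseline shrinks, which is why a single extra IMO problem appears as +100%.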
| 229 | + |
| 230 | +### Key Findings |
| 231 | + |
| 232 | +#### AIME 2025: Significant Accuracy Improvement |
| 233 | +- **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%) |
| 234 | +- **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points |
| 235 | +- **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems |
| 236 | +- **Approach**: 3 agents with diverse temperatures, iterative verification and refinement |
| 237 | + |
| 238 | +#### IMO 2025: Proof-Based Competition Problems |
| 239 | + |
| 240 | +- **Results**: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%) |
| 241 | +- **Improvement**: +1 problem (+100% relative improvement), +16.7 percentage points |
| 242 | +- **Problems Solved**: Problem 2 (geometry proof) + Problem 4 (number theory) |
| 243 | +- **Runtime**: ~10 minutes per problem (vs ~40 seconds baseline) |
| 244 | +- **Key Success Factor**: Multi-agent exploration with thinking tags disabled, so the full proof stays visible to evaluators |
| 245 | +- **Configuration**: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems |
| 246 | + |
| 247 | +#### LiveCodeBench: Strong Performance with Thinking Tags |
| 248 | +- **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%) |
| 249 | +- **Improvement**: +12 problems (+29.3% relative improvement), +11.43 percentage points |
| 250 | +- **Code Extraction**: 87/105 (82.9%) vs baseline 54/105 (51.4%) - **+61.1% relative improvement** |
| 251 | +- **Key Success Factor**: Thinking tags beneficial for code generation - allows agents to reason through logic before writing code |
| 252 | +- **Multi-agent benefit**: Different temperature agents explore varied solution approaches |
| 253 | + |
| 254 | +#### Lessons Learned |
| 255 | +1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%) |
| 256 | +2. **MARS improves proof-based problems**: +100% relative improvement on IMO 2025 (16.7% → 33.3%) |
| 257 | +3. **Thinking tags are problem-type dependent**: |
| 258 | + - ✅ **Enable for code generation**: +29.3% relative improvement on LiveCodeBench |
| 259 | + - ✅ **Enable for numerical problems**: Multi-agent reasoning effective on AIME |
| 260 | + - ❌ **Disable for proof problems**: IMO proofs need full visibility to evaluators |
| 261 | +4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches |
| 262 | +5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%) |
| 263 | + |
| 264 | +### Completed Evaluations |
| 265 | + |
| 266 | +- ✅ **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement** |
| 267 | +- ✅ **IMO 2025**: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) **+16.7pp improvement** |
| 268 | +- ✅ **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement** |
| 269 | + |
| 270 | +*All evaluations use gemini-2.5-flash-lite-preview-09-2025 model via OpenRouter.* |
| 271 | + |
| 272 | +### Configuration for IMO Proof Problems |
| 273 | +For proof-based problems like IMO, disable thinking tags to ensure full proof visibility: |
| 274 | +```python |
| 275 | +extra_body = { |
| 276 | + "optillm_approach": "mars", |
| 277 | + "mars_config": { |
| 278 | + "use_thinking_tags": False, # Full proof visibility |
| 279 | + "answer_extraction_mode": "none" # Proofs are the answer |
| 280 | + } |
| 281 | +} |
| 282 | +``` |
| 283 | + |
| 284 | +*All evaluations use pass@1 accuracy metric.* |
| 285 | + |
| 286 | +## Implementation Details |
| 287 | + |
| 288 | +### Temperature Diversity Strategy (3-Agent Default) |
| 289 | +- **Agent 0**: Temperature 0.3 (Conservative, rigorous, low effort) |
| 290 | +- **Agent 1**: Temperature 0.6 (Balanced approach, medium effort) |
| 291 | +- **Agent 2**: Temperature 1.0 (Maximum exploration, high effort) |
| 292 | + |
| 293 | +**Note**: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6) |
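The cycling rule in the note can be expressed in a few lines. The function name `assign_temperatures` is hypothetical; only the base temperatures and the 5-agent example come from this README.

```python
from itertools import cycle, islice

BASE_TEMPERATURES = (0.3, 0.6, 1.0)


def assign_temperatures(num_agents):
    """Cycle through the base temperatures until every agent has one."""
    return list(islice(cycle(BASE_TEMPERATURES), num_agents))


print(assign_temperatures(3))
print(assign_temperatures(5))
```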
| 294 | + |
| 295 | +### Reasoning Effort Allocation (OpenRouter API) |
| 296 | +- **Low effort** (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning |
| 297 | +- **Medium effort** (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning |
| 298 | +- **High effort** (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth |
| 299 | + |
| 300 | +**Note**: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities. |
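The thresholds above map directly onto a small helper. The name `reasoning_params` is hypothetical; the returned dictionary mirrors the payload fragments listed in the bullets.

```python
def reasoning_params(temperature):
    """Map an agent temperature to the reasoning-effort payload fragment."""
    if temperature <= 0.4:
        effort = "low"
    elif temperature <= 0.8:
        effort = "medium"
    else:
        effort = "high"
    return {"reasoning": {"effort": effort}}


for temp in (0.3, 0.6, 1.0):
    print(temp, reasoning_params(temp))
```

With the default temperatures, the three agents therefore run at low, medium, and high effort respectively.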
| 301 | + |
| 302 | +### Verification Criteria |
| 303 | +Solutions are verified based on: |
| 304 | +- **Mathematical correctness**: Accurate calculations and logic |
| 305 | +- **Completeness**: All problem aspects addressed |
| 306 | +- **Rigor**: Proper justification for each step |
| 307 | +- **Clarity**: Clear mathematical communication |
| 308 | +- **Format compliance**: Proper answer formatting |
| 309 | + |
| 310 | +## Inspired By |
| 311 | + |
| 312 | +- **IMO25 Solver**: 5/6 problems solved with 5-consecutive-pass verification |
| 313 | +- **Gemini 2.5 Pro Deep Think**: Native reasoning tokens and thinking budgets |
| 314 | +- **OpenRouter Reasoning API**: Standardized interface for deep thinking |
| 315 | +- **CEPO Architecture**: Multi-file approach pattern in OptiLLM |
| 316 | + |
| 317 | +## Future Enhancements |
| 318 | + |
| 319 | +- **Multi-model support**: Different models for different agent roles |
| 320 | +- **Dynamic temperature adjustment**: Adaptive exploration strategies |
| 321 | +- **Specialized agent roles**: Proof-focused, computation-focused, verification-focused |
| 322 | +- **Knowledge base integration**: Access to mathematical theorems and techniques |
| 323 | +- **Interactive verification**: Human-in-the-loop verification for critical problems |