@@ -5,26 +5,28 @@ A sophisticated multi-agent reasoning system designed for challenging mathematic
55## Overview
66
77MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
8- - ** Multi-agent exploration** with diverse reasoning approaches
9- - ** Rigorous verification** using a 5 -pass consensus threshold
8+ - ** Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable)
9+ - ** Rigorous verification** using a 2 -pass consensus threshold (configurable)
1010- ** Iterative improvement** based on verification feedback
1111- ** OpenRouter reasoning API** for deep mathematical thinking
12+ - ** RSA-inspired aggregation** for solution refinement
13+ - ** Strategy network** for cross-agent insight sharing
1214- ** Shared workspace** for agent collaboration
1315
1416## Key Features
1517
1618### 1. Multi-Agent Architecture
17- - ** 5 parallel agents** with different temperature settings (0.3-1.0 )
18- - ** Temperature diversity** ensures varied exploration strategies
19+ - ** 3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced )
20+ - ** Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies
1921- ** Independent reasoning** followed by collaborative verification
2022
2123### 2. OpenRouter Reasoning API Integration
22- - ** Thinking tokens ** : Up to 32,768 tokens for deep reasoning
23- - ** Effort levels ** : Low (20% ), Medium (50% ), High (80%) reasoning budgets
24- - ** Adaptive allocation ** based on agent temperature and problem complexity
24+ - ** Effort-based reasoning ** : "low", "medium", "high" effort levels via OpenRouter API
25+ - ** Adaptive allocation ** : Low effort (temp ≤ 0.4), Medium (0.4 < temp ≤ 0.8), High (temp > 0.8)
26+ - ** Configurable token budgets ** : 4K for lightweight coding, 64K for complex reasoning
2527
2628### 3. Verification System
27- - ** 5 -pass threshold** : Solutions must pass 5 consecutive verifications
29+ - ** 2 -pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced)
2830- ** Cross-agent verification** : Agents verify each other's solutions
2931- ** Mathematical rigor** : Focus on complete proofs, not just correct answers
3032- ** Consensus building** : Multiple verified solutions required
@@ -38,33 +40,72 @@ MARS leverages multiple AI agents working collaboratively to solve complex mathe
3840
3941```
4042optillm/mars/
41- ├── __init__.py # Package exports
42- ├── mars.py # Main orchestration logic
43- ├── agent.py # Individual agent implementation
44- ├── workspace.py # Shared collaboration workspace
45- ├── verifier.py # 5-pass verification system
46- ├── prompts.py # Mathematical reasoning prompts
47- └── README.md # This documentation
43+ ├── __init__.py # Package exports
44+ ├── mars.py # Main orchestration with parallel execution
45+ ├── agent.py # Individual agent implementation
46+ ├── workspace.py # Shared collaboration workspace
47+ ├── verifier.py # Multi-pass verification system
48+ ├── aggregator.py # RSA-inspired solution aggregation
49+ ├── strategy_network.py # Cross-agent insight sharing
50+ ├── answer_extraction.py # Clean answer extraction with thinking tags
51+ ├── prompts.py # Mathematical reasoning prompts
52+ └── README.md # This documentation
4853```
4954
5055## Configuration
5156
52- ### Default Configuration
57+ ### Default Configuration (Mathematical Reasoning)
5358``` python
5459DEFAULT_CONFIG = {
55- ' num_agents' : 5 , # Number of parallel agents
56- ' max_iterations' : 30 , # Maximum improvement iterations
57- ' verification_passes_required' : 5 , # Consecutive passes needed
58- ' consensus_threshold' : 2 , # Verified solutions for consensus
59- ' min_verified_solutions' : 1 , # Minimum to proceed
60- ' thinking_budget_initial' : 10000 , # Initial reasoning tokens
61- ' thinking_budget_max' : 32000 , # Maximum reasoning tokens
62- ' max_response_tokens' : 4096 , # Maximum response length
63- ' early_termination' : True , # Stop on consensus
64- ' use_reasoning_api' : True # Enable OpenRouter reasoning
60+ ' num_agents' : 3 , # Number of parallel agents
61+ ' max_iterations' : 5 , # Maximum improvement iterations
62+ ' verification_passes_required' : 2 , # Consecutive passes needed
63+ ' consensus_threshold' : 2 , # Verified solutions for consensus
64+ ' min_verified_solutions' : 1 , # Minimum to proceed
65+ ' max_tokens' : 64000 , # Token budget for complex reasoning
66+ ' max_verification_attempts' : 3 , # Max verification retries
67+ ' early_termination' : True , # Stop on consensus
68+ ' use_reasoning_api' : True , # Enable OpenRouter reasoning
69+ # RSA-inspired aggregation
70+ ' enable_aggregation' : True , # Enable solution aggregation
71+ ' population_size' : 6 , # Population for diversity
72+ ' aggregation_size' : 3 , # Solutions per aggregation
73+ ' aggregation_loops' : 3 , # Aggregation iterations
74+ # Strategy Network
75+ ' enable_strategy_network' : True , # Cross-agent insight sharing
76+ ' strategy_extraction_enabled' : True , # Extract reasoning strategies
77+ ' cross_agent_enhancement' : True , # Enhanced solutions via peer strategies
78+ # Thinking tags and answer extraction
79+ ' use_thinking_tags' : True , # Wrap reasoning in <think> tags
80+ ' answer_extraction_mode' : ' auto' , # 'auto', 'code', 'math', or 'none'
6581}
6682```
6783
84+ ### Lightweight Configuration (Coding Benchmarks)
85+ ``` python
86+ LIGHTWEIGHT_CONFIG = {
87+ ' num_agents' : 2 , # Reduced agent count
88+ ' max_iterations' : 2 , # Faster iteration limit
89+ ' verification_passes_required' : 1 , # Single-pass verification
90+ ' consensus_threshold' : 1 , # Lower threshold for 2 agents
91+ ' min_verified_solutions' : 1 ,
92+ ' max_tokens' : 4000 , # Smaller token budget
93+ ' max_verification_attempts' : 2 ,
94+ ' early_termination' : True ,
95+ ' use_reasoning_api' : True ,
96+ # Disable expensive features for speed
97+ ' enable_aggregation' : False , # Skip RSA aggregation
98+ ' enable_strategy_network' : False , # Skip strategy network
99+ ' strategy_extraction_enabled' : False ,
100+ ' cross_agent_enhancement' : False ,
101+ # Thinking tags still enabled
102+ ' use_thinking_tags' : True ,
103+ ' answer_extraction_mode' : ' auto' ,
104+ }
105+ ```
106+
107+ ** Note** : MARS automatically uses lightweight config when ` max_tokens ≤ 4000 ` in the request.
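
The selection rule in the note above can be sketched as follows. This is a minimal illustration, not the code in `mars.py`: `select_config` and `LIGHTWEIGHT_TOKEN_CUTOFF` are hypothetical names, the two dictionaries are abbreviated to a few keys, and falling back to the default config when the request carries no `max_tokens` is an assumption.

```python
# Abbreviated copies of the two configurations documented above.
DEFAULT_CONFIG = {"num_agents": 3, "max_iterations": 5, "max_tokens": 64000}
LIGHTWEIGHT_CONFIG = {"num_agents": 2, "max_iterations": 2, "max_tokens": 4000}

# Cutoff taken from the note above; the constant name is hypothetical.
LIGHTWEIGHT_TOKEN_CUTOFF = 4000


def select_config(request_max_tokens=None):
    """Return a copy of the config that matches the request's token budget."""
    # Assumption: a request without max_tokens gets the default config.
    if request_max_tokens is not None and request_max_tokens <= LIGHTWEIGHT_TOKEN_CUTOFF:
        return dict(LIGHTWEIGHT_CONFIG)
    return dict(DEFAULT_CONFIG)
```

Returning a copy keeps per-request overrides from leaking into the shared dictionaries.
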
108+
68109## Usage
69110
70111### Via OptiLLM Server
@@ -114,29 +155,47 @@ response = client.chat.completions.create(
114155
115156## Process Flow
116157
117- ### Phase 1: Multi-Agent Exploration
118- 1 . Initialize 5 agents with diverse temperatures
158+ ### Phase 1: Multi-Agent Exploration (Parallel)
159+ 1 . Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
1191602 . Each agent independently analyzes the problem
120- 3 . Generate initial solutions using OpenRouter reasoning API
121- 4 . Solutions stored in shared workspace
122-
123- ### Phase 2: Verification System
161+ 3 . Generate initial solutions using OpenRouter reasoning API with effort levels
162+ 4 . All agent API calls executed in parallel via ThreadPoolExecutor
163+ 5 . Solutions stored in shared workspace
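
The parallel exploration step could look roughly like the sketch below. Only `ThreadPoolExecutor` and the three temperatures come from the description above; `generate_solution` is a hypothetical stand-in for the real OpenRouter call, and the returned dictionary shape is assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

TEMPERATURES = [0.3, 0.6, 1.0]  # default 3-agent setup


def generate_solution(agent_id, temperature, problem):
    # Hypothetical placeholder: the real agent would call the reasoning API here.
    return {
        "agent_id": agent_id,
        "temperature": temperature,
        "solution": f"draft for: {problem}",
    }


def explore(problem, temperatures=TEMPERATURES):
    """Run one generation call per agent in parallel, keeping agent order."""
    with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
        futures = [
            pool.submit(generate_solution, agent_id, temp, problem)
            for agent_id, temp in enumerate(temperatures)
        ]
        # Collect in submission order so results line up with agent ids.
        return [future.result() for future in futures]
```

Threads fit this step because each call is I/O-bound: the agents spend their time waiting on the API rather than computing locally.
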
164+
165+ ### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel)
166+ 1 . Maintain population of N=6 solutions for diversity
167+ 2 . Select K=3 solutions for aggregation
168+ 3 . Run T=3 aggregation loops to refine solutions
169+ 4 . Parallel execution of aggregation API calls
170+ 5 . Enhanced solutions added back to workspace
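
One possible reading of the N/K/T loop is sketched below. It is an assumption about the control flow, not the contents of `aggregator.py`: `aggregate` is a hypothetical placeholder for the model call that merges candidates, and rebuilding the whole population on each loop is one interpretation among several.

```python
import random


def aggregate(candidates):
    # Hypothetical placeholder: the real step would ask the model to merge
    # the sampled candidates into one refined solution.
    return " | ".join(sorted(candidates))


def run_aggregation(population, k=3, loops=3, seed=0):
    """Refine a population by repeatedly aggregating k sampled members."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    size = len(population)     # N stays constant across loops
    for _ in range(loops):     # T aggregation loops
        population = [aggregate(rng.sample(population, k)) for _ in range(size)]
    return population
```

With the documented defaults this would be called on a population of 6 with `k=3` and `loops=3`.
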
171+
172+ ### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel)
173+ 1 . Extract reasoning strategies from agent solutions
174+ 2 . Identify successful patterns and techniques
175+ 3 . Share strategies across agents
176+ 4 . Generate enhanced solutions using peer insights
177+ 5 . Parallel execution of strategy extraction and enhancement
178+
179+ ### Phase 3: Verification System (Parallel)
1241801 . Cross-agent verification of all solutions
125- 2 . Each solution requires 5 consecutive "CORRECT" assessments
181+ 2 . Each solution requires 2 consecutive "CORRECT" assessments (configurable)
1261823 . Verification feedback captured for improvement
1271834 . Solutions marked as verified/unverified
184+ 5 . Parallel execution of verification calls
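
The consecutive-pass rule can be illustrated with a short sketch. `is_verified` is a hypothetical helper; the assumptions are that any verdict other than "CORRECT" resets the streak and that verdicts arrive as plain strings.

```python
def is_verified(assessments, passes_required=2):
    """Return True once `passes_required` consecutive "CORRECT" verdicts occur."""
    streak = 0
    for verdict in assessments:
        # Assumption: any non-"CORRECT" verdict resets the streak to zero.
        streak = streak + 1 if verdict == "CORRECT" else 0
        if streak >= passes_required:
            return True
    return False
```

Under this reading, `["CORRECT", "INCORRECT", "CORRECT"]` does not verify with the default threshold, because the two passes are not consecutive.
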
128185
129- ### Phase 3 : Iterative Improvement
186+ ### Phase 4 : Iterative Improvement (Parallel)
1301871 . Unverified solutions improved based on feedback
1311882 . Agents address specific issues identified in verification
1321893 . Re-verification of improved solutions
133- 4 . Process continues until consensus or max iterations
190+ 4 . Process continues until consensus or max iterations (5 default)
191+ 5 . Parallel execution of improvement and verification
134192
135- ### Phase 4: Final Synthesis
136- 1 . Best verified solution selected as final answer
137- 2 . If no verified solutions, synthesis from all attempts
138- 3 . High-effort reasoning applied to synthesis
139- 4 . Complete solution with mathematical rigor
193+ ### Phase 5: Final Synthesis
194+ 1 . ** Numerical voting** : If 2+ agents agree on the same numerical answer, use that answer's solution
195+ 2 . ** Best verified solution** : Otherwise, select highest-scoring verified solution
196+ 3 . ** Synthesis** : If no verified solution, synthesize from top 3 solutions
197+ 4 . ** Answer extraction** : Apply thinking tags and extract clean answer (if enabled)
198+ 5 . Complete solution with mathematical rigor
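
The numerical voting step might be sketched as below. `vote` is a hypothetical helper, answers are assumed to be already extracted and normalized to strings, and `None` stands for an agent that produced no numerical answer.

```python
from collections import Counter


def vote(answers, min_agreement=2):
    """Return the most common answer if enough agents agree, else None."""
    counted = Counter(a for a in answers if a is not None)
    if not counted:
        return None
    answer, count = counted.most_common(1)[0]
    # Fall through to the later selection steps when agreement is too weak.
    return answer if count >= min_agreement else None
```

A `None` result here corresponds to moving on to the best verified solution or to synthesis.
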
140199
141200## Evaluation
142201
@@ -163,9 +222,8 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
163222| -----------| ----------| ----------| ---------| ----------| -------|
164223| ** AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
165224| ** AIME 2025** | MARS | 30 | 22 | 73.3% | ** +9 problems (+30pp)** |
166- | ** IMO 2025** | Baseline | 6 | 3 | 50.0% | Problems 2, 4 & 5 correct |
167- | ** IMO 2025** | MARS (w/ thinking) | 6 | 0 | 0.0% | Thinking tags hid proofs |
168- | ** IMO 2025** | MARS (fixed) | 6 | TBD | TBD% | Proof visibility fixes needed |
225+ | ** IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
226+ | ** IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | ** +1 problem (+16.7pp)** |
169227| ** LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
170228| ** LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | ** +12 problems (+29.3% relative)** |
171229
@@ -175,7 +233,16 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
175233- ** Results** : 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
176234- ** Improvement** : +9 problems (+69.2% relative improvement), +30.0 percentage points
177235- ** Key Success Factor** : Multi-agent collaboration with verification effectively solves numerical competition problems
178- - ** Approach** : 5 agents with diverse temperatures, iterative verification and refinement
236+ - ** Approach** : 3 agents with diverse temperatures, iterative verification and refinement
237+
238+ #### IMO 2025: Proof-Based Competition Problems
239+
240+ - ** Results** : 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
241+ - ** Improvement** : +1 problem (+100% relative improvement), +16.7 percentage points
242+ - ** Problems Solved** : Problem 2 (geometry proof) + Problem 4 (number theory)
243+ - ** Runtime** : ~ 10 minutes per problem (vs ~ 40 seconds baseline)
244+ - ** Key Success Factor** : Multi-agent exploration with thinking tags disabled, so the full proof stays visible to the evaluator
245+ - ** Configuration** : ` use_thinking_tags=False ` , ` answer_extraction_mode="none" ` for proof problems
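
The results in this section mix two measures: percentage-point gain and relative gain. The sketch below shows how both follow from the raw counts; `improvement` is a hypothetical helper, and it works from counts rather than from already-rounded percentages to avoid compounding rounding error.

```python
def improvement(baseline_correct, mars_correct, total):
    """Compute percentage-point and relative gains from raw problem counts."""
    baseline_pct = 100 * baseline_correct / total
    mars_pct = 100 * mars_correct / total
    return {
        # Absolute gain in accuracy, in percentage points.
        "pp": round(mars_pct - baseline_pct, 1),
        # Gain relative to the number of problems the baseline solved.
        "relative_pct": round(
            100 * (mars_correct - baseline_correct) / baseline_correct, 1
        ),
    }
```

For example, `improvement(13, 22, 30)` takes the AIME 2025 counts and `improvement(1, 2, 6)` the IMO 2025 counts.
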
179246
180247#### LiveCodeBench: Strong Performance with Thinking Tags
181248- ** Results** : 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
@@ -184,14 +251,26 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
184251- ** Key Success Factor** : Thinking tags beneficial for code generation - allows agents to reason through logic before writing code
185252- ** Multi-agent benefit** : Different temperature agents explore varied solution approaches
186253
187- #### IMO 2025 Proof-Based Problems
188- - ** Initial Challenge** : MARS scored lower than baseline (0/6 vs 3/6, baseline solved problems 2, 4, 5)
189- - ** Root Cause** : Thinking tags hid 80-85% of proof content from evaluator - proofs inside ` <think> ` tags not visible
190- - ** Solution** : Disable thinking tags for proof-based problems via ` mars_config `
191- - ** Status** : Re-evaluation needed with proof visibility fixes
192- - ** Key Lesson** : Thinking tags are ** problem-type dependent** - helpful for code/numerical, harmful for proofs
254+ #### Lessons Learned
255+ 1 . ** MARS excels at numerical competition problems** : +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
256+ 2 . ** MARS improves proof-based problems** : +100% relative improvement on IMO 2025 (16.7% → 33.3%)
257+ 3 . ** Thinking tags are problem-type dependent** :
258+ - ✅ ** Enable for code generation** : +29.3% relative improvement on LiveCodeBench (39.05% → 50.48%)
259+ - ✅ ** Enable for numerical problems** : Multi-agent reasoning effective on AIME
260+ - ❌ ** Disable for proof problems** : IMO proofs need full visibility to evaluators
261+ 4 . ** Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
262+ 5 . ** Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
193263
194- #### Configuration for IMO Problems
264+ ### Completed Evaluations
265+
266+ - ✅ ** AIME 2025** : Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) ** +30pp improvement**
267+ - ✅ ** IMO 2025** : Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) ** +16.7pp improvement**
268+ - ✅ ** LiveCodeBench v5/v6** : Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) ** +11.43pp improvement**
269+
270+ * All evaluations use gemini-2.5-flash-lite-preview-09-2025 model via OpenRouter.*
271+
272+ ### Configuration for IMO Proof Problems
273+ For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
195274``` python
196275extra_body = {
197276 " optillm_approach" : " mars" ,
@@ -202,39 +281,23 @@ extra_body = {
202281}
203282```
204283
205- #### Lessons Learned
206- 1 . ** MARS excels at numerical competition problems** : +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
207- 2 . ** Thinking tags are problem-type dependent** :
208- - ✅ ** Enable for code generation** : +29.3% improvement on LiveCodeBench
209- - ✅ ** Enable for numerical problems** : Multi-agent reasoning effective on AIME
210- - ❌ ** Disable for mathematical proofs** : Hides critical reasoning from evaluators
211- 3 . ** Answer extraction** must be disabled for proof-based problems - the proof IS the answer
212- 4 . ** Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
213- 5 . ** Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
284+ * All evaluations use pass@1 accuracy metric.*
214285
215- ### Completed Evaluations (google/gemini-2.5-flash-lite-preview-09-2025)
216- - ✅ ** AIME 2025** : Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) ** +30pp improvement**
217- - ✅ ** IMO 2025** : Baseline 3/6 (50.0%), MARS with thinking tags 0/6 (0.0% - proofs hidden)
218- - ✅ ** LiveCodeBench v5/v6** : Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) ** +11.43pp improvement**
286+ ## Implementation Details
219287
220- ### Ongoing Work
221- - 🔄 IMO 2025 MARS re-evaluation with proof visibility fixes (disable thinking tags)
288+ ### Temperature Diversity Strategy (3-Agent Default)
289+ - ** Agent 0** : Temperature 0.3 (Conservative, rigorous, low effort)
290+ - ** Agent 1** : Temperature 0.6 (Balanced approach, medium effort)
291+ - ** Agent 2** : Temperature 1.0 (Maximum exploration, high effort)
222292
223- * All evaluations use pass@1 accuracy metric. *
293+ ** Note ** : Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)
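
The cycling described in the note can be expressed with a modulo index. A minimal sketch, with `assign_temperatures` as a hypothetical helper name:

```python
BASE_TEMPERATURES = [0.3, 0.6, 1.0]


def assign_temperatures(num_agents):
    """Assign temperatures by cycling through the three base values."""
    return [
        BASE_TEMPERATURES[i % len(BASE_TEMPERATURES)] for i in range(num_agents)
    ]
```

For 5 agents this yields the sequence given in the note: 0.3, 0.6, 1.0, 0.3, 0.6.
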
224294
225- ## Implementation Details
295+ ### Reasoning Effort Allocation (OpenRouter API)
296+ - ** Low effort** (temp ≤ 0.4): ` {"reasoning": {"effort": "low"}} ` - Conservative reasoning
297+ - ** Medium effort** (0.4 < temp ≤ 0.8): ` {"reasoning": {"effort": "medium"}} ` - Balanced reasoning
298+ - ** High effort** (temp > 0.8): ` {"reasoning": {"effort": "high"}} ` - Maximum reasoning depth
226299
227- ### Temperature Diversity Strategy
228- - ** Agent 0** : Temperature 0.3 (Conservative, rigorous)
229- - ** Agent 1** : Temperature 0.5 (Balanced approach)
230- - ** Agent 2** : Temperature 0.7 (Creative exploration)
231- - ** Agent 3** : Temperature 0.9 (High creativity)
232- - ** Agent 4** : Temperature 1.0 (Maximum exploration)
233-
234- ### Reasoning Budget Allocation
235- - ** Low effort (temp ≤ 0.4)** : 20% of reasoning budget
236- - ** Medium effort (0.4 < temp ≤ 0.7)** : 50% of reasoning budget
237- - ** High effort (temp > 0.7)** : 80% of reasoning budget
300+ ** Note** : OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
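
The temperature-to-effort mapping above can be sketched as a small function. The thresholds and the payload shape are taken from the list above; `reasoning_params` is a hypothetical name and not necessarily how `agent.py` structures it.

```python
def reasoning_params(temperature):
    """Map an agent's temperature to the reasoning payload listed above."""
    if temperature <= 0.4:
        effort = "low"
    elif temperature <= 0.8:
        effort = "medium"
    else:
        effort = "high"
    return {"reasoning": {"effort": effort}}
```

With the default temperatures (0.3, 0.6, 1.0), the three agents get low, medium, and high effort respectively.
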
238301
239302### Verification Criteria
240303Solutions are verified based on: