Update README.md

codelion · codelion · commit 3513d60bf65d · 2025-10-03T15:17:53.000+05:30
diff --git a/optillm/mars/README.md b/optillm/mars/README.md
@@ -5,26 +5,28 @@ A sophisticated multi-agent reasoning system designed for challenging mathematic
 ## Overview
 
 MARS leverages multiple AI agents working collaboratively to solve complex mathematical problems through:
-- **Multi-agent exploration** with diverse reasoning approaches
-- **Rigorous verification** using a 5-pass consensus threshold
+- **Multi-agent exploration** with diverse reasoning approaches (3 agents by default, configurable)
+- **Rigorous verification** using a 2-pass consensus threshold (configurable)
 - **Iterative improvement** based on verification feedback
 - **OpenRouter reasoning API** for deep mathematical thinking
+- **RSA-inspired aggregation** for solution refinement
+- **Strategy network** for cross-agent insight sharing
 - **Shared workspace** for agent collaboration
 
 ## Key Features
 
 ### 1. Multi-Agent Architecture
-- **5 parallel agents** with different temperature settings (0.3-1.0)
-- **Temperature diversity** ensures varied exploration strategies
+- **3 parallel agents** by default (configurable: 2 for lightweight, 3+ for advanced)
+- **Temperature diversity** (0.3, 0.6, 1.0) ensures varied exploration strategies
 - **Independent reasoning** followed by collaborative verification
 
 ### 2. OpenRouter Reasoning API Integration
-- **Thinking tokens**: Up to 32,768 tokens for deep reasoning
-- **Effort levels**: Low (20%), Medium (50%), High (80%) reasoning budgets
-- **Adaptive allocation** based on agent temperature and problem complexity
+- **Effort-based reasoning**: "low", "medium", "high" effort levels via OpenRouter API
+- **Adaptive allocation**: Low effort (temp ≤ 0.4), Medium (0.4-0.8), High (> 0.8)
+- **Configurable token budgets**: 4K for lightweight coding, 64K for complex reasoning
 
 ### 3. Verification System
-- **5-pass threshold**: Solutions must pass 5 consecutive verifications
+- **2-pass threshold** by default (configurable: 1 for lightweight, 2+ for advanced)
 - **Cross-agent verification**: Agents verify each other's solutions
 - **Mathematical rigor**: Focus on complete proofs, not just correct answers
 - **Consensus building**: Multiple verified solutions required
@@ -38,33 +40,72 @@ MARS leverages multiple AI agents working collaboratively to solve complex mathe
 
 ```
 optillm/mars/
-├── __init__.py           # Package exports
-├── mars.py               # Main orchestration logic
-├── agent.py              # Individual agent implementation
-├── workspace.py          # Shared collaboration workspace
-├── verifier.py           # 5-pass verification system
-├── prompts.py            # Mathematical reasoning prompts
-└── README.md             # This documentation
+├── __init__.py              # Package exports
+├── mars.py                  # Main orchestration with parallel execution
+├── agent.py                 # Individual agent implementation
+├── workspace.py             # Shared collaboration workspace
+├── verifier.py              # Multi-pass verification system
+├── aggregator.py            # RSA-inspired solution aggregation
+├── strategy_network.py      # Cross-agent insight sharing
+├── answer_extraction.py     # Clean answer extraction with thinking tags
+├── prompts.py               # Mathematical reasoning prompts
+└── README.md                # This documentation
 ```
 
 ## Configuration
 
-### Default Configuration
+### Default Configuration (Mathematical Reasoning)
 ```python
 DEFAULT_CONFIG = {
-    'num_agents': 5,                     # Number of parallel agents
-    'max_iterations': 30,                # Maximum improvement iterations
-    'verification_passes_required': 5,   # Consecutive passes needed
-    'consensus_threshold': 2,            # Verified solutions for consensus
-    'min_verified_solutions': 1,         # Minimum to proceed
-    'thinking_budget_initial': 10000,    # Initial reasoning tokens
-    'thinking_budget_max': 32000,        # Maximum reasoning tokens
-    'max_response_tokens': 4096,         # Maximum response length
-    'early_termination': True,           # Stop on consensus
-    'use_reasoning_api': True            # Enable OpenRouter reasoning
+    'num_agents': 3,                        # Number of parallel agents
+    'max_iterations': 5,                    # Maximum improvement iterations
+    'verification_passes_required': 2,      # Consecutive passes needed
+    'consensus_threshold': 2,               # Verified solutions for consensus
+    'min_verified_solutions': 1,            # Minimum to proceed
+    'max_tokens': 64000,                    # Token budget for complex reasoning
+    'max_verification_attempts': 3,         # Max verification retries
+    'early_termination': True,              # Stop on consensus
+    'use_reasoning_api': True,              # Enable OpenRouter reasoning
+    # RSA-inspired aggregation
+    'enable_aggregation': True,             # Enable solution aggregation
+    'population_size': 6,                   # Population for diversity
+    'aggregation_size': 3,                  # Solutions per aggregation
+    'aggregation_loops': 3,                 # Aggregation iterations
+    # Strategy Network
+    'enable_strategy_network': True,        # Cross-agent insight sharing
+    'strategy_extraction_enabled': True,    # Extract reasoning strategies
+    'cross_agent_enhancement': True,        # Enhanced solutions via peer strategies
+    # Thinking tags and answer extraction
+    'use_thinking_tags': True,              # Wrap reasoning in <think> tags
+    'answer_extraction_mode': 'auto',       # 'auto', 'code', 'math', or 'none'
 }
 ```
 
+### Lightweight Configuration (Coding Benchmarks)
+```python
+LIGHTWEIGHT_CONFIG = {
+    'num_agents': 2,                        # Reduced agent count
+    'max_iterations': 2,                    # Faster iteration limit
+    'verification_passes_required': 1,      # Single-pass verification
+    'consensus_threshold': 1,               # Lower threshold for 2 agents
+    'min_verified_solutions': 1,
+    'max_tokens': 4000,                     # Smaller token budget
+    'max_verification_attempts': 2,
+    'early_termination': True,
+    'use_reasoning_api': True,
+    # Disable expensive features for speed
+    'enable_aggregation': False,            # Skip RSA aggregation
+    'enable_strategy_network': False,       # Skip strategy network
+    'strategy_extraction_enabled': False,
+    'cross_agent_enhancement': False,
+    # Thinking tags still enabled
+    'use_thinking_tags': True,
+    'answer_extraction_mode': 'auto',
+}
+```
+
+**Note**: MARS automatically uses lightweight config when `max_tokens ≤ 4000` in the request.
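
The selection rule in the note above might look roughly like the following. This is a minimal sketch, not the actual MARS code: `select_config` is a hypothetical helper, the dictionaries are abbreviated, and falling back to the default when the request sets no `max_tokens` is an assumption.

```python
# Threshold taken from the note above; everything else here is illustrative.
LIGHTWEIGHT_MAX_TOKENS = 4000

# Abbreviated copies of the two configurations shown above.
DEFAULT_CONFIG = {'num_agents': 3, 'verification_passes_required': 2, 'max_tokens': 64000}
LIGHTWEIGHT_CONFIG = {'num_agents': 2, 'verification_passes_required': 1, 'max_tokens': 4000}


def select_config(request_max_tokens=None):
    """Pick the lightweight config for small token budgets, else the default.

    Hypothetical helper: a missing max_tokens is assumed to mean "use default".
    """
    if request_max_tokens is not None and request_max_tokens <= LIGHTWEIGHT_MAX_TOKENS:
        return dict(LIGHTWEIGHT_CONFIG)  # copy so callers can modify it safely
    return dict(DEFAULT_CONFIG)
```

Returning a copy keeps per-request overrides from leaking into the module-level dictionaries.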
+
 ## Usage
 
 ### Via OptiLLM Server
@@ -114,29 +155,47 @@ response = client.chat.completions.create(
 
 ## Process Flow
 
-### Phase 1: Multi-Agent Exploration
-1. Initialize 5 agents with diverse temperatures
+### Phase 1: Multi-Agent Exploration (Parallel)
+1. Initialize 3 agents with diverse temperatures (0.3, 0.6, 1.0)
 2. Each agent independently analyzes the problem
-3. Generate initial solutions using OpenRouter reasoning API
-4. Solutions stored in shared workspace
-
-### Phase 2: Verification System
+3. Generate initial solutions using OpenRouter reasoning API with effort levels
+4. All agent API calls executed in parallel via ThreadPoolExecutor
+5. Solutions stored in shared workspace
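
The parallel exploration step can be sketched with the standard library's `ThreadPoolExecutor`. This is an illustration only: `generate_solution` is a placeholder standing in for the real OpenRouter call, and `explore` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

# Default temperatures from the list above.
TEMPERATURES = [0.3, 0.6, 1.0]


def generate_solution(agent_id, temperature, problem):
    # Placeholder for the real API call; returns a tagged string instead.
    return f"agent {agent_id} (T={temperature}): solution to {problem!r}"


def explore(problem, temperatures=TEMPERATURES):
    """Run one solution attempt per agent concurrently and keep agent order."""
    with ThreadPoolExecutor(max_workers=len(temperatures)) as pool:
        futures = [
            pool.submit(generate_solution, agent_id, temp, problem)
            for agent_id, temp in enumerate(temperatures)
        ]
        # Collect in submission order so results line up with agent ids.
        return [future.result() for future in futures]
```

Threads suit this step because each agent call is I/O-bound (waiting on an HTTP response) rather than CPU-bound.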
+
+### Phase 2a: RSA-Inspired Aggregation (Optional, Parallel)
+1. Maintain population of N=6 solutions for diversity
+2. Select K=3 solutions for aggregation
+3. Run T=3 aggregation loops to refine solutions
+4. Parallel execution of aggregation API calls
+5. Enhanced solutions added back to workspace
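
One way to read the N/K/T parameters above is as a resampling loop: in each of T rounds, every population slot is refilled by aggregating K randomly chosen candidates. The sketch below assumes that reading. `aggregate_population` and the `aggregate` callable are hypothetical, and how MARS actually selects candidates is not specified here.

```python
import random


def aggregate_population(population, aggregate, k=3, loops=3, seed=0):
    """Refine a population by repeatedly aggregating random k-subsets.

    `aggregate` is a caller-supplied function mapping a list of k candidate
    solutions to one refined solution (in MARS this would be an LLM call).
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    for _ in range(loops):
        # Build the next generation: one aggregated solution per slot.
        population = [
            aggregate(rng.sample(population, k))
            for _ in range(len(population))
        ]
    return population
```

For example, `aggregate_population([1, 2, 3, 4, 5, 6], max)` keeps the population at six entries while each round biases it toward the higher values, which mirrors how aggregation is meant to concentrate on stronger solutions.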
+
+### Phase 2b: Cross-Agent Strategy Network (Optional, Parallel)
+1. Extract reasoning strategies from agent solutions
+2. Identify successful patterns and techniques
+3. Share strategies across agents
+4. Generate enhanced solutions using peer insights
+5. Parallel execution of strategy extraction and enhancement
+
+### Phase 3: Verification System (Parallel)
 1. Cross-agent verification of all solutions
-2. Each solution requires 5 consecutive "CORRECT" assessments
+2. Each solution requires 2 consecutive "CORRECT" assessments (configurable)
 3. Verification feedback captured for improvement
 4. Solutions marked as verified/unverified
+5. Parallel execution of verification calls
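
The consecutive-pass rule can be expressed as a streak counter. In this sketch `is_verified` is a hypothetical helper, and treating any verdict other than "CORRECT" as a reset is an assumption.

```python
def is_verified(assessments, passes_required=2):
    """Return True once `passes_required` consecutive "CORRECT" verdicts occur."""
    streak = 0
    for verdict in assessments:
        # Any non-"CORRECT" verdict resets the streak (assumed behavior).
        streak = streak + 1 if verdict == "CORRECT" else 0
        if streak >= passes_required:
            return True
    return False
```

With the lightweight configuration, `passes_required=1` reduces this to a single successful check.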
 
-### Phase 3: Iterative Improvement
+### Phase 4: Iterative Improvement (Parallel)
 1. Unverified solutions improved based on feedback
 2. Agents address specific issues identified in verification
 3. Re-verification of improved solutions
-4. Process continues until consensus or max iterations
+4. Process continues until consensus or max iterations (5 default)
+5. Parallel execution of improvement and verification
 
-### Phase 4: Final Synthesis
-1. Best verified solution selected as final answer
-2. If no verified solutions, synthesis from all attempts
-3. High-effort reasoning applied to synthesis
-4. Complete solution with mathematical rigor
+### Phase 5: Final Synthesis
+1. **Numerical voting**: If 2+ agents agree on the same numerical answer, use that solution
+2. **Best verified solution**: Otherwise, select highest-scoring verified solution
+3. **Synthesis**: If no verified solution, synthesize from top 3 solutions
+4. **Answer extraction**: Apply thinking tags and extract clean answer (if enabled)
+5. Complete solution with mathematical rigor
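
The selection order in this phase can be sketched as a three-step fallback. The record layout (`answer`, `verified`, `score`) and the name `select_final` are assumptions made for illustration; the actual synthesis step would pass the top solutions to the model rather than return them.

```python
from collections import Counter


def select_final(solutions, min_votes=2):
    """Return (mode, payload) following the vote -> verified -> synthesize order.

    Each solution is a dict with 'answer' (numeric or None), 'verified' (bool)
    and 'score' (float). This layout is hypothetical.
    """
    # 1. Numerical voting: use an answer shared by at least `min_votes` agents.
    votes = Counter(s['answer'] for s in solutions if s['answer'] is not None)
    if votes:
        answer, count = votes.most_common(1)[0]
        if count >= min_votes:
            return 'vote', next(s for s in solutions if s['answer'] == answer)

    # 2. Otherwise take the highest-scoring verified solution.
    verified = [s for s in solutions if s['verified']]
    if verified:
        return 'verified', max(verified, key=lambda s: s['score'])

    # 3. Otherwise hand the top three solutions to a synthesis step.
    top = sorted(solutions, key=lambda s: s['score'], reverse=True)[:3]
    return 'synthesize', top
```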
 
 ## Evaluation
 
@@ -163,9 +222,8 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
 |-----------|----------|----------|---------|----------|-------|
 | **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
 | **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** |
-| **IMO 2025** | Baseline | 6 | 3 | 50.0% | Problems 2, 4 & 5 correct |
-| **IMO 2025** | MARS (w/ thinking) | 6 | 0 | 0.0% | Thinking tags hid proofs |
-| **IMO 2025** | MARS (fixed) | 6 | TBD | TBD% | Proof visibility fixes needed |
+| **IMO 2025** | Baseline (lite) | 6 | 1 | 16.7% | Problem 4 correct |
+| **IMO 2025** | MARS (lite) | 6 | 2 | 33.3% | **+1 problem (+16.7pp)** |
 | **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
 | **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+29.3%)** |
 
@@ -175,7 +233,16 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
 - **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
 - **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points
 - **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems
-- **Approach**: 5 agents with diverse temperatures, iterative verification and refinement
+- **Approach**: 3 agents with diverse temperatures, iterative verification and refinement
+
+#### IMO 2025: Proof-Based Competition Problems
+
+- **Results**: 2/6 problems solved (33.3%) vs baseline 1/6 (16.7%)
+- **Improvement**: +1 problem (+100% relative improvement), +16.7 percentage points
+- **Problems Solved**: Problem 2 (geometry proof) + Problem 4 (number theory)
+- **Runtime**: ~10 minutes per problem (vs ~40 seconds baseline)
+- **Key Success Factor**: Multi-agent exploration with disabled thinking tags allows full proof visibility
+- **Configuration**: `use_thinking_tags=False`, `answer_extraction_mode="none"` for proof problems
 
 #### LiveCodeBench: Strong Performance with Thinking Tags
 - **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
@@ -184,14 +251,26 @@ Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via Open
 - **Key Success Factor**: Thinking tags beneficial for code generation - allows agents to reason through logic before writing code
 - **Multi-agent benefit**: Different temperature agents explore varied solution approaches
 
-#### IMO 2025 Proof-Based Problems
-- **Initial Challenge**: MARS scored lower than baseline (0/6 vs 3/6, baseline solved problems 2, 4, 5)
-- **Root Cause**: Thinking tags hid 80-85% of proof content from evaluator - proofs inside `<think>` tags not visible
-- **Solution**: Disable thinking tags for proof-based problems via `mars_config`
-- **Status**: Re-evaluation needed with proof visibility fixes
-- **Key Lesson**: Thinking tags are **problem-type dependent** - helpful for code/numerical, harmful for proofs
+#### Lessons Learned
+1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
+2. **MARS improves proof-based problems**: +100% relative improvement on IMO 2025 (16.7% → 33.3%)
+3. **Thinking tags are problem-type dependent**:
+   - ✅ **Enable for code generation**: +29.3% relative improvement on LiveCodeBench
+   - ✅ **Enable for numerical problems**: Multi-agent reasoning effective on AIME
+   - ❌ **Disable for proof problems**: IMO proofs need full visibility to evaluators
+4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
+5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
 
-#### Configuration for IMO Problems
+### Completed Evaluations
+
+- ✅ **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
+- ✅ **IMO 2025**: Baseline 1/6 (16.7%) → MARS 2/6 (33.3%) **+16.7pp improvement**
+- ✅ **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**
+
+*All evaluations use the `google/gemini-2.5-flash-lite-preview-09-2025` model via OpenRouter.*
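
The improvement figures can be reproduced from the raw counts. The helper below is only a worked check (`improvement` is a hypothetical name). It computes from counts rather than from rounded percentages, so the last digit may differ slightly from a subtraction of the rounded values.

```python
def improvement(baseline_correct, mars_correct, total):
    """Return (percentage-point gain, relative gain in %) from raw counts."""
    baseline_pct = 100 * baseline_correct / total
    mars_pct = 100 * mars_correct / total
    absolute_pp = round(mars_pct - baseline_pct, 2)
    relative_pct = round(100 * (mars_correct - baseline_correct) / baseline_correct, 1)
    return absolute_pp, relative_pct


print(improvement(13, 22, 30))    # AIME 2025
print(improvement(1, 2, 6))       # IMO 2025
print(improvement(41, 53, 105))   # LiveCodeBench v5/v6
```

Percentage points measure the change in accuracy itself, while the relative figure measures the change in the number of problems solved against the baseline count.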
+
+### Configuration for IMO Proof Problems
+For proof-based problems like IMO, disable thinking tags to ensure full proof visibility:
 ```python
 extra_body = {
     "optillm_approach": "mars",
@@ -202,39 +281,23 @@ extra_body = {
 }
 ```
 
-#### Lessons Learned
-1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
-2. **Thinking tags are problem-type dependent**:
-   - ✅ **Enable for code generation**: +29.3% improvement on LiveCodeBench
-   - ✅ **Enable for numerical problems**: Multi-agent reasoning effective on AIME
-   - ❌ **Disable for mathematical proofs**: Hides critical reasoning from evaluators
-3. **Answer extraction** must be disabled for proof-based problems - the proof IS the answer
-4. **Multi-agent diversity** provides significant value - different temperature agents explore complementary approaches
-5. **Code extraction rate** is a leading indicator - MARS achieved 82.9% vs baseline 51.4% (+61.1%)
+*All evaluations use the pass@1 accuracy metric.*
 
-### Completed Evaluations (google/gemini-2.5-flash-lite-preview-09-2025)
-- ✅ **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%) **+30pp improvement**
-- ✅ **IMO 2025**: Baseline 3/6 (50.0%), MARS with thinking tags 0/6 (0.0% - proofs hidden)
-- ✅ **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%) **+11.43pp improvement**
+## Implementation Details
 
-### Ongoing Work
-- 🔄 IMO 2025 MARS re-evaluation with proof visibility fixes (disable thinking tags)
+### Temperature Diversity Strategy (3-Agent Default)
+- **Agent 0**: Temperature 0.3 (Conservative, rigorous, low effort)
+- **Agent 1**: Temperature 0.6 (Balanced approach, medium effort)
+- **Agent 2**: Temperature 1.0 (Maximum exploration, high effort)
 
-*All evaluations use pass@1 accuracy metric.*
+**Note**: Temperature assignments cycle for configurations with more agents (e.g., 5 agents: 0.3, 0.6, 1.0, 0.3, 0.6)
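
The cycling rule in the note above amounts to indexing the base list modulo its length. A minimal sketch, with `assign_temperatures` as a hypothetical helper name:

```python
# Base temperatures for the 3-agent default described above.
BASE_TEMPERATURES = [0.3, 0.6, 1.0]


def assign_temperatures(num_agents):
    """Cycle through the base temperatures for any number of agents."""
    return [BASE_TEMPERATURES[i % len(BASE_TEMPERATURES)] for i in range(num_agents)]


print(assign_temperatures(5))  # [0.3, 0.6, 1.0, 0.3, 0.6]
```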
 
-## Implementation Details
+### Reasoning Effort Allocation (OpenRouter API)
+- **Low effort** (temp ≤ 0.4): `{"reasoning": {"effort": "low"}}` - Conservative reasoning
+- **Medium effort** (0.4 < temp ≤ 0.8): `{"reasoning": {"effort": "medium"}}` - Balanced reasoning
+- **High effort** (temp > 0.8): `{"reasoning": {"effort": "high"}}` - Maximum reasoning depth
 
-### Temperature Diversity Strategy
-- **Agent 0**: Temperature 0.3 (Conservative, rigorous)
-- **Agent 1**: Temperature 0.5 (Balanced approach)
-- **Agent 2**: Temperature 0.7 (Creative exploration)
-- **Agent 3**: Temperature 0.9 (High creativity)
-- **Agent 4**: Temperature 1.0 (Maximum exploration)
-
-### Reasoning Budget Allocation
-- **Low effort (temp ≤ 0.4)**: 20% of reasoning budget
-- **Medium effort (0.4 < temp ≤ 0.7)**: 50% of reasoning budget
-- **High effort (temp > 0.7)**: 80% of reasoning budget
+**Note**: OpenRouter's reasoning API automatically allocates appropriate thinking tokens based on effort level and model capabilities.
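
The temperature thresholds above map directly onto the request fragment shown in each bullet. A minimal sketch of that mapping, where `reasoning_params` is a hypothetical helper name:

```python
def reasoning_params(temperature):
    """Map an agent temperature to the reasoning fragment listed above."""
    if temperature <= 0.4:
        effort = "low"
    elif temperature <= 0.8:
        effort = "medium"
    else:
        effort = "high"
    return {"reasoning": {"effort": effort}}
```

With the default temperatures, agents 0, 1, and 2 receive low, medium, and high effort respectively.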
 
 ### Verification Criteria
 Solutions are verified based on: