Commit 780f9d3

Commit message: f

1 parent 2e35fbb commit 780f9d3

File tree

2 files changed: +71 −1 lines changed

optillm/mars/README.md

Lines changed: 70 additions & 0 deletions
@@ -144,6 +144,7 @@ MARS is designed to excel on challenging mathematical benchmarks:

- **IMO (International Mathematical Olympiad)**: Complex proof-based problems
- **AIME (American Invitational Mathematics Examination)**: Numerical competition problems
- **LiveCodeBench**: Competitive programming challenges
- **Mathematical reasoning tasks**: General problem-solving capabilities

### Performance Metrics
@@ -152,6 +153,75 @@ MARS is designed to excel on challenging mathematical benchmarks:
- **Reasoning Efficiency**: Tokens used per correct solution
- **Consensus Quality**: Agreement between verified solutions

## Benchmark Results

### Gemini 2.5 Flash Lite Preview Model

Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via OpenRouter:

| Benchmark | Approach | Problems | Correct | Accuracy | Notes |
|-----------|----------|----------|---------|----------|-------|
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30pp)** |
| **IMO 2025** | Baseline | 6 | 3 | 50.0% | Problems 2, 4 & 5 correct |
| **IMO 2025** | MARS (w/ thinking) | 6 | 0 | 0.0% | Thinking tags hid proofs |
| **IMO 2025** | MARS (fixed) | 6 | TBD | TBD | Proof visibility fixes needed |
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+29.3%)** |

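A single MARS run against this model can be reproduced with the standard OpenAI client pointed at an optillm proxy. This is a minimal sketch only: the base URL, API key, and prompt are placeholder assumptions, not the exact evaluation harness.

```python
from openai import OpenAI

# Assumed local optillm endpoint and placeholder key -- adjust for your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-placeholder")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite-preview-09-2025",
    messages=[{"role": "user", "content": "Solve the competition problem: ..."}],
    extra_body={"optillm_approach": "mars"},  # route the request through MARS
)
print(response.choices[0].message.content)
```
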
### Key Findings

#### AIME 2025: Significant Accuracy Improvement
- **Results**: 22/30 problems solved (73.3%) vs baseline 13/30 (43.3%)
- **Improvement**: +9 problems (+69.2% relative improvement), +30.0 percentage points
- **Key Success Factor**: Multi-agent collaboration with verification effectively solves numerical competition problems
- **Approach**: 5 agents with diverse temperatures, plus iterative verification and refinement (sketched below)

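The fan-out across agents can be pictured as follows. This is an illustrative sketch, not MARS's internal code; the helper name and temperature values are hypothetical.

```python
# Illustrative only: one problem fanned out to several agents, each
# sampling at a different temperature to diversify solution attempts.
TEMPERATURES = [0.3, 0.5, 0.7, 0.9, 1.1]  # one per agent (5 agents)

def run_agents(client, model, problem):
    drafts = []
    for temp in TEMPERATURES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": problem}],
            temperature=temp,
        )
        drafts.append(resp.choices[0].message.content)
    # Downstream, a MARS-style pipeline verifies each draft and
    # synthesizes a consensus answer from those that pass verification.
    return drafts
```
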
#### LiveCodeBench: Strong Performance with Thinking Tags
- **Results**: 53/105 problems solved (50.48%) vs baseline 41/105 (39.05%)
- **Improvement**: +12 problems (+29.3% relative improvement), +11.43 percentage points
- **Code Extraction**: 87/105 (82.9%) vs baseline 54/105 (51.4%), a **+61.1% improvement**
- **Key Success Factor**: Thinking tags are beneficial for code generation: they let agents reason through the logic before writing code
- **Multi-agent benefit**: Agents at different temperatures explore varied solution approaches

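For code-generation workloads the corresponding configuration simply leaves thinking tags on. A sketch, assuming the same `mars_config` mechanism shown in the IMO configuration below:

```python
# Sketch: mirror of the IMO config below, but with thinking tags kept
# enabled for code-generation tasks.
extra_body = {
    "optillm_approach": "mars",
    "mars_config": {
        "use_thinking_tags": True,  # let agents reason before emitting code
    },
}
```
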
#### IMO 2025 Proof-Based Problems
- **Initial Challenge**: MARS scored lower than baseline (0/6 vs 3/6; the baseline solved problems 2, 4, and 5)
- **Root Cause**: Thinking tags hid 80-85% of proof content from the evaluator; proofs inside `<think>` tags were not visible
- **Solution**: Disable thinking tags for proof-based problems via `mars_config`
- **Status**: Re-evaluation needed with proof visibility fixes
- **Key Lesson**: Thinking tags are **problem-type dependent**: helpful for code and numerical work, harmful for proofs

#### Configuration for IMO Problems

```python
extra_body = {
    "optillm_approach": "mars",
    "mars_config": {
        "use_thinking_tags": False,       # Full proof visibility
        "answer_extraction_mode": "none"  # Proofs are the answer
    }
}
```

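Passed as `extra_body` on an ordinary chat-completion request (as in the client sketch above), this configuration should return the full synthesized proof in the response rather than a boxed extracted answer.
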
205+
#### Lessons Learned
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
2. **Thinking tags are problem-type dependent**:
   - **Enable for code generation**: +29.3% improvement on LiveCodeBench
   - **Enable for numerical problems**: Multi-agent reasoning is effective on AIME
   - **Disable for mathematical proofs**: Hides critical reasoning from evaluators
3. **Answer extraction** must be disabled for proof-based problems: the proof IS the answer
4. **Multi-agent diversity** provides significant value: agents at different temperatures explore complementary approaches
5. **Code extraction rate** is a leading indicator: MARS achieved 82.9% vs baseline 51.4% (+61.1%)

### Completed Evaluations (google/gemini-2.5-flash-lite-preview-09-2025)
- **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%), **+30pp improvement**
- **IMO 2025**: Baseline 3/6 (50.0%); MARS with thinking tags 0/6 (0.0%, proofs hidden)
- **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%), **+11.43pp improvement**

### Ongoing Work
- 🔄 IMO 2025 MARS re-evaluation with proof visibility fixes (disable thinking tags)

*All evaluations use the pass@1 accuracy metric.*

## Implementation Details

### Temperature Diversity Strategy

optillm/mars/prompts.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,7 @@
7474
4. Ensure logical rigor and completeness
7575
5. Provide a clear, well-structured final answer
7676
6. CRITICAL: If multiple agents extracted the same numerical answer, prioritize that answer in your synthesis
77-
7. Format your final answer clearly (use \\boxed{answer} for mathematical answers when appropriate)
77+
7. Format your final answer clearly (use \\boxed{{answer}} for mathematical answers when appropriate)
7878
7979
Important: Preserve the depth and detail needed for complex problems. Do not over-condense - maintain all critical reasoning steps and justifications. If agents have extracted specific numerical answers, ensure these are preserved and clearly formatted in your final response.
8080
