MARS is designed to excel on challenging mathematical benchmarks:

- **IMO (International Mathematical Olympiad)**: Complex proof-based problems
- **AIME (American Invitational Mathematics Examination)**: Numerical competition problems
- **LiveCodeBench**: Competitive programming challenges
- **Mathematical reasoning tasks**: General problem-solving capabilities

### Performance Metrics

- **Reasoning Efficiency**: Tokens used per correct solution
- **Consensus Quality**: Agreement between verified solutions

## Benchmark Results

### Gemini 2.5 Flash Lite Preview Model

Evaluation results using `google/gemini-2.5-flash-lite-preview-09-2025` via OpenRouter:

| Benchmark | Approach | Problems | Correct | Accuracy | Notes |
|-----------|----------|----------|---------|----------|-------|
| **AIME 2025** | Baseline | 30 | 13 | 43.3% | Pass@1, max_tokens=4000 |
| **AIME 2025** | MARS | 30 | 22 | 73.3% | **+9 problems (+30.0pp)** |
| **IMO 2025** | Baseline | 6 | 3 | 50.0% | Problems 2, 4 & 5 correct |
| **IMO 2025** | MARS (w/ thinking) | 6 | 0 | 0.0% | Thinking tags hid proofs |
| **IMO 2025** | MARS (fixed) | 6 | TBD | TBD | Proof visibility fixes needed |
| **LiveCodeBench v5/v6** | Baseline | 105 | 41 | 39.05% | Code generation, pass@1 |
| **LiveCodeBench v5/v6** | MARS + Thinking | 105 | 53 | 50.48% | **+12 problems (+11.43pp)** |

### Key Findings

#### AIME 2025: Significant Accuracy Improvement
- **Results**: 22/30 problems solved (73.3%) vs. baseline 13/30 (43.3%)
- **Improvement**: +9 problems (+69.2% relative improvement, +30.0 percentage points)
- **Key Success Factor**: Multi-agent collaboration with verification is highly effective on numerical competition problems
- **Approach**: 5 agents with diverse temperatures, plus iterative verification and refinement (a sketch of the temperature spread follows this list)

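The temperature spread could be generated along these lines. This is a hedged sketch: the agent count, bounds, and even spacing are illustrative assumptions, not MARS's exact values.

```python
# Illustrative only: spread N agents evenly across a temperature range so each
# agent samples the solution space with a different degree of randomness.
def temperature_ladder(num_agents: int = 5, low: float = 0.3, high: float = 1.0) -> list[float]:
    """Return one sampling temperature per agent, evenly spaced in [low, high]."""
    if num_agents == 1:
        return [low]
    step = (high - low) / (num_agents - 1)
    return [low + i * step for i in range(num_agents)]

# Five agents spanning conservative to exploratory sampling,
# e.g. [0.3, 0.475, 0.65, 0.825, 1.0] (up to float rounding).
print(temperature_ladder())
```
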
#### LiveCodeBench: Strong Performance with Thinking Tags
- **Results**: 53/105 problems solved (50.48%) vs. baseline 41/105 (39.05%)
- **Improvement**: +12 problems (+29.3% relative improvement, +11.43 percentage points)
- **Code Extraction**: 87/105 (82.9%) vs. baseline 54/105 (51.4%), a +61.1% relative improvement (see the sketch after this list)
- **Key Success Factor**: Thinking tags are beneficial for code generation, letting agents reason through the logic before writing code
- **Multi-agent benefit**: Agents at different temperatures explore varied solution approaches

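The code-extraction rate measures how often a runnable program can be pulled out of a model's free-form response at all. A minimal sketch of such an extractor is shown here; it is an illustrative implementation, not necessarily the one MARS uses.

```python
import re

def extract_code(response: str) -> str | None:
    """Pull the last fenced code block out of a model response.

    Thinking tags are stripped first so that reasoning inside
    <think>...</think> is never mistaken for the final program.
    """
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Match ```python ... ``` fences (or bare ``` fences) and keep the last one,
    # since models typically emit the final program after their reasoning.
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", visible, flags=re.DOTALL)
    return blocks[-1].strip() if blocks else None
```
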
#### IMO 2025 Proof-Based Problems
- **Initial Challenge**: MARS scored lower than baseline (0/6 vs. 3/6; the baseline solved problems 2, 4, and 5)
- **Root Cause**: Thinking tags hid 80-85% of proof content from the evaluator; proofs inside `<think>` tags were not visible
- **Solution**: Disable thinking tags for proof-based problems via `mars_config`
- **Status**: Re-evaluation needed with proof visibility fixes
- **Key Lesson**: Thinking tags are **problem-type dependent**: helpful for code and numerical work, harmful for proofs

#### Configuration for IMO Problems
```python
extra_body = {
    "optillm_approach": "mars",
    "mars_config": {
        "use_thinking_tags": False,        # Full proof visibility
        "answer_extraction_mode": "none"   # Proofs are the answer
    }
}
```

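For context, this configuration would be passed through an OpenAI-compatible request. The sketch below assumes an optillm proxy listening at `http://localhost:8000/v1`; the base URL, API key, and problem text are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint/key: point these at your optillm proxy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite-preview-09-2025",
    messages=[{"role": "user", "content": "Prove that ... (IMO problem statement)"}],
    extra_body={
        "optillm_approach": "mars",
        "mars_config": {
            "use_thinking_tags": False,        # full proof visibility
            "answer_extraction_mode": "none",  # the proof is the answer
        },
    },
)
print(response.choices[0].message.content)
```
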
#### Lessons Learned
1. **MARS excels at numerical competition problems**: +69.2% relative improvement on AIME 2025 (43.3% → 73.3%)
2. **Thinking tags are problem-type dependent**:
   - ✅ **Enable for code generation**: +29.3% relative improvement on LiveCodeBench
   - ✅ **Enable for numerical problems**: Multi-agent reasoning is effective on AIME
   - ❌ **Disable for mathematical proofs**: They hide critical reasoning from evaluators
3. **Answer extraction** must be disabled for proof-based problems: the proof IS the answer
4. **Multi-agent diversity** provides significant value: agents at different temperatures explore complementary approaches
5. **Code extraction rate** is a leading indicator: MARS reached 82.9% vs. baseline 51.4% (+61.1% relative; the helper after this list shows how these figures are computed)

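Because the results above quote both absolute percentage-point deltas and relative improvements, the small helper below makes the two explicit; the inputs are the correct/total counts from the benchmark table.

```python
def improvement(baseline_correct: int, mars_correct: int, total: int) -> tuple[float, float]:
    """Return (percentage-point delta, relative improvement in percent)."""
    pp_delta = (mars_correct - baseline_correct) / total * 100
    relative = (mars_correct - baseline_correct) / baseline_correct * 100
    return round(pp_delta, 2), round(relative, 1)

print(improvement(13, 22, 30))   # AIME 2025:     (30.0, 69.2)
print(improvement(41, 53, 105))  # LiveCodeBench: (11.43, 29.3)
```
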
### Completed Evaluations (google/gemini-2.5-flash-lite-preview-09-2025)
- ✅ **AIME 2025**: Baseline 13/30 (43.3%) → MARS 22/30 (73.3%), a **+30.0pp improvement**
- ✅ **IMO 2025**: Baseline 3/6 (50.0%); MARS with thinking tags 0/6 (0.0%, proofs hidden)
- ✅ **LiveCodeBench v5/v6**: Baseline 41/105 (39.05%) → MARS 53/105 (50.48%), a **+11.43pp improvement**

### Ongoing Work
- 🔄 IMO 2025 MARS re-evaluation with proof visibility fixes (thinking tags disabled)

*All evaluations use the pass@1 accuracy metric.*

## Implementation Details

### Temperature Diversity Strategy