@@ -467,31 +467,6 @@ Authorization: Bearer your_secret_api_key
 
 ## SOTA results on benchmarks with optillm
 
-### CePO on math and code benchmarks (Mar 2025)
-
-| Method | Math-L5 | MMLU-Pro (Math) | CRUX | LiveCodeBench (pass@1) | Simple QA |
-| -----------------------------: | :-----: | :-------------: | :----: | :--------------------: | :-------: |
-| Llama 3.3 70B | 51.0 | 78.6 | 72.6 | 27.1 | 20.9 |
-| Llama 3.1 405B | 49.8 | 79.2 | 73.0 | 31.8 | 13.5 |
-| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 80.1 | 31.9 | **22.6** |
-| QwQ 32B | 61.4 | 90.8 | 82.5 | 44.3 | 7.8 |
-| CePO (using QwQ 32B) | 88.1 | **92.0** | 86.3 | **51.5** | 8.2 |
-| DeepSeek R1 Llama | 83.1 | 82.0 | 84.0 | 47.3 | 14.6 |
-| CePO (using DeepSeek R1 Llama) | **90.2** | 84.0 | **89.4** | 47.2 | 15.5 |
-
-### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
-
-| Model | Score |
-| ------ | -----: |
-| o1-mini | 56.67 |
-| coc-claude-3-5-sonnet-20241022 | 46.67 |
-| coc-gemini/gemini-exp-1121 | 46.67 |
-| o1-preview | 40.00 |
-| gemini-exp-1114 | 36.67 |
-| claude-3-5-sonnet-20241022 | 20.00 |
-| gemini-1.5-pro-002 | 20.00 |
-| gemini-1.5-flash-002 | 16.67 |
-
 ### LongCePO on LongBench v2 (Apr 2025)
 
 | Model¹ | Context window | Short samples (up to 32K words) | Medium samples (32–128K words) |
@@ -518,6 +493,31 @@ Authorization: Bearer your_secret_api_key
 
 ¹ Numbers in parentheses for LongCePO indicate accuracy of majority voting from 5 runs.
 
+### CePO on math and code benchmarks (Mar 2025)
+
+| Method | Math-L5 | MMLU-Pro (Math) | CRUX | LiveCodeBench (pass@1) | Simple QA |
+| -----------------------------: | :-----: | :-------------: | :----: | :--------------------: | :-------: |
+| Llama 3.3 70B | 51.0 | 78.6 | 72.6 | 27.1 | 20.9 |
+| Llama 3.1 405B | 49.8 | 79.2 | 73.0 | 31.8 | 13.5 |
+| CePO (using Llama 3.3 70B) | 69.6 | 84.8 | 80.1 | 31.9 | **22.6** |
+| QwQ 32B | 61.4 | 90.8 | 82.5 | 44.3 | 7.8 |
+| CePO (using QwQ 32B) | 88.1 | **92.0** | 86.3 | **51.5** | 8.2 |
+| DeepSeek R1 Llama | 83.1 | 82.0 | 84.0 | 47.3 | 14.6 |
+| CePO (using DeepSeek R1 Llama) | **90.2** | 84.0 | **89.4** | 47.2 | 15.5 |
+
+### coc-claude-3-5-sonnet-20241022 on AIME 2024 pass@1 (Nov 2024)
+
+| Model | Score |
+| ------ | -----: |
+| o1-mini | 56.67 |
+| coc-claude-3-5-sonnet-20241022 | 46.67 |
+| coc-gemini/gemini-exp-1121 | 46.67 |
+| o1-preview | 40.00 |
+| gemini-exp-1114 | 36.67 |
+| claude-3-5-sonnet-20241022 | 20.00 |
+| gemini-1.5-pro-002 | 20.00 |
+| gemini-1.5-flash-002 | 16.67 |
+
 ### readurls&memory-gpt-4o-mini on Google FRAMES Benchmark (Oct 2024)
 | Model | Accuracy |
 | ----- | -------- |