Commit 873c210

Merge pull request #1 from Technolog796/master
Add new models and task
2 parents 6f88d85 + 621f92d commit 873c210

20 files changed, 3980 additions and 402 deletions

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
venv
2+
results/cache
3+
src/__pycache__
4+
configs/run.yaml
5+
results/details/

readme.md

Lines changed: 57 additions & 8 deletions
@@ -1,28 +1,77 @@
11
# DOOM Deadly Olympiad of Math, most codebase from openai simpleeval
22

3+
This is a benchmark for evaluating the quality of language models on math problems.
34

45
=== LEADERBOARD ===
56

6-
| Model | Score | Tokens Used | System Prompt | Evaluation Time | Details
7-
|-------|--------|-------------|---------------|----------------|----------
8-
| o3-mini-2025-01-31 | 0.442 | 0 | You are a helpful math assi... | 595.1s | [Details](details/o3-mini-2025-01-31/details_20250408_121611.md)
9-
| gpt-4o | 0.353 | 0 | You are a helpful math assi... | 515.1s | [Details](details/gpt-4o/details_20250408_115501.md)
10-
| gpt-4o-mini | 0.311 | 0 | You are a helpful math assi... | 682.2s | [Details](details/gpt-4o-mini/details_20250408_115501.md)
7+
| Model | Score | Tokens Used | System Prompt | Evaluation Time | Dataset | Details
8+
|-------|--------|-------------|---------------|----------------|---------|----------
9+
| o3-mini-2025-01-31 | 0.400 | 0 | You are a helpful math assi... | 14.8s | RussianMath | [Details](details/o3-mini-2025-01-31/details_20250408_072911.md)
10+
| gpt-4o | 0.332 | 0 | You are a helpful math assi... | 486.6s | RussianMath | [Details](details/gpt-4o/details_20250409_235721.md)
11+
| gpt-4o-mini | 0.300 | 0 | You are a helpful math assi... | 504.3s | RussianMath | [Details](details/gpt-4o-mini/details_20250409_235721.md)
12+
| GigaChat-2-Max | 0.205 | 83643 | You are a helpful math assi... | 418.1s | RussianMath | [Details](details/GigaChat-2-Max/details_20250410_154315.md)
13+
| GigaChat-2-Pro | 0.195 | 87907 | You are a helpful math assi... | 374.4s | RussianMath | [Details](details/GigaChat-2-Pro/details_20250410_154315.md)
14+
| GigaChat-Max | 0.158 | 91274 | You are a helpful math assi... | 512.1s | RussianMath | [Details](details/GigaChat-Max/details_20250410_154315.md)
15+
| GigaChat-2 | 0.089 | 73978 | You are a helpful math assi... | 221.2s | RussianMath | [Details](details/GigaChat-2/details_20250410_154315.md)
1116

12-
## Run
17+
## Supported datasets
18+
19+
1. **RussianMath** - math problems in Russian (the main dataset)
20+
2. **MathDemon_Demidovich** - subsets of problems from Demidovich's textbook, including:
21+
- Approximation_by_Polynomials
22+
- Continuous_Functions
23+
- Convex_Functions
24+
- Differentiation
25+
- Improper_Integrals
26+
- Infinite_Series
27+
- Integration
28+
- Sequences_and_Limits
29+
- Series_of_Functions
30+
31+
## Running
32+
33+
### Basic run (all datasets)
1334

1435
```bash
1536
python runner.py
1637
```
1738

18-
## Config
39+
### Selecting a specific dataset
40+
41+
```bash
42+
python runner.py --dataset russianmath # RussianMath dataset only
43+
python runner.py --dataset mathdemon # MathDemon_Demidovich dataset only
44+
```
45+
46+
### Other parameters
47+
48+
```bash
49+
python runner.py --no-cache # Ignore the cache and re-run the evaluation
50+
python runner.py --max-workers 8 # Set the number of parallel workers
51+
python runner.py --config path/to/config.yaml # Use an alternative config file
52+
```
53+
54+
### Parameter help
55+
56+
```bash
57+
python runner.py --help
58+
```
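
The flags documented above could be parsed along the following lines. This is a minimal sketch only: the commit page does not show the source of `runner.py`, so `build_parser`, the `--max-workers` default of 4, and the `configs/run.yaml` default for `--config` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in the readme."""
    parser = argparse.ArgumentParser(prog="runner.py")
    # Dataset names come from the readme examples; leaving the flag out
    # is assumed to mean "run all datasets".
    parser.add_argument("--dataset", choices=["russianmath", "mathdemon"], default=None)
    # store_true makes args.no_cache False unless the flag is passed.
    parser.add_argument("--no-cache", action="store_true")
    # The default worker count is an assumption, not taken from the project.
    parser.add_argument("--max-workers", type=int, default=4)
    # The default config path is assumed from the readme's Configuration section.
    parser.add_argument("--config", default="configs/run.yaml")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["--dataset", "mathdemon", "--no-cache", "--max-workers", "8"])
```

With this argument list, `args.dataset` is `"mathdemon"`, `args.no_cache` is `True`, and `args.max_workers` is `8`.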
59+
60+
## Configuration
61+
62+
Configuration is done through YAML files in the `configs/` directory:
1963

2064
```yaml
2165
configs/run.yaml
2266
```
2367

24-
## Leaderboard
68+
## Generating the leaderboard
2569

70+
71+
After an evaluation run, the leaderboard is generated automatically.
72+
It is saved to `results/leaderboard.md`.
73+
=======
2674
```bash
2775
python leaderboard.py
2876
```
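
As a rough illustration of what leaderboard generation involves, the sketch below sorts result records by score and renders them as a Markdown table. It is not the project's `leaderboard.py`, whose source is not shown here: the function name `render_leaderboard`, the record keys, and the reduced column set are all hypothetical. The two sample rows reuse values from the readme table above.

```python
def render_leaderboard(rows):
    """Return a Markdown table of results sorted by score, highest first."""
    lines = [
        "| Model | Score | Evaluation Time | Dataset |",
        "|-------|-------|-----------------|---------|",
    ]
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        lines.append(
            f"| {row['model']} | {row['score']:.3f} | {row['seconds']:.1f}s | {row['dataset']} |"
        )
    return "\n".join(lines)


# Hypothetical record layout; the values are copied from the readme leaderboard.
rows = [
    {"model": "gpt-4o-mini", "score": 0.300, "seconds": 504.3, "dataset": "RussianMath"},
    {"model": "gpt-4o", "score": 0.332, "seconds": 486.6, "dataset": "RussianMath"},
]
table = render_leaderboard(rows)
```

Sorting before rendering keeps the highest-scoring model in the first data row regardless of the order in which results were collected.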
77+

requirements.txt

2.8 KB
Binary file not shown.

results/leaderboard.md

Lines changed: 53 additions & 9 deletions
@@ -1,11 +1,55 @@
11
# Math Evaluation Leaderboard
22

3-
Last updated: 2025-04-09 09:53:17
4-
5-
| Model | Score | Tokens Used | System Prompt | Evaluation Time | Details | Model Info |
6-
|-------|--------|-------------|---------------|----------------|----------|------------|
7-
| o3-mini-2025-01-31 | 0.442 | 0 | You are a helpful math assi... | 595.1s | [Details](details/o3-mini-2025-01-31/details_20250408_121611.md) | |
8-
| meta-llama/llama-4-scout | 0.374 | 0 | You are a helpful math assi... | 648.5s | [Details](details/meta-llama/llama-4-scout/details_20250409_073324.md) | |
9-
| gpt-4o | 0.353 | 0 | You are a helpful math assi... | 515.1s | [Details](details/gpt-4o/details_20250408_115501.md) | |
10-
| gpt-4o-mini | 0.311 | 0 | You are a helpful math assi... | 682.2s | [Details](details/gpt-4o-mini/details_20250408_115501.md) | |
11-
| deepseek/deepseek-chat-v3-0324 | 0.100 | 0 | You are a helpful math assi... | 6813.4s | [Details](details/deepseek/deepseek-chat-v3-0324/details_20250409_075941.md) | |
3+
| Model | Combined Score | RussianMath Score | MathDemon Score | Tokens Used | System Prompt | Evaluation Time | Dataset | Details |
4+
|-------|---------------|------------------|----------------|-------------|---------------|----------------|---------|----------|
5+
| gpt-4o-mini | 0.321 | 0.321 | 0.173 | 251078 | Вы - полезный помощник по м... | 950.2s | RussianMath, MathDemon | [RussianMath](details/gpt-4o-mini/details_20250413_204220.md), [MathDemon](details/gpt-4o-mini/details_20250413_205901.md) |
6+
| └─ Approximation_by_Polynomials | - | - | 0.429 | 5191 | - | 27.2s | MathDemon/Approximation_by_Polynomials | [Details](details/gpt-4o-mini/details_20250413_205329.md) |
7+
| └─ Continuous_Functions | - | - | 0.143 | 5994 | - | 22.2s | MathDemon/Continuous_Functions | [Details](details/gpt-4o-mini/details_20250413_205356.md) |
8+
| └─ Convex_Functions | - | - | 0.182 | 8705 | - | 29.9s | MathDemon/Convex_Functions | [Details](details/gpt-4o-mini/details_20250413_205430.md) |
9+
| └─ Differentiation | - | - | 0.111 | 8561 | - | 31.3s | MathDemon/Differentiation | [Details](details/gpt-4o-mini/details_20250413_205505.md) |
10+
| └─ Improper_Integrals | - | - | 0.111 | 8269 | - | 43.7s | MathDemon/Improper_Integrals | [Details](details/gpt-4o-mini/details_20250413_205553.md) |
11+
| └─ Infinite_Series | - | - | 0.154 | 10342 | - | 56.0s | MathDemon/Infinite_Series | [Details](details/gpt-4o-mini/details_20250413_205652.md) |
12+
| └─ Integration | - | - | 0.091 | 9718 | - | 44.0s | MathDemon/Integration | [Details](details/gpt-4o-mini/details_20250413_205740.md) |
13+
| └─ Sequences_and_Limits | - | - | 0.000 | 7275 | - | 28.3s | MathDemon/Sequences_and_Limits | [Details](details/gpt-4o-mini/details_20250413_205824.md) |
14+
| └─ Series_of_Functions | - | - | 0.333 | 9428 | - | 32.7s | MathDemon/Series_of_Functions | [Details](details/gpt-4o-mini/details_20250413_205901.md) |
15+
| GigaChat-2-Max | 0.195 | 0.195 | 0.095 | 123361 | Вы - полезный помощник по м... | 588.0s | RussianMath, MathDemon | [RussianMath](details/GigaChat-2-Max/details_20250413_204220.md), [MathDemon](details/GigaChat-2-Max/details_20250413_205901.md) |
16+
| └─ Approximation_by_Polynomials | - | - | 0.143 | 3942 | - | 17.8s | MathDemon/Approximation_by_Polynomials | [Details](details/GigaChat-2-Max/details_20250413_205319.md) |
17+
| └─ Continuous_Functions | - | - | 0.143 | 4018 | - | 17.6s | MathDemon/Continuous_Functions | [Details](details/GigaChat-2-Max/details_20250413_205350.md) |
18+
| └─ Convex_Functions | - | - | 0.000 | 2682 | - | 21.3s | MathDemon/Convex_Functions | [Details](details/GigaChat-2-Max/details_20250413_205420.md) |
19+
| └─ Differentiation | - | - | 0.111 | 5177 | - | 20.3s | MathDemon/Differentiation | [Details](details/GigaChat-2-Max/details_20250413_205454.md) |
20+
| └─ Improper_Integrals | - | - | 0.111 | 2988 | - | 23.5s | MathDemon/Improper_Integrals | [Details](details/GigaChat-2-Max/details_20250413_205532.md) |
21+
| └─ Infinite_Series | - | - | 0.154 | 6052 | - | 27.1s | MathDemon/Infinite_Series | [Details](details/GigaChat-2-Max/details_20250413_205624.md) |
22+
| └─ Integration | - | - | 0.000 | 3960 | - | 25.8s | MathDemon/Integration | [Details](details/GigaChat-2-Max/details_20250413_205722.md) |
23+
| └─ Sequences_and_Limits | - | - | 0.111 | 4647 | - | 20.5s | MathDemon/Sequences_and_Limits | [Details](details/GigaChat-2-Max/details_20250413_205816.md) |
24+
| └─ Series_of_Functions | - | - | 0.083 | 5043 | - | 28.0s | MathDemon/Series_of_Functions | [Details](details/GigaChat-2-Max/details_20250413_205855.md) |
25+
| GigaChat-2-Pro | 0.179 | 0.179 | 0.099 | 133525 | Вы - полезный помощник по м... | 578.7s | RussianMath, MathDemon | [RussianMath](details/GigaChat-2-Pro/details_20250413_204220.md), [MathDemon](details/GigaChat-2-Pro/details_20250413_205901.md) |
26+
| └─ Approximation_by_Polynomials | - | - | 0.000 | 1242 | - | 14.8s | MathDemon/Approximation_by_Polynomials | [Details](details/GigaChat-2-Pro/details_20250413_205316.md) |
27+
| └─ Continuous_Functions | - | - | 0.143 | 3801 | - | 18.2s | MathDemon/Continuous_Functions | [Details](details/GigaChat-2-Pro/details_20250413_205351.md) |
28+
| └─ Convex_Functions | - | - | 0.091 | 6691 | - | 23.2s | MathDemon/Convex_Functions | [Details](details/GigaChat-2-Pro/details_20250413_205422.md) |
29+
| └─ Differentiation | - | - | 0.333 | 4747 | - | 25.5s | MathDemon/Differentiation | [Details](details/GigaChat-2-Pro/details_20250413_205459.md) |
30+
| └─ Improper_Integrals | - | - | 0.000 | 4302 | - | 18.8s | MathDemon/Improper_Integrals | [Details](details/GigaChat-2-Pro/details_20250413_205527.md) |
31+
| └─ Infinite_Series | - | - | 0.154 | 7120 | - | 26.9s | MathDemon/Infinite_Series | [Details](details/GigaChat-2-Pro/details_20250413_205623.md) |
32+
| └─ Integration | - | - | 0.000 | 7730 | - | 30.9s | MathDemon/Integration | [Details](details/GigaChat-2-Pro/details_20250413_205727.md) |
33+
| └─ Sequences_and_Limits | - | - | 0.000 | 3723 | - | 17.9s | MathDemon/Sequences_and_Limits | [Details](details/GigaChat-2-Pro/details_20250413_205814.md) |
34+
| └─ Series_of_Functions | - | - | 0.167 | 6887 | - | 27.1s | MathDemon/Series_of_Functions | [Details](details/GigaChat-2-Pro/details_20250413_205854.md) |
35+
| GigaChat-Max | 0.168 | 0.168 | 0.065 | 147924 | Вы - полезный помощник по м... | 784.1s | RussianMath, MathDemon | [RussianMath](details/GigaChat-Max/details_20250413_204220.md), [MathDemon](details/GigaChat-Max/details_20250413_205901.md) |
36+
| └─ Approximation_by_Polynomials | - | - | 0.000 | 5190 | - | 27.9s | MathDemon/Approximation_by_Polynomials | [Details](details/GigaChat-Max/details_20250413_205329.md) |
37+
| └─ Continuous_Functions | - | - | 0.143 | 2743 | - | 18.2s | MathDemon/Continuous_Functions | [Details](details/GigaChat-Max/details_20250413_205351.md) |
38+
| └─ Convex_Functions | - | - | 0.091 | 4103 | - | 28.1s | MathDemon/Convex_Functions | [Details](details/GigaChat-Max/details_20250413_205427.md) |
39+
| └─ Differentiation | - | - | 0.000 | 2079 | - | 21.0s | MathDemon/Differentiation | [Details](details/GigaChat-Max/details_20250413_205455.md) |
40+
| └─ Improper_Integrals | - | - | 0.111 | 6634 | - | 38.6s | MathDemon/Improper_Integrals | [Details](details/GigaChat-Max/details_20250413_205547.md) |
41+
| └─ Infinite_Series | - | - | 0.154 | 5528 | - | 32.2s | MathDemon/Infinite_Series | [Details](details/GigaChat-Max/details_20250413_205629.md) |
42+
| └─ Integration | - | - | 0.000 | 10020 | - | 56.3s | MathDemon/Integration | [Details](details/GigaChat-Max/details_20250413_205752.md) |
43+
| └─ Sequences_and_Limits | - | - | 0.000 | 4030 | - | 23.7s | MathDemon/Sequences_and_Limits | [Details](details/GigaChat-Max/details_20250413_205820.md) |
44+
| └─ Series_of_Functions | - | - | 0.083 | 4087 | - | 30.8s | MathDemon/Series_of_Functions | [Details](details/GigaChat-Max/details_20250413_205858.md) |
45+
| GigaChat-2 | 0.116 | 0.116 | 0.042 | 103214 | Вы - полезный помощник по м... | 337.6s | RussianMath, MathDemon | [RussianMath](details/GigaChat-2/details_20250413_204220.md), [MathDemon](details/GigaChat-2/details_20250413_205901.md) |
46+
| └─ Approximation_by_Polynomials | - | - | 0.000 | 2307 | - | 6.3s | MathDemon/Approximation_by_Polynomials | [Details](details/GigaChat-2/details_20250413_205308.md) |
47+
| └─ Continuous_Functions | - | - | 0.000 | 3390 | - | 7.9s | MathDemon/Continuous_Functions | [Details](details/GigaChat-2/details_20250413_205340.md) |
48+
| └─ Convex_Functions | - | - | 0.000 | 5135 | - | 13.2s | MathDemon/Convex_Functions | [Details](details/GigaChat-2/details_20250413_205412.md) |
49+
| └─ Differentiation | - | - | 0.111 | 4267 | - | 10.4s | MathDemon/Differentiation | [Details](details/GigaChat-2/details_20250413_205444.md) |
50+
| └─ Improper_Integrals | - | - | 0.111 | 4432 | - | 9.0s | MathDemon/Improper_Integrals | [Details](details/GigaChat-2/details_20250413_205517.md) |
51+
| └─ Infinite_Series | - | - | 0.154 | 4210 | - | 15.2s | MathDemon/Infinite_Series | [Details](details/GigaChat-2/details_20250413_205612.md) |
52+
| └─ Integration | - | - | 0.000 | 424 | - | 18.9s | MathDemon/Integration | [Details](details/GigaChat-2/details_20250413_205715.md) |
53+
| └─ Sequences_and_Limits | - | - | 0.000 | 4084 | - | 8.6s | MathDemon/Sequences_and_Limits | [Details](details/GigaChat-2/details_20250413_205804.md) |
54+
| └─ Series_of_Functions | - | - | 0.000 | 3575 | - | 15.7s | MathDemon/Series_of_Functions | [Details](details/GigaChat-2/details_20250413_205843.md) |
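
The per-model MathDemon score in this table appears consistent with an unweighted mean of the nine subset scores. That is an inference from the numbers shown, not something the commit states. The sketch below checks it for gpt-4o-mini using the subset scores from the rows above; the helper name `macro_average` is hypothetical.

```python
# Subset scores for gpt-4o-mini, copied from the leaderboard rows above.
subset_scores = {
    "Approximation_by_Polynomials": 0.429,
    "Continuous_Functions": 0.143,
    "Convex_Functions": 0.182,
    "Differentiation": 0.111,
    "Improper_Integrals": 0.111,
    "Infinite_Series": 0.154,
    "Integration": 0.091,
    "Sequences_and_Limits": 0.000,
    "Series_of_Functions": 0.333,
}


def macro_average(scores):
    """Unweighted mean: every subset counts equally, whatever its size."""
    return sum(scores.values()) / len(scores)


# Rounded to three decimals to match the table's precision.
mathdemon_score = round(macro_average(subset_scores), 3)
```

The result, 0.173, matches the MathDemon Score reported for gpt-4o-mini. The same calculation gives 0.095 for GigaChat-2-Max, which also matches its row.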
55+
=======
