62 commits
fa10754
initial skeleton of the script
Feb 1, 2026
318490f
Pipeline Script
Feb 1, 2026
8cfcb26
Added scripts to evaluate following metrics for a benchmark:
sardug10 Feb 3, 2026
dc8448a
updated folder structure
Feb 3, 2026
aa15e1e
Fixed precommit errors and changed folder name
sardug10 Feb 4, 2026
255afa6
Added HumanEval script
Feb 4, 2026
1897537
Merge branch 'p17/scripts/benchmark-execution' of https://github.com/…
sardug10 Feb 4, 2026
28666ca
Merge branch 'main' into p17/scripts/benchmark-execution
nitin-vig Feb 4, 2026
6bf1e97
Merge branch 'p17/scripts/benchmark-execution' of https://github.com/…
sardug10 Feb 4, 2026
9240b70
Fixed precommit check
sardug10 Feb 4, 2026
6edd3da
Added scripts to evaluate following metrics for Indic Glue benchmark
sardug10 Feb 4, 2026
8f180f2
Merge branch 'main' into p17/scripts/benchmark-execution
nitin-vig Feb 4, 2026
2fa9e25
added benchmark scripts for IndiGlue
Feb 6, 2026
9295181
Merge branch 'main' into p17/scripts/benchmark-execution
nitin-vig Feb 6, 2026
038ca6e
Added OLMES pipeline
Feb 7, 2026
692addf
fixed colab exection
Feb 8, 2026
ff68b08
fixed colab exection
Feb 8, 2026
e330549
fixed colab exection
Feb 8, 2026
cd852b2
fixing dependency conflicts
Feb 8, 2026
f6e48ca
fixing dependency conflicts
Feb 8, 2026
c7629b4
fixing dependency conflicts
Feb 9, 2026
a5f259b
fixing dependency conflicts
Feb 9, 2026
a330dc7
Added colab specific checks
Feb 9, 2026
b2ee2d8
Added colab specific code
Feb 10, 2026
54f0a54
Added Benhcmark plan
Feb 11, 2026
6427895
added benchmark config
Feb 11, 2026
4c63912
removed unused code
Feb 12, 2026
3b6e8f2
updated pipelines for reporting and fast feedback
Feb 12, 2026
df61278
updated pipelines for reporting and fast feedback
Feb 12, 2026
ec26404
Updated pipeline to add the benchmark execution time and total time
gaurav-chawla-2 Feb 13, 2026
ed95181
Added command line args to the test pipeline script
gaurav-chawla-2 Feb 13, 2026
ef9b7cd
Implemented HF_TOKEN for restricted datasets
gaurav-chawla-2 Feb 13, 2026
76677ac
updated documentation
Feb 13, 2026
c7a5485
Merge pull request #467 from The-School-of-AI/p17/scripts/benchmark-e…
nitin-vig Feb 13, 2026
425617e
removed code execution benhcmarks
Feb 15, 2026
e4bcf2d
Merge remote changes and apply GPQA mirror fix
Feb 15, 2026
c492b59
updated config
Feb 15, 2026
f90ae21
Added context length benchmarks
Feb 16, 2026
375c3d5
Fixed conflict in requirements
gaurav-chawla-2 Feb 16, 2026
7c9683b
Update requirements.txt
gaurav-chawla-2 Feb 16, 2026
ce4e015
Update indic_glue.py
gaurav-chawla-2 Feb 16, 2026
21f5234
Merge pull request #490 from The-School-of-AI/p17/scripts/benchmark-e…
nitin-vig Feb 16, 2026
5ba3235
Fixed shell scripts
gaurav-chawla-2 Feb 16, 2026
1de9e37
Merge pull request #492 from The-School-of-AI/p17/scripts/benchmark-e…
nitin-vig Feb 16, 2026
e956845
Added batch size as arg and updated eval-runner to use it
gaurav-chawla-2 Feb 17, 2026
3e58f6f
Merge pull request #494 from The-School-of-AI/p17/scripts/benchmark-e…
nitin-vig Feb 17, 2026
4dea2d9
added worker support
Feb 17, 2026
f9aa78a
added multi worker support
Feb 17, 2026
a0fab3d
removed vllm
nitin-vig Feb 17, 2026
21795be
added batch capability
nitin-vig Feb 17, 2026
e17dc43
Fix the README and running instructions
gaurav-chawla-2 Feb 17, 2026
6abacf4
Merge pull request #495 from The-School-of-AI/p17/scripts/benchmark-e…
nitin-vig Feb 17, 2026
f0d00db
used granular benchmarks for reporting
nitin-vig Feb 21, 2026
3f5b96c
updated benchmarks for context length
nitin-vig Feb 21, 2026
eb87d45
updated docs
nitin-vig Feb 21, 2026
d0f924c
Pre-commit hooks fixes
gaurav-chawla-2 Feb 21, 2026
0860efb
Sample run on H200 1 GPU
gswin334 Feb 21, 2026
a094c59
Fixed pre-commit hooks
gaurav-chawla-2 Feb 21, 2026
df58d0b
Merge branch 'p17/scripts/benchmark-execution' of https://github.com/…
gaurav-chawla-2 Feb 21, 2026
7b5ab27
Update .pre-commit-config.yaml
gaurav-chawla-2 Feb 21, 2026
228e527
Update .pre-commit-config.yaml
gaurav-chawla-2 Feb 21, 2026
7cae674
Merge branch 'staging' into p17/scripts/benchmark-execution
nitin-vig Feb 25, 2026
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml

```diff
@@ -7,7 +7,7 @@ repos:
       - id: black

   - repo: https://github.com/pycqa/isort
-    rev: 5.13.2
+    rev: 8.0.0
     hooks:
       - id: isort
         args: ["--profile=black"]
@@ -18,7 +18,7 @@ repos:
       - id: ruff

   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.6.0
+    rev: v6.0.0
     hooks:
       - id: check-json
         exclude: tsconfig.*\.json$
```
8 changes: 8 additions & 0 deletions experiments/17_final_pretraining_benchmarks/.gitignore

```text
__pycache__/
benchmark-results/
olmes_runs/
*.log
all_groups.txt
all_tasks.txt
all_tasks_clean.txt
```

> [!IMPORTANT]
> **Unified Execution**: This benchmarking suite is now orchestrated from the `02-OLMES-v1` directory. All commands should be run from that location to ensure correct path resolution and dependency management.

# Pre-training & SFT Benchmarking Suite

This module provides a robust, configurable framework to run specific benchmarks based on the model training stage (1B, 3B, 8B, 70B) and phase (Pre-training/SFT). It leverages the `lm-evaluation-harness` for standard tasks and supports custom scripts for advanced evaluations.

## 🚀 Quick Start

### 1. Installation
The evaluator automatically detects and uses a local `.venv` if present.
```bash
# Recommended: create and set up a venv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
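
As noted above, the runner prefers a local virtual environment. A minimal sketch of how that detection can work, assuming `.venv` sits next to the script; this is illustrative, not the actual `eval_runner.py` code:

```python
# Hypothetical sketch of local-venv detection (not the actual eval_runner.py logic).
import os
import sys


def resolve_python() -> str:
    """Prefer ./.venv/bin/python3 if it exists and is executable,
    otherwise fall back to the interpreter running this script."""
    venv_python = os.path.join(os.getcwd(), ".venv", "bin", "python3")
    if os.path.isfile(venv_python) and os.access(venv_python, os.X_OK):
        return venv_python
    return sys.executable


if __name__ == "__main__":
    print(f"Benchmarks would be launched with: {resolve_python()}")
```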

### 2. Run a Trial (Quick Verification)
Run a limited execution (5 samples per task) to ensure weights and metrics load correctly:
```bash
# Must be run from inside 02-OLMES-v1/
python3 src/eval_runner.py \
--config configs/stage_1b.yaml \
--phase pretraining \
--model_args "pretrained=HuggingFaceTB/SmolLM2-135M" \
--device "cpu" \
--trial
```

### 3. Full Execution (Production)
Run the full benchmarking suite on your checkpoint:
```bash
# Must be run from inside 02-OLMES-v1/
python3 src/eval_runner.py \
--config configs/stage_1b.yaml \
--phase pretraining \
--model_args "pretrained=path/to/checkpoint" \
--device "cuda:0" \
--batch_size "1"
```

## ⚡ Batch Size Recommendations

The `--batch_size` parameter has a major impact on total evaluation time. **We recommend using a fixed value over `auto`** to avoid the slow per-task auto-detection phase.

| Mode | Value | Recommended For | Rationale |
| :--- | :--- | :--- | :--- |
| **Standard** | `1` | **Default / Robustness** | Skips slow auto-detection. Safest for all hardware and model sizes. |
| **High Perf** | `32`, `64` | **Optimized Runs** | Fastest execution if you know your GPU's VRAM limits. |
| **Auto** | `auto` | *Discouraged* | Convenient but adds significant overhead by searching for batch size on every task. |

---

## 📂 Output Structure
Every execution creates a unique, timestamped directory to prevent data loss and ensure clean logs:

```text
benchmark-results/
└── [stage]/ (e.g., 1b, 8b)
└── [phase]/ (e.g., pretraining, sft)
└── [timestamp]/ (e.g., 20240201_123000)
├── incremental_results.json <-- Saved after every task
├── logs/
│ └── execution.log <-- Full stdout/stderr capture
├── reports/
│ └── summary_report.md <-- Human-readable Markdown
└── harness_raw/ <-- Raw JSON from lm-eval per task
├── mmlu.json
└── gsm8k.json
```

## 🛠 Features

### 1. Granular Reporting
The YAML configuration supports specific subjects and subsets. The evaluator will automatically:
- Expand `subjects` (e.g., MMLU subjects) into individual harness tasks (a minimal sketch of this expansion and aggregation follows the list).
- Aggregate these sub-tasks into a parent benchmark score.
- Generate a nested Markdown report showing both the **Aggregate** and **↳ Sub-task** scores.
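
A minimal sketch of that expansion and aggregation, assuming a plain unweighted mean over sub-task scores (the real runner may weight by sample count); function names are illustrative:

```python
# Hypothetical sketch of subject expansion and score aggregation
# (illustrative only; not the project's actual implementation).
from statistics import mean


def expand_subjects(base_task: str, subjects: list[str]) -> list[str]:
    """e.g. mmlu + [anatomy, astronomy] -> [mmlu_anatomy, mmlu_astronomy]."""
    return [f"{base_task}_{s}" for s in subjects]


def aggregate(sub_scores: dict[str, float]) -> float:
    """Parent benchmark score as an unweighted mean of sub-task scores (assumption)."""
    return mean(sub_scores.values())


tasks = expand_subjects("mmlu", ["anatomy", "astronomy", "college_biology"])
scores = dict(zip(tasks, (0.55, 0.61, 0.58)))  # placeholder per-sub-task accuracies
print(tasks)
print(f"aggregate: {aggregate(scores):.3f}")
```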

### 2. Incremental Saving
Results are saved to `incremental_results.json` immediately after each benchmark completes. If a 24-hour run crashes at hour 23, you still have 23 hours of data.
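
A minimal sketch of what that per-task save can look like; the exact key layout of `incremental_results.json` is an assumption:

```python
# Hypothetical sketch of per-task incremental saving (key names are assumptions).
import json
from pathlib import Path


def save_incremental(results_path: Path, task: str, metrics: dict) -> None:
    """Merge one finished task's metrics into the JSON file and write it back,
    so a crash later in the run cannot lose earlier results."""
    existing = json.loads(results_path.read_text()) if results_path.exists() else {}
    existing[task] = metrics
    results_path.write_text(json.dumps(existing, indent=2))


# Called once after every benchmark completes, e.g.:
save_incremental(Path("incremental_results.json"), "gsm8k", {"exact_match": 0.41})  # placeholder score
```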

### 3. Intelligent Execution
- **Venv Detection**: Automatically uses `.venv/bin/python3` if available.
- **Robust Parsing**: Extracts metrics from `lm-eval` output regardless of the filter used (e.g., `acc`, `exact_match`, `acc_norm`); a minimal sketch follows this list.
- **Conflict Resolution**: Strips potential `device` conflicts between CLI arguments and `model_args`.
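
The robust-parsing behaviour mentioned above could look roughly like this; the metric preference order and the handling of lm-eval's `metric,filter` keys are assumptions, not the project's code:

```python
# Hypothetical sketch of tolerant metric extraction from an lm-eval per-task
# results dict (preference order is an assumption).
PREFERRED_METRICS = ("exact_match", "acc_norm", "acc", "f1")


def extract_metric(task_results: dict):
    """Return the first preferred metric found, tolerating filter-suffixed keys
    such as 'acc,none' or 'exact_match,strict-match'. Returns None if no match."""
    for name in PREFERRED_METRICS:
        for key, value in task_results.items():
            if key == name or key.startswith(f"{name},"):
                return name, float(value)
    return None


print(extract_metric({"acc,none": 0.62, "acc_stderr,none": 0.01}))  # -> ('acc', 0.62)
```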

## ⚙️ Configuration (YAML)
Each benchmark in `configs/*.yaml` can specify the following fields (a hypothetical example entry is shown after the list):
- `harness_task`: Base task name in `lm-evaluation-harness`.
- `subjects`: List of MMLU-style subjects to run specifically.
- `tasks`: List of specific subsets (e.g., for BLiMP).
- `subset`: Specific dataset subset (e.g., for TriviaQA).
- `shots`: Number of few-shot examples.
- `phases`: `[pretraining]` or `[sft]`.
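
A hypothetical entry illustrating those fields, parsed here with PyYAML (assumed available); the real schema and nesting in `configs/stage_*.yaml` may differ:

```python
# Hypothetical config entry; field names mirror the list above, but the actual
# configs/*.yaml schema may nest or name things differently.
import yaml  # assumes PyYAML is installed

EXAMPLE = """
benchmarks:
  - name: mmlu
    harness_task: mmlu
    subjects: [anatomy, astronomy, college_biology]
    shots: 5
    phases: [pretraining]
  - name: triviaqa
    harness_task: triviaqa
    subset: rc.nocontext
    shots: 0
    phases: [pretraining, sft]
"""

for bench in yaml.safe_load(EXAMPLE)["benchmarks"]:
    print(bench["name"], bench.get("shots"), bench.get("phases"))
```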

## 📊 Benchmark Analysis (NEW!)

Before finalizing your evaluation pipeline, use the **Benchmark Analysis Toolkit** to investigate datasets:

```bash
cd benchmark_analysis
./run_analysis.sh
```

This toolkit helps you:
- **Count tokens** in test datasets (for FLOPs calculation); a rough, toolkit-independent sketch follows this list
- **Identify evaluation metrics** (accuracy, F1, exact match, etc.)
- **Estimate computational requirements** (GPU-hours, FLOPs)
- **Generate reports** (CSV, JSON, Markdown)
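
For context, a rough token count over a test split can be done in a few lines. This sketch is independent of the toolkit; the dataset, field name, and tokenizer are arbitrary examples, not what the toolkit actually uses:

```python
# Rough, toolkit-independent token count over one benchmark's test split.
from datasets import load_dataset          # Hugging Face datasets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
test_split = load_dataset("gsm8k", "main", split="test")

total_tokens = sum(len(tokenizer.encode(row["question"])) for row in test_split)
print(f"GSM8K test questions: {total_tokens:,} tokens across {len(test_split)} examples")
```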

📖 **Documentation**: See `benchmark_analysis/INDEX.md` for the complete guide

**Quick Links**:
- `benchmark_analysis/QUICKSTART.md` - Get started in 5 minutes
- `benchmark_analysis/OVERVIEW.md` - Understand the toolkit
- `benchmark_analysis/WORKFLOW.md` - Step-by-step process

---

## 🚧 Future Work & TODOs

- [x] **Benchmarks List Review**: ✅ Used the `benchmark_analysis/` toolkit to analyze all 25 benchmarks
- [ ] **Custom Scripts Implementation**: Implement logic for the placeholder scripts in `scripts/`:
- [ ] `simpleqa.py` (SimpleQA Verified)
- [ ] `aime2025.py` (AIME 2025)
- [ ] `indic_glue.py` (IndicGLUE: iitp-movie-reviews, bbc-news-articles, article-genre-classification)
- [ ] `indic_qa.py` (IndicQA)
- [ ] `leval.py` (L-Eval)
- [ ] `ruler.py` (RULER)
- [ ] `indic_bias.py` (Indic-Bias / FairITales)
- [ ] `swebench.py` (SWE-bench)
- [ ] `helm_safety.py` (HELM Safety)
- [ ] **Benchmarks List Review**: Review `benchmarks-list.txt` to ensure it is exhaustive and aligns with project deliverables.
- [ ] **YAML Config Audit**: Comprehensive review of all task parameters (shots, mode, paradigm) in `1b`, `3b`, `8b`, `70b`, and `sft` YAMLs vs `Paradigms.md`.
- [ ] **Multi-GPU Support**: Add orchestration for native model parallelism and FSDP via `accelerate`.
- [ ] **MSGS Benchmark**: Investigate missing MSGS tasks in `v0.4.x` and re-enable or find substitutes.
- [ ] **Visualizer**: Create a simple utility to compare `incremental_results.json` files across different model stages.

---
*Refer to `01-EleutherAI-v1/benchmarks-list.txt` for the full list of 25 supported benchmarks.*
*Use `benchmark_analysis/` to investigate token counts and metrics before finalizing configs.*

25 changes: 25 additions & 0 deletions benchmarks-list.txt

```text
Benchmark
MMLU
TriviaQA
MMLU-Pro
GPQA Diamond
GSM8K
BBH (Big Bench Hard)
ARC-Challenge
MATH
IFEval
SimpleQA_Verified
HumanEval
APPS (Automated Programming Progress Standard)
AIME 2025
"MSGS (Mixed Signals Generalization Set)(Diagnostics)"
BLiMP
IndicGLUE
IndicQA
L-Eval (Long Context Evaluation Suite)
RULER
TruthfulQA
Indic-Bias (FairITales)
IndicMMLU-Pro
HellaSwag
Winogrande
```