This bonus material implements three different methods for evaluating models on MMLU.
- Method 1 is meant as an intuitive introduction
- Method 2 is the most widely used method in practice
- Method 3 is a more robust method that is better suited for reasoning models
- Please note that the code loads the MMLU dataset from the Hugging Face hub, so you need to install the `datasets` Python library before running the code: `pip install datasets` or `uv add datasets`
- In the following sections, we apply the MMLU evaluation methods to one subset (`high_school_mathematics`)
- Note that there are many other interesting subsets; this one is chosen for simplicity and efficiency:
  - Use `--subsets list` to list other available subsets
  - Use, for example, `--subsets "astronomy,high_school_mathematics"` to select multiple subsets
  - Use `--subsets "all"` to evaluate on all subsets
- (Note that for simplicity and code readability, we focus on a zero-shot, as opposed to a 5-shot, setting.)
Note: If you are not a uv user, replace `uv run ...py` with `python ...py` in the examples below.
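As a quick check that the setup works, an MMLU subset can be loaded directly via the `datasets` library. The following is a minimal sketch; the `cais/mmlu` dataset name and the field layout shown here are assumptions about the Hugging Face copy of MMLU, not necessarily the loading path the scripts use internally:

```python
from datasets import load_dataset

# Load one MMLU subset (the high_school_mathematics test split has 270 questions)
ds = load_dataset("cais/mmlu", "high_school_mathematics", split="test")

example = ds[0]
print(example["question"])  # question text
print(example["choices"])   # list of the 4 answer strings
print(example["answer"])    # index (0-3) of the correct choice
```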
- We let the model generate the answer
- We extract the first generated A/B/C/D letter and compare it to the correct answer
- This is the most intuitive method, but the downside is that the model may not respond with one of the letters A/B/C/D at all
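To make the extraction step concrete, below is a minimal sketch of the idea; it is not the exact code from `1_letter_matching.py`, just an illustration of pulling the first standalone A/B/C/D letter out of the generated text:

```python
import re

def extract_first_letter(generated_text):
    # \b word boundaries ensure we match a standalone letter,
    # not e.g. the "A" at the start of "Answer"
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else None

print(extract_first_letter("The answer is B. 11"))  # -> "B"
print(extract_first_letter("I think it's eleven"))  # -> None (method 1 fails here)
```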
➜ 02_mmlu git:(main) ✗ uv run 1_letter_matching.py --which_model base
Using Apple Silicon GPU (MPS)
Using device: mps
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date
MMLU 50 acc=0.240 [high_school_mathematics]
MMLU 100 acc=0.200 [high_school_mathematics]
MMLU 150 acc=0.193 [high_school_mathematics]
MMLU 200 acc=0.235 [high_school_mathematics]
MMLU 250 acc=0.224 [high_school_mathematics]
MMLU letter accuracy: 58/270 = 21.48% in 69.1s
{'accuracy': 0.21481481481481482, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
➜ 02_mmlu git:(main) ✗ uv run 1_letter_matching.py --which_model reasoning
Using Apple Silicon GPU (MPS)
Using device: mps
qwen3-0.6B-reasoning.pth: 100% (1433 MiB / 1433 MiB)
tokenizer-reasoning.json: 100% (10 MiB / 10 MiB)
MMLU 50 acc=0.220 [high_school_mathematics]
MMLU 100 acc=0.230 [high_school_mathematics]
MMLU 150 acc=0.220 [high_school_mathematics]
MMLU 200 acc=0.210 [high_school_mathematics]
MMLU 250 acc=0.216 [high_school_mathematics]
MMLU letter accuracy: 57/270 = 21.11% in 43.6s
{'accuracy': 0.2111111111111111, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
- We run the prompt through the model and get the log-probabilities (log-probs) for the next token (see chapter 4 for a discussion of log-probs)
- For each of the letter choices A/B/C/D, we then determine which token ID would come first if we appended that letter to the prompt
- Then, we compare the log-probs of those four token IDs and pick the letter with the highest one (max)
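Below is a minimal sketch of this scoring procedure. Note that the `model` (returning a `(batch, seq_len, vocab)` logits tensor) and `tokenizer.encode` interfaces are assumptions for illustration, not the exact APIs used by `2_logprob.py`:

```python
import torch

def pick_letter_by_logprob(model, tokenizer, prompt):
    input_ids = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = model(input_ids)[0, -1]          # logits for the next token only
    log_probs = torch.log_softmax(logits, dim=-1)

    letters = ["A", "B", "C", "D"]
    # First token ID each letter maps to if appended to the prompt
    first_ids = [tokenizer.encode(ltr)[0] for ltr in letters]
    scores = [log_probs[tok_id].item() for tok_id in first_ids]
    return letters[scores.index(max(scores))]     # letter with the highest log-prob
```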
➜ 02_mmlu git:(main) ✗ uv run 2_logprob.py --which_model base
Using Apple Silicon GPU (MPS)
Using device: mps
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date
MMLU 50 acc=0.360 [high_school_mathematics]
MMLU 100 acc=0.420 [high_school_mathematics]
MMLU 150 acc=0.400 [high_school_mathematics]
MMLU 200 acc=0.370 [high_school_mathematics]
MMLU 250 acc=0.344 [high_school_mathematics]
MMLU letter accuracy (log-prob): 93/270 = 34.44% in 22.5s
{'accuracy': 0.34444444444444444, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
➜ 02_mmlu git:(main) ✗ uv run 2_logprob.py --which_model reasoning
Using Apple Silicon GPU (MPS)
Using device: mps
✓ qwen3/qwen3-0.6B-reasoning.pth already up-to-date
✓ qwen3/tokenizer-reasoning.json already up-to-date
MMLU 50 acc=0.220 [high_school_mathematics]
MMLU 100 acc=0.230 [high_school_mathematics]
MMLU 150 acc=0.220 [high_school_mathematics]
MMLU 200 acc=0.210 [high_school_mathematics]
MMLU 250 acc=0.216 [high_school_mathematics]
MMLU letter accuracy (log-prob): 57/270 = 21.11% in 22.4s
{'accuracy': 0.2111111111111111, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
- Instead of looking up the log-prob of each of the letters A/B/C/D, a more robust scoring approach (especially for reasoning models) is to feed the model the letter along with the complete answer string
- For our example, the answer strings are "A. 7", "B. 11", "C. 16", "D. 8"
- This method is known by the unfortunate term "teacher forcing"
- This method is the most reliable, but the caveat is that it takes 4x longer than the log-probability approach in method 2 (since we feed the model all 4 answer variants)
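Below is a minimal sketch of teacher-forced scoring, under the same assumed `model`/`tokenizer` interfaces as in the method 2 sketch:

```python
import torch

def score_answer(model, tokenizer, prompt, answer):
    prompt_ids = tokenizer.encode(prompt)
    answer_ids = tokenizer.encode(answer)        # e.g. answer = "B. 11"
    input_ids = torch.tensor([prompt_ids + answer_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids)[0], dim=-1)
    # The token at position i is predicted by the logits at position i - 1,
    # so we sum the log-probs the model assigns to each answer token
    return sum(
        log_probs[len(prompt_ids) + i - 1, tok_id].item()
        for i, tok_id in enumerate(answer_ids)
    )

# The prediction is the answer string with the highest score, e.g.:
# answers = ["A. 7", "B. 11", "C. 16", "D. 8"]
# best = max(answers, key=lambda a: score_answer(model, tokenizer, prompt, a))
```

Note that some implementations normalize the summed log-prob by the number of answer tokens so that longer answers are not penalized; whether this matters depends on how similar the answer lengths are.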
➜ 02_mmlu git:(main) ✗ uv run 3_teacher_forcing.py --which_model base
Using Apple Silicon GPU (MPS)
Using device: mps
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
✓ qwen3/tokenizer-base.json already up-to-date
MMLU 50 acc=0.360 [high_school_mathematics]
MMLU 100 acc=0.310 [high_school_mathematics]
MMLU 150 acc=0.307 [high_school_mathematics]
MMLU 200 acc=0.315 [high_school_mathematics]
MMLU 250 acc=0.312 [high_school_mathematics]
MMLU letter accuracy (teacher-forced): 86/270 = 31.85% in 67.9s
{'accuracy': 0.31851851851851853, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
➜ 02_mmlu git:(main) ✗ uv run 3_teacher_forcing.py --which_model reasoning
Using Apple Silicon GPU (MPS)
Using device: mps
✓ qwen3/qwen3-0.6B-reasoning.pth already up-to-date
✓ qwen3/tokenizer-reasoning.json already up-to-date
MMLU 50 acc=0.240 [high_school_mathematics]
MMLU 100 acc=0.250 [high_school_mathematics]
MMLU 150 acc=0.267 [high_school_mathematics]
MMLU 200 acc=0.255 [high_school_mathematics]
MMLU 250 acc=0.280 [high_school_mathematics]
MMLU letter accuracy (teacher-forced): 78/270 = 28.89% in 68.8s
{'accuracy': 0.28888888888888886, 'num_examples': 270, 'subsets': ['high_school_mathematics'], 'split': 'test'}
- This random guessing baseline is just to put the numbers above into perspective
- A model that guesses randomly with uniform (equal) probability across all answers is expected to achieve $25\%$ accuracy
- However, for a random guesser, we can expect deviations from the $25\%$ (depending on the sample size)
- For instance, we can model one evaluation run as a binomial with $K$ correct out of $n$ questions:
  - $K \sim \mathrm{Binomial}(n, p)$ with $p = \tfrac{1}{4}$ and $n =$ number of questions
  - Accuracy $A = K/n$
- Let's walk through this for the high_school_mathematics subset with $n = 270$
- In general, the properties of the binomial are:
  - Mean: $\mathbb{E}[K] = np$
  - SD: $\sigma_K = \sqrt{np(1-p)}$
- For the accuracy $A = K/n$:
  - Mean: $\mathbb{E}[A] = p = 0.25$
  - SD: $\sigma_A = \sqrt{\tfrac{p(1-p)}{n}}$
- Plugging in $n = 270$:
  - $\mathbb{E}[A] = 25\%$
  - $\sigma_A = \sqrt{\tfrac{0.25 \cdot 0.75}{270}} \approx 2.64\%$
- Converting the one-standard-deviation ($\pm 1\sigma$) accuracy bounds to counts:
  - Lower: $K \le \lfloor 270 \, (0.25 - 0.02636) \rfloor = 60$
  - Upper: $K \ge \lceil 270 \, (0.25 + 0.02636) \rceil = 75$
  - (Inside the band is $K = 61, \dots, 74$; equivalently $A \in [22.36\%,\, 27.64\%]$)
- So, the probability of falling outside this band is:

$$
z = \pm\,\frac{75 - 67.5}{\sqrt{270 \cdot 0.25 \cdot 0.75}} \approx \pm 1.054,
\qquad
\Pr(|A - 0.25| > 0.02636) \approx 2\bigl(1 - \Phi(1.054)\bigr) \approx 0.292
$$

- This means that in about $29.2\%$ of runs where the model guesses uniformly at random, we get an accuracy below $22.36\%$ or above $27.64\%$
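For readers who want to double-check these numbers, here is a small standalone sketch (not one of the scripts in this folder) that reproduces the normal approximation and, for comparison, computes the exact binomial tail probability; the exact value comes out slightly higher since the normal curve is only an approximation here:

```python
from math import comb, erf, sqrt

n, p = 270, 0.25
sigma_K = sqrt(n * p * (1 - p))                   # ~7.115

def Phi(x):
    # Standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

# Normal approximation used above
z = (75 - n * p) / sigma_K                        # ~1.054
print(f"Normal approximation: {2 * (1 - Phi(z)):.3f}")   # ~0.292

# Exact binomial probability of landing outside the band K = 61..74
inside = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(61, 75))
print(f"Exact binomial:       {1 - inside:.3f}")  # ~0.33
```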
Below is an empirical look:
➜ 02_mmlu git:(main) ✗ uv run 0_random_guessing_baseline.py --subset "high_school_mathematics"
Subset: high_school_mathematics | split: test | n=270
Gold distribution provided in the dataset:
A: 57 (21.11%)
B: 71 (26.30%)
C: 71 (26.30%)
D: 71 (26.30%)
Random guessing over 10,000 trials (uniform A/B/C/D, seed=42):
Mean accuracy: 24.98%
Std dev across trials: 2.65%
Selected quantiles of accuracy:
1% quantile: 18.889%
5% quantile: 20.741%
25% quantile: 23.333%
50% quantile: 24.815%
75% quantile: 26.667%
95% quantile: 29.259%
99% quantile: 31.111%
Full frequency table of accuracies (rounded):
0.160: 1 times (0.01%)
0.170: 11 times (0.11%)
0.180: 38 times (0.38%)
0.190: 124 times (1.24%)
0.200: 302 times (3.02%)
0.210: 562 times (5.62%)
0.220: 612 times (6.12%)
0.230: 1254 times (12.54%)
0.240: 1619 times (16.19%)
0.250: 1096 times (10.96%)
0.260: 1525 times (15.25%)
0.270: 1248 times (12.48%)
0.280: 572 times (5.72%)
0.290: 565 times (5.65%)
0.300: 281 times (2.81%)
0.310: 132 times (1.32%)
0.320: 28 times (0.28%)
0.330: 24 times (0.24%)
0.340: 5 times (0.05%)
0.360: 1 times (0.01%)

