## Problem

When running benchmarks with the `--debug` flag, the following benchmarks fail:

- AIME24
- AIME25
- AIW (Alice in Wonderland)
- AMC23
- HMMT
- MATH500

## Reproduction Steps

```bash
python -m eval.eval \
    --model hf \
    --tasks AIME24 \
    --model_args "pretrained=microsoft/DialoGPT-medium" \
    --batch_size 1 \
    --output_path logs \
    --debug
```
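
To confirm the failure affects every benchmark listed above, the same invocation can be looped over each task. This is only a convenience sketch reusing the arguments from the repro command; it assumes the task names are registered exactly as listed (e.g. that `AIW` is the task name for Alice in Wonderland).

```bash
# Re-run the repro command for each affected benchmark,
# with the same model, batch size, and output path as above.
for task in AIME24 AIME25 AIW AMC23 HMMT MATH500; do
    python -m eval.eval \
        --model hf \
        --tasks "$task" \
        --model_args "pretrained=microsoft/DialoGPT-medium" \
        --batch_size 1 \
        --output_path logs \
        --debug
done
```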