## Problem

When running benchmarks with the `--debug` flag, the following benchmarks fail:

- AIME24
- AIME25
- AIW (Alice in Wonderland)
- AMC23
- HMMT
- MATH500

## Reproduction Steps

```bash
python -m eval.eval \
    --model hf \
    --tasks AIME24 \
    --model_args "pretrained=microsoft/DialoGPT-medium" \
    --batch_size 1 \
    --output_path logs \
    --debug
```
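
To confirm the failure affects every benchmark listed above, the same invocation can be looped over each task. This is only a convenience sketch reusing the arguments from the repro command; it assumes the task names are registered exactly as listed (e.g. that `AIW` is the task name for Alice in Wonderland).

```bash
# Re-run the repro command for each affected benchmark,
# with the same model, batch size, and output path as above.
for task in AIME24 AIME25 AIW AMC23 HMMT MATH500; do
    python -m eval.eval \
        --model hf \
        --tasks "$task" \
        --model_args "pretrained=microsoft/DialoGPT-medium" \
        --batch_size 1 \
        --output_path logs \
        --debug
done
```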