
Commit d438332

Adds VideoMathQA - Task Designed to Evaluate Mathematical Reasoning in Real-World Educational Videos (#702)
* Adds VideoMathQA (https://mbzuai-oryx.github.io/VideoMathQA) task.
1 parent fd3c308 commit d438332

13 files changed: +1033 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@
 <details>
 <summary>We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. </summary>

+- [2025-06] 🎉🎉 We welcome the new task [VideoMathQA](https://mbzuai-oryx.github.io/VideoMathQA), designed to evaluate mathematical reasoning in real-world educational videos.
 - [2024-10] 🎉🎉 We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
 - [2024-10] 🎉🎉 We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap.
 - [2024-10] 🎉🎉 We welcome the new tasks [VDC](https://rese1f.github.io/aurora-web/) for video detailed captioning, [MovieChat-1K](https://rese1f.github.io/MovieChat/) for long-form video understanding, and [Vinoground](https://vinoground.github.io/), a temporal counterfactual LMM benchmark composed of 1000 short natural video-caption pairs. We also welcome the new models: [AuroraCap](https://github.com/rese1f/aurora) and [MovieChat](https://github.com/rese1f/MovieChat).

lmms_eval/tasks/videomathqa/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the “needle-in-a-multimodal-haystack” problem, where key information is sparse and spread across different formats and moments in the video.

[![Website](https://img.shields.io/badge/🌐_Project-Website-87CEEB)](https://mbzuai-oryx.github.io/VideoMathQA)
[![Dataset](https://img.shields.io/badge/🤗_Dataset-Access-green)](https://huggingface.co/datasets/MBZUAI/VideoMathQA)
[![🏅 Leaderboard (Reasoning)](https://img.shields.io/badge/🏅_Leaderboard-Reasoning-red)](https://hanoonar.github.io/VideoMathQA/#leaderboard-2)
[![🏅 Leaderboard (Direct)](https://img.shields.io/badge/🏅_Leaderboard-Direct-yellow)](https://hanoonar.github.io/VideoMathQA/#leaderboard)
[![GitHub](https://img.shields.io/badge/📂_GitHub-VideoMathQA-green)](https://github.com/mbzuai-oryx/VideoMathQA)
## Evaluation Strategies

**VideoMathQA** supports the following **evaluation strategies** to comprehensively assess model performance:

1. **MCQ and Multi-Binary (MBin)**
   - Tasks with `mcq` use a 5-way multiple-choice format.
   - Tasks with `mbin` use a stricter binary-pairwise evaluation format (correct vs. each distractor); see the scoring sketch after this list.
   - Both formats are available *with* and *without subtitles*, indicated by `_w_subtitles` in the task name.

2. **Direct Answering vs. Chain-of-Thought (CoT)**
   - Each task can be evaluated under **Direct** or **CoT** prompting.
   - Tasks containing `_cot` use CoT prompting, where models generate reasoning before the final answer.
   - Direct answering tasks expect the final answer only, without intermediate reasoning.
   - CoT tasks require post-processing to extract the final answer (see [Post Processing](#post-processing)).
   - We maintain **separate leaderboards** for the Direct and CoT settings.

3. **Step-wise CoT Evaluation**
   - For CoT tasks, we additionally evaluate the quality of the generated reasoning.
   - Each response is scored by comparing it against annotated solution steps (typically 4–10 steps).
   - Scoring is done by a small open-source model (Qwen-3-4B in thinking mode), which returns a score (0–10) and a rationale.
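A minimal sketch of the MBin scoring idea, under the assumption that a question is credited only when the ground-truth option is chosen in every one of its binary pairings; the authoritative aggregation is `videomathqa_multi_binary_aggregate_results` in `videomathqa/utils.py`, and the `score_mbin` helper and record layout below are illustrative only.

```python
from itertools import groupby


def score_mbin(pairwise_results):
    """Hypothetical helper: credit a question only if the ground-truth option
    beats every distractor in its binary pairings, then report accuracy (%)."""
    pairwise_results = sorted(pairwise_results, key=lambda r: r["question_id"])
    per_question = []
    for _, group in groupby(pairwise_results, key=lambda r: r["question_id"]):
        per_question.append(all(r["correct"] for r in group))
    return 100.0 * sum(per_question) / len(per_question)


# Illustrative data: q1 wins all four pairings, q2 misses one of its pairings.
results = [
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q2", "correct": True},
    {"question_id": "q2", "correct": False},
]
print(score_mbin(results))  # 50.0 under this assumption
```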
## Run Evaluation

Run the following command to start the evaluation:

```shell
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```
This command evaluates the Qwen2.5-VL-7B-Instruct model on `VideoMathQA` for multi-binary accuracy. The other available `VideoMathQA` tasks are:

1. `videomathqa_mcq`
2. `videomathqa_mcq_w_subtitles`
3. `videomathqa_mcq_cot`
4. `videomathqa_mcq_cot_w_subtitles`
5. `videomathqa_mbin`
6. `videomathqa_mbin_w_subtitles`
7. `videomathqa_mbin_cot`
8. `videomathqa_mbin_cot_w_subtitles`

`w_subtitles` tasks additionally use subtitles during evaluation. `cot` tasks prompt the model to think step by step before answering the question.
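For example, a CoT variant with subtitles uses the same launcher with only the task name changed; the raw CoT samples written via `--log_samples` are what the post-processing step in the next section operates on.

```shell
# Same launcher as above, but with the CoT + subtitles variant of the multi-binary task
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin_cot_w_subtitles \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```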
## Post Processing

- For tasks with CoT prompting (`_cot`), model outputs typically contain both the reasoning and the final answer.
- To enable standardized scoring, we post-process the responses using Qwen-3-4B (in non-thinking mode) to extract only the final answer. This ensures format consistency and removes ambiguity in final answer extraction.

```shell
# Install vLLM
pip install vllm

# Run post-processing
python videomathqa/cot_postprocess.py --input_file <path/to/your/raw_cot_results.jsonl> --output_file <path/to/save/processed_results.jsonl>
```
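As a reference for what the post-processing consumes, the sketch below shows a logged sample with only the fields `cot_postprocess.py` actually reads (`doc.options`, `resps`, `target`); the concrete option texts and values are illustrative.

```python
import json

# Illustrative --log_samples record; only the fields read by cot_postprocess.py
# are shown, and every concrete value below is made up.
sample = {
    "doc": {"options": ["A. 12", "B. 15", "C. 18", "D. 21", "E. 24"]},
    "resps": [["The instructor first scales the triangle ... so the final answer is B."]],  # raw CoT output
    "target": "B",  # ground-truth letter; used to pick a deliberately wrong option if extraction fails
}

# The script overwrites resps[0][0] with the extracted letter (e.g. "B") and
# then re-scores the cleaned predictions with the VideoMathQA utilities.
print(json.dumps(sample, indent=2))
```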
## CoT Step Evaluation

We provide a [vLLM](https://github.com/vllm-project/vllm)-based script to run CoT step evaluation after inference. The self-contained script is available at [cot_step_evaluation.py](cot_step_evaluation.py).

```shell
# Install vLLM
pip install vllm

# Run CoT step evaluation
python videomathqa/cot_step_evaluation.py --gt_file <path/to/the/annotation/parquet_file> --res_file <path/to/the/results/file/generated/after/running/inference/using/lmms_eval>
```
lmms_eval/tasks/videomathqa/cot_postprocess.py

Lines changed: 137 additions & 0 deletions

@@ -0,0 +1,137 @@
import os
import re
import json
import random
import argparse

from tqdm import tqdm
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

from videomathqa.utils import (
    extract_characters_regex,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results,
)

# Instructions given to the Qwen model for extracting the final answer letter from a CoT response.
mcq_prompt = "Given the original multiple-choice options and a model-generated answer containing reasoning and a final answer, identify the option that best matches the final answer and return only the corresponding letter (A, B, C, D, or E)."
mbin_prompt = "Given the original binary options and a model-generated answer containing reasoning and a final answer, identify the option that best matches the final answer and return only the corresponding letter (A or B)."


def extract_choice_vllm(llm, sampling_params, tokenizer, model_prompt, mcq=True):
    """Ask the Qwen model to map a raw CoT response to a single option letter."""
    if mcq:
        prompt_type = mcq_prompt
    else:
        prompt_type = mbin_prompt
    chat_prompt = [
        {
            "role": "user",
            "content": f"""{prompt_type}:

Text:
{model_prompt}

Only return the letter A, B, C, D, or E. If none is found, return "None".""",
        }
    ]
    text = tokenizer.apply_chat_template(chat_prompt, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    output = llm.generate([text], sampling_params=sampling_params)
    reply = output[0].outputs[0].text.strip().upper()
    # Accept only a bare option letter: A-E for MCQ, A-B for multi-binary.
    if mcq:
        if re.fullmatch(r"[A-E]", reply):
            return reply
    else:
        if re.fullmatch(r"[A-B]", reply):
            return reply
    return None


def refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq=True):
    """Replace each raw CoT response with the extracted option letter and save the updated samples."""
    raw_samples = []
    with open(sample_jsonl, "r") as f:
        for line in f:
            raw_samples.append(json.loads(line))
    print(f"Loaded {len(raw_samples)} samples from {sample_jsonl}")

    updated_samples = []
    for sample in tqdm(raw_samples, desc="Postprocessing samples with Qwen"):
        options = sample["doc"]["options"]
        raw_pred = sample["resps"][0][0]
        input_text = f"The options are: {options}\n\n The model response is: {raw_pred}"
        try:
            choice = extract_choice_vllm(llm, sampling_params, tokenizer, input_text, mcq)
        except Exception:
            choice = None
        if choice is None:
            # If no letter could be extracted, assign a random incorrect option so the sample scores as wrong.
            answer = sample["target"]
            if mcq:
                letters = ["A", "B", "C", "D", "E"]
            else:
                letters = ["A", "B"]
            letters.remove(answer)
            random.shuffle(letters)
            choice = letters[0]
        sample["resps"][0][0] = choice
        updated_samples.append(sample)

    with open(output_jsonl, "w") as f:
        for sample in updated_samples:
            f.write(json.dumps(sample) + "\n")
    print(f"Saved {len(updated_samples)} updated samples to {output_jsonl}")
    return updated_samples


def postprocess_jsonl(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl):
    """Extract final answers for a CoT results file and recompute the VideoMathQA score."""
    if "mcq" in sample_jsonl:
        mcq = True
    elif "mbin" in sample_jsonl:
        mcq = False
    else:
        raise ValueError(f"Cannot infer task type from '{sample_jsonl}': expected 'mcq' or 'mbin' in the filename.")

    updated_samples = refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq)

    print("Computing score ...")
    processed = []
    for item in tqdm(updated_samples, desc="Computing scores..."):
        pred_raw = item["resps"][0][0] if isinstance(item["resps"][0], list) else item["resps"][0]
        pred_clean = extract_characters_regex(pred_raw)
        item["filtered_resps"] = [pred_clean]
        result = videomathqa_process_results(item["doc"], [pred_clean])
        processed.append(result["videomathqa_perception_score"])

    if mcq:
        final_score = videomathqa_mcq_aggregate_results(processed)
    else:
        final_score = videomathqa_multi_binary_aggregate_results(processed)
    print(f"Final Postprocessed VideoMathQA Score: {final_score:.2f}")
    print(f"Saved {len(updated_samples)} updated samples to {output_jsonl}")


def main():
    parser = argparse.ArgumentParser(description="Postprocess CoT predictions using the Qwen model.")
    parser.add_argument("--input_file", type=str, required=True, help="Path to the input JSONL file.")
    parser.add_argument("--output_file", type=str, required=True, help="Path to save the postprocessed output JSONL file.")
    parser.add_argument("--model_path", type=str, default="Qwen/Qwen3-4B", help="Path to the pretrained Qwen model (default: Qwen3-4B).")

    args = parser.parse_args()

    if not os.path.exists(args.input_file):
        print(f"Input file '{args.input_file}' does not exist.")
        return

    if os.path.exists(args.output_file):
        print(f"Output file '{args.output_file}' already exists. Skipping.")
        return

    print("Loading Qwen-3 model...")
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    llm = LLM(model=args.model_path)
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, min_p=0, max_tokens=16)

    print(f"Processing {args.input_file} ...")
    postprocess_jsonl(llm, sampling_params, tokenizer, args.input_file, args.output_file)
    print(f"Saved postprocessed output to {args.output_file}")


if __name__ == "__main__":
    main()
