
Commit d438332

Adds VideoMathQA - Task Designed to Evaluate Mathematical Reasoning in Real-World Educational Videos (#702)
* Adds VideoMathQA (https://mbzuai-oryx.github.io/VideoMathQA) task.
1 parent fd3c308 commit d438332

13 files changed: +1033 -0 lines changed

README.md

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@
 <details>
 <summary>We warmly welcome contributions from the open-source community! Below is a chronological list of recent tasks, models, and features added by our amazing contributors. </summary>

+- [2025-06] 🎉🎉 We welcome the new task [VideoMathQA](https://mbzuai-oryx.github.io/VideoMathQA), designed to evaluate mathematical reasoning in real-world educational videos.
 - [2024-10] 🎉🎉 We welcome the new task [NaturalBench](https://huggingface.co/datasets/BaiqiL/NaturalBench), a vision-centric VQA benchmark (NeurIPS'24) that challenges vision-language models with simple questions about natural imagery.
 - [2024-10] 🎉🎉 We welcome the new task [TemporalBench](https://huggingface.co/datasets/microsoft/TemporalBench) for fine-grained temporal understanding and reasoning for videos, which reveals a huge (>30%) human-AI gap.
 - [2024-10] 🎉🎉 We welcome the new tasks [VDC](https://rese1f.github.io/aurora-web/) for video detailed captioning, [MovieChat-1K](https://rese1f.github.io/MovieChat/) for long-form video understanding, and [Vinoground](https://vinoground.github.io/), a temporal counterfactual LMM benchmark composed of 1000 short natural video-caption pairs. We also welcome the new models: [AuroraCap](https://github.com/rese1f/aurora) and [MovieChat](https://github.com/rese1f/MovieChat).

lmms_eval/tasks/videomathqa/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

VideoMathQA is a benchmark designed to evaluate mathematical reasoning in real-world educational videos. It requires models to interpret and integrate information from three modalities (visuals, audio, and text) across time. The benchmark tackles the “needle-in-a-multimodal-haystack” problem, where key information is sparse and spread across different formats and moments in the video.

[![Website](https://img.shields.io/badge/🌐_Project-Website-87CEEB)](https://mbzuai-oryx.github.io/VideoMathQA)
[![Dataset](https://img.shields.io/badge/🤗_Dataset-Access-green)](https://huggingface.co/datasets/MBZUAI/VideoMathQA)
[![🏅 Leaderboard (Reasoning)](https://img.shields.io/badge/🏅_Leaderboard-Reasoning-red)](https://hanoonar.github.io/VideoMathQA/#leaderboard-2)
[![🏅 Leaderboard (Direct)](https://img.shields.io/badge/🏅_Leaderboard-Direct-yellow)](https://hanoonar.github.io/VideoMathQA/#leaderboard)
[![GitHub](https://img.shields.io/badge/📂_GitHub-VideoMathQA-green)](https://github.com/mbzuai-oryx/VideoMathQA)
## Evaluation Strategies

**VideoMathQA** supports the following **evaluation strategies** to comprehensively assess model performance:

1. **MCQ and Multi-Binary (MBin)**
   - Tasks with `mcq` use a 5-way multiple-choice format.
   - Tasks with `mbin` use a stricter binary-pairwise evaluation format (correct vs. each distractor); see the scoring sketch after this list.
   - Both formats are available *with* and *without subtitles*, indicated by `_w_subtitles` in the task name.

2. **Direct Answering vs. Chain-of-Thought (CoT)**
   - Each task can be evaluated under **Direct** or **CoT** prompting.
   - Tasks containing `_cot` use CoT prompting, where models generate reasoning before the final answer.
   - Direct answering tasks expect the final answer only, without intermediate reasoning.
   - CoT tasks require post-processing to extract the final answer (see [Post Processing](#post-processing)).
   - We maintain **separate leaderboards** for the Direct and CoT settings.

3. **Step-wise CoT Evaluation**
   - For CoT tasks, we additionally evaluate the quality of the generated reasoning.
   - Each response is scored by comparing it against annotated solution steps (typically 4–10 steps).
   - Scoring is done by a small open-source model (Qwen-3-4B in thinking mode), which returns a score (0–10) and a rationale.
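A minimal sketch of the MBin scoring idea, under the assumption that a question is credited only when the ground-truth option is chosen in every one of its binary pairings; the authoritative aggregation is `videomathqa_multi_binary_aggregate_results` in `videomathqa/utils.py`, and the `score_mbin` helper and record layout below are illustrative only.

```python
from itertools import groupby


def score_mbin(pairwise_results):
    """Hypothetical helper: credit a question only if the ground-truth option
    beats every distractor in its binary pairings, then report accuracy (%)."""
    pairwise_results = sorted(pairwise_results, key=lambda r: r["question_id"])
    per_question = []
    for _, group in groupby(pairwise_results, key=lambda r: r["question_id"]):
        per_question.append(all(r["correct"] for r in group))
    return 100.0 * sum(per_question) / len(per_question)


# Illustrative data: q1 wins all four pairings, q2 misses one of its pairings.
results = [
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q1", "correct": True},
    {"question_id": "q2", "correct": True},
    {"question_id": "q2", "correct": False},
]
print(score_mbin(results))  # 50.0 under this assumption
```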
## Run Evaluation

Run the following command to start the evaluation:

```shell
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```
This command evaluates the Qwen2.5-VL-7B-Instruct model on `VideoMathQA` for multi-binary accuracy. The other available `VideoMathQA` tasks are:

1. `videomathqa_mcq`
2. `videomathqa_mcq_w_subtitles`
3. `videomathqa_mcq_cot`
4. `videomathqa_mcq_cot_w_subtitles`
5. `videomathqa_mbin`
6. `videomathqa_mbin_w_subtitles`
7. `videomathqa_mbin_cot`
8. `videomathqa_mbin_cot_w_subtitles`

`w_subtitles` tasks additionally use subtitles during evaluation. `cot` tasks prompt the model to think step by step before answering the question.
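For example, a CoT variant with subtitles uses the same launcher with only the task name changed; the raw CoT samples written via `--log_samples` are what the post-processing step in the next section operates on.

```shell
# Same launcher as above, but with the CoT + subtitles variant of the multi-binary task
accelerate launch --num_processes=8 -m lmms_eval \
    --model qwen2_5_vl \
    --model_args=pretrained=Qwen/Qwen2.5-VL-7B-Instruct,max_pixels=151200,min_pixels=100352,use_flash_attention_2=True,device_map=auto \
    --tasks videomathqa_mbin_cot_w_subtitles \
    --batch_size 1 --log_samples --log_samples_suffix qwen_2_5_vl \
    --output_path output
```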
## Post Processing

- For tasks with CoT prompting (`_cot`), model outputs typically contain both the reasoning and the final answer.
- To enable standardized scoring, we post-process the responses using Qwen-3-4B (in non-thinking mode) to extract only the final answer. This ensures format consistency and removes ambiguity in final answer extraction.

```shell
# Install vLLM
pip install vllm

# Run post-processing
python videomathqa/cot_postprocess.py --input_file <path/to/your/raw_cot_results.jsonl> --output_file <path/to/save/processed_results.jsonl>
```
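As a reference for what the post-processing consumes, the sketch below shows a logged sample with only the fields `cot_postprocess.py` actually reads (`doc.options`, `resps`, `target`); the concrete option texts and values are illustrative.

```python
import json

# Illustrative --log_samples record; only the fields read by cot_postprocess.py
# are shown, and every concrete value below is made up.
sample = {
    "doc": {"options": ["A. 12", "B. 15", "C. 18", "D. 21", "E. 24"]},
    "resps": [["The instructor first scales the triangle ... so the final answer is B."]],  # raw CoT output
    "target": "B",  # ground-truth letter; used to pick a deliberately wrong option if extraction fails
}

# The script overwrites resps[0][0] with the extracted letter (e.g. "B") and
# then re-scores the cleaned predictions with the VideoMathQA utilities.
print(json.dumps(sample, indent=2))
```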
## CoT Step Evaluation

We provide a [vLLM](https://github.com/vllm-project/vllm)-based script to run CoT step evaluation after inference. The self-contained script is available at [cot_step_evaluation.py](cot_step_evaluation.py).

```shell
# Install vLLM
pip install vllm

# Run CoT step evaluation
python videomathqa/cot_step_evaluation.py --gt_file <path/to/the/annotation/parquet_file> --res_file <path/to/the/results/file/generated/after/running/inference/using/lmms_eval>
```
lmms_eval/tasks/videomathqa/cot_postprocess.py

Lines changed: 137 additions & 0 deletions

@@ -0,0 +1,137 @@
import os
import re
import json
import random
import argparse

from tqdm import tqdm
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

from videomathqa.utils import (
    extract_characters_regex,
    videomathqa_process_results,
    videomathqa_mcq_aggregate_results,
    videomathqa_multi_binary_aggregate_results,
)

# Instructions given to the Qwen model for extracting the final answer letter from a CoT response.
mcq_prompt = "Given the original multiple-choice options and a model-generated answer containing reasoning and a final answer, identify the option that best matches the final answer and return only the corresponding letter (A, B, C, D, or E)."
mbin_prompt = "Given the original binary options and a model-generated answer containing reasoning and a final answer, identify the option that best matches the final answer and return only the corresponding letter (A or B)."


def extract_choice_vllm(llm, sampling_params, tokenizer, model_prompt, mcq=True):
    """Ask the Qwen model to map a raw CoT response to a single option letter."""
    if mcq:
        prompt_type = mcq_prompt
    else:
        prompt_type = mbin_prompt
    chat_prompt = [
        {
            "role": "user",
            "content": f"""{prompt_type}:

Text:
{model_prompt}

Only return the letter A, B, C, D, or E. If none is found, return "None".""",
        }
    ]
    text = tokenizer.apply_chat_template(chat_prompt, tokenize=False, add_generation_prompt=True, enable_thinking=False)
    output = llm.generate([text], sampling_params=sampling_params)
    reply = output[0].outputs[0].text.strip().upper()
    # Accept only a bare option letter: A-E for MCQ, A-B for multi-binary.
    if mcq:
        if re.fullmatch(r"[A-E]", reply):
            return reply
    else:
        if re.fullmatch(r"[A-B]", reply):
            return reply
    return None


def refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq=True):
    """Replace each raw CoT response with the extracted option letter and save the updated samples."""
    raw_samples = []
    with open(sample_jsonl, "r") as f:
        for line in f:
            raw_samples.append(json.loads(line))
    print(f"Loaded {len(raw_samples)} samples from {sample_jsonl}")

    updated_samples = []
    for sample in tqdm(raw_samples, desc="Postprocessing samples with Qwen"):
        options = sample["doc"]["options"]
        raw_pred = sample["resps"][0][0]
        input_text = f"The options are: {options}\n\n The model response is: {raw_pred}"
        try:
            choice = extract_choice_vllm(llm, sampling_params, tokenizer, input_text, mcq)
        except Exception:
            choice = None
        if choice is None:
            # If no letter could be extracted, assign a random incorrect option so the sample scores as wrong.
            answer = sample["target"]
            if mcq:
                letters = ["A", "B", "C", "D", "E"]
            else:
                letters = ["A", "B"]
            letters.remove(answer)
            random.shuffle(letters)
            choice = letters[0]
        sample["resps"][0][0] = choice
        updated_samples.append(sample)

    with open(output_jsonl, "w") as f:
        for sample in updated_samples:
            f.write(json.dumps(sample) + "\n")
    print(f"Saved {len(updated_samples)} updated samples to {output_jsonl}")
    return updated_samples


def postprocess_jsonl(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl):
    """Extract final answers for a CoT results file and recompute the VideoMathQA score."""
    if "mcq" in sample_jsonl:
        mcq = True
    elif "mbin" in sample_jsonl:
        mcq = False
    else:
        raise ValueError(f"Cannot infer task type from '{sample_jsonl}': expected 'mcq' or 'mbin' in the filename.")

    updated_samples = refine_samples_vllm(llm, sampling_params, tokenizer, sample_jsonl, output_jsonl, mcq)

    print("Computing score ...")
    processed = []
    for item in tqdm(updated_samples, desc="Computing scores..."):
        pred_raw = item["resps"][0][0] if isinstance(item["resps"][0], list) else item["resps"][0]
        pred_clean = extract_characters_regex(pred_raw)
        item["filtered_resps"] = [pred_clean]
        result = videomathqa_process_results(item["doc"], [pred_clean])
        processed.append(result["videomathqa_perception_score"])

    if mcq:
        final_score = videomathqa_mcq_aggregate_results(processed)
    else:
        final_score = videomathqa_multi_binary_aggregate_results(processed)
    print(f"Final Postprocessed VideoMathQA Score: {final_score:.2f}")
    print(f"Saved {len(updated_samples)} updated samples to {output_jsonl}")


def main():
    parser = argparse.ArgumentParser(description="Postprocess CoT predictions using the Qwen model.")
    parser.add_argument("--input_file", type=str, required=True, help="Path to the input JSONL file.")
    parser.add_argument("--output_file", type=str, required=True, help="Path to save the postprocessed output JSONL file.")
    parser.add_argument("--model_path", type=str, default="Qwen/Qwen3-4B", help="Path to the pretrained Qwen model (default: Qwen3-4B).")

    args = parser.parse_args()

    if not os.path.exists(args.input_file):
        print(f"Input file '{args.input_file}' does not exist.")
        return

    if os.path.exists(args.output_file):
        print(f"Output file '{args.output_file}' already exists. Skipping.")
        return

    print("Loading Qwen-3 model...")
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
    llm = LLM(model=args.model_path)
    sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, min_p=0, max_tokens=16)

    print(f"Processing {args.input_file} ...")
    postprocess_jsonl(llm, sampling_params, tokenizer, args.input_file, args.output_file)
    print(f"Saved postprocessed output to {args.output_file}")


if __name__ == "__main__":
    main()
