
Commit 6830fca

michaelharrisonmai and Michael Harrison authored
use llm judge for math-v as with vista and verse (#163)
Update Math-V to use an LLM judge, since the previous answer extraction was not reliable. Uses the same few-shot extraction/scoring templates as MathVerse and MathVista. Ran on several models already, with scores roughly matching published scores on the MathV leaderboard.

Co-authored-by: Michael Harrison <[email protected]>
1 parent 46a3f0d commit 6830fca

File tree

3 files changed: +131 -19 lines changed

eureka_ml_insights/prompt_templates/mathvision_templates/answer_extraction_prompt.jinja
eureka_ml_insights/prompt_templates/mathvision_templates/scoring_prompt.jinja
eureka_ml_insights/user_configs/mathvision.py
eureka_ml_insights/prompt_templates/mathvision_templates/answer_extraction_prompt.jinja

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+I am providing you a response from a model to a math problem, termed 'Model Response'. You should extract the answer from the response. Directly output the extracted answer with no explanation.
+
+1.
+Model response: 'Rounded to two decimal places, the perimeter of the sector is approximately:\n\n(-2, 1)'
+Extracted Answer: (-2, 1)
+
+2.
+Model response: 'at those points.\n\nTherefore, the correct option that represents the meaning of the intersection points of the graphs is:\n\nD. They give the solutions to the equation $f(t)=g(t)$.",'
+Extracted Answer: D
+
+3.
+Model response: ' at 1 (there's a closed circle at y = 1), the range in interval notation is \\((-4, 1]\\).\n\nFinal values:\nDomain: \\((-3, 3]\\)\nRange: \\((-4, 1]\\)'
+Extracted Answer: Domain: \\((-3, 3]\\)\nRange: \\((-4, 1]\\)
+
+4.
+Model response: 'As it stands, I cannot provide the correct option letter because there isn't enough information to solve for 'y'.'
+Extracted Answer: null
+
+5.
+Model response: 'Given that AB = 17.6 meters, we can now substitute into the equation:\n\nd = 17.6 / cos(38\u00b0)\n\nTherefore, to one decimal place, the distance d between Ned and Bart is approximately 22.3 meters.'
+Extracted answer: 22.3
+
+6.
+Model response: have all the coefficients for the quadratic function:\n\\( f(x) = ax^2 + bx + c \\)\n\\( f(x) = -1x^2 - 2x + 1 \\)\n\nTherefore, the equation for the graphed function \\( f \\) is:\n\\( f(x) = -x^2 - 2x + 1 \\)"'
+Extracted answer: f(x) = -x^2 - 2x + 1
+
+7.
+Model response: {{ response }}
+Extracted Answer:
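
The template ends with the Jinja placeholder {{ response }}, which the pipeline's first PromptProcessing round fills with each raw model output. A minimal sketch of that substitution, using jinja2 directly (the standalone rendering is an assumption for illustration; in the pipeline, the PromptProcessing component performs this step):

from jinja2 import Template

# Render the few-shot extraction prompt for one record. In the pipeline,
# "response" is the raw model output, renamed from "model_output" upstream.
TEMPLATE_PATH = "eureka_ml_insights/prompt_templates/mathvision_templates/answer_extraction_prompt.jinja"

with open(TEMPLATE_PATH) as f:
    template = Template(f.read())

prompt = template.render(response="Therefore, the distance d is approximately 22.3 meters.")
print(prompt)  # the few-shot examples above, ending with this response as item 7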
eureka_ml_insights/prompt_templates/mathvision_templates/scoring_prompt.jinja

Lines changed: 33 additions & 0 deletions

@@ -0,0 +1,33 @@
+Below are two answers to a math question. Question is [Question], [Standard Answer] is the standard answer to the question, and [Model_answer] is the answer extracted from a model's output to this question. Determine whether these two answers are consistent.
+Please note that only when the [Model_answer] completely matches the [Standard Answer] means they are consistent. For multiple choice questions, consider that the model answer may contain the letter representing the answer value. For non-multiple-choice questions, if the meaning is expressed in the same way, it is also considered consistent, for example, 0.5m and 50cm.
+If they are consistent, Judgement is 1; if they are different, Judgement is 0.
+
+[Question]: Write the set of numbers represented on the number line in interval notation.
+[Standard Answer]: (-2,1]
+[Model_answer] : Extracted Answer: \\((-2, 1)\\)
+Judgement: 0
+
+[Question]: As shown in the figure, circle O has a radius 1.0, if angle BAC = 60.0, then the length of BC is ()\nChoices:\nA:2\nB:2\u221a{{3}}\nC:\u221a{{3}}\nD:2\u221a{{2}}
+[Standard Answer]: C
+[Model_answer] : B:2\u221a{{3}}
+Judgement: 0
+
+[Question]: Find the domain and range of the function f using interval notation.
+[Standard Answer]: domain: [-4, 0) and range: (-3, 1]
+[Model_answer] : Range: \\((-4, 1]\\)
+Judgement: 0
+
+[Question]: As shown in the figure, circle O has a radius 1.0, if angle BAC = 60.0, then the length of BC is ()\nChoices:\nA:2\nB:2\u221a{{3}}\nC:\u221a{{3}}\nD:2\u221a{{2}}
+[Standard Answer]: C
+[Model_answer] : null
+Judgement: 0
+
+[Question]: Given the graph of the line that intersects with x-axis at -3 and with y-axis at 4, determine its equation. A. y = \\frac{{4}}{{3}}x + 4 B. Cannot determine.\n
+[Standard Answer]: A
+[Model_answer] : y = \\frac{{4}}{{3}}x + 4
+Judgement: 1
+
+[Question]: {{original_question}}
+[Standard Answer]: {{answer}}
+[Model_answer] : {{extraction}}
+Judgement:
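
The scoring template takes three variables: original_question (the renamed prompt), answer (the gold answer), and extraction (the answer pulled out in round 1), and asks the judge for a bare 0/1 Judgement. A minimal sketch of rendering it, again assuming direct jinja2 use rather than the pipeline's PromptProcessing component:

from jinja2 import Template

TEMPLATE_PATH = "eureka_ml_insights/prompt_templates/mathvision_templates/scoring_prompt.jinja"

with open(TEMPLATE_PATH) as f:
    scoring_template = Template(f.read())

# Fill in the few-shot scoring prompt for one record; the judge's reply
# ("0" or "1") is later renamed to "score" and averaged by the aggregators.
prompt = scoring_template.render(
    original_question="As shown in the figure, circle O has a radius 1.0, ...",
    answer="C",
    extraction="C",
)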

eureka_ml_insights/user_configs/mathvision.py

Lines changed: 69 additions & 19 deletions
@@ -5,13 +5,12 @@
 from typing import Any
 
 from eureka_ml_insights.core import (
-    DataProcessing,
     EvalReporting,
     Inference,
     PromptProcessing
 )
 from eureka_ml_insights.data_utils import (
-    AddColumn,
+    ColumnRename,
     CopyColumn,
     DataReader,
     HFDataReader,
@@ -23,7 +22,6 @@
 
 from eureka_ml_insights.configs import(
     AggregatorConfig,
-    DataProcessingConfig,
     DataSetConfig,
     EvalReportingConfig,
     InferenceConfig,
@@ -32,10 +30,11 @@
     PromptProcessingConfig,
 )
 
-from eureka_ml_insights.data_utils.mathvision_utils import MathVisionOutputEvaluator
 from eureka_ml_insights.metrics.reports import AverageAggregator
 from eureka_ml_insights.configs import ExperimentConfig
 
+from eureka_ml_insights.configs.model_configs import OAI_GPT4_1106_PREVIEW_CONFIG as PERSONAL_GPT4O
+
 class MATHVISION_PIPELINE(ExperimentConfig):
     def configure_pipeline(
         self, model_config: ModelConfig, resume_from: str = None, **kwargs: dict[str, Any]
@@ -56,7 +55,6 @@ def configure_pipeline(
                         columns='options_string',
                         mapping=lambda x: "" if len(x)==0 else ("\n[Options]:\n" + '\n'.join([chr(ord('A') + i) + ". " + opt for i, opt in enumerate(x)]))
                     ),
-                    #SamplerTransform(sample_count=2, random_seed=1234),
                 ]
             ),
         },
@@ -82,32 +80,73 @@ def configure_pipeline(
             resume_from=resume_from,
         )
 
-        # post process the response to extract the answer
-        self.data_post_processing = DataProcessingConfig(
-            component_type=DataProcessing,
+        # Eval data pre processing component round 1 (answer extraction).
+        self.eval_data_pre_processing = PromptProcessingConfig(
+            component_type=PromptProcessing,
             data_reader_config=DataSetConfig(
                 DataReader,
                 {
                     "path": os.path.join(self.inference_comp.output_dir, "inference_result.jsonl"),
                     "format": ".jsonl",
-                    "transform": SequenceTransform(
-                        [
-                            AddColumn("score"),
-                            MathVisionOutputEvaluator(score_column_name="score"),
-                        ]
-                    ),
+                    "transform": SequenceTransform([
+                        ColumnRename(name_mapping={"model_output": "response"}),
+                        ColumnRename(name_mapping={"prompt": "original_question"})
+                    ]),
                 },
             ),
-            output_dir=os.path.join(self.log_dir, "data_post_processing_output"),
+            prompt_template_path=os.path.join(
+                os.path.dirname(__file__), "../prompt_templates/mathvision_templates/answer_extraction_prompt.jinja"
+            ),
+            output_dir=os.path.join(self.log_dir, "eval_data_pre_processing_output"),
+        )
+
+        # Eval Inference component round 1 (answer extraction).
+        self.eval_inference_comp = InferenceConfig(
+            component_type=Inference,
+            model_config=PERSONAL_GPT4O,
+            data_loader_config=DataSetConfig(
+                MMDataLoader,
+                {"path": os.path.join(self.eval_data_pre_processing.output_dir, "transformed_data.jsonl"), "load_images":False},
+            ),
+            output_dir=os.path.join(self.log_dir, "eval_inference_result"),
+        )
+
+        # Eval data pre processing component round 2 (LLM scoring).
+        self.eval_data_pre_processing_two = PromptProcessingConfig(
+            component_type=PromptProcessing,
+            data_reader_config=DataSetConfig(
+                DataReader,
+                {
+                    "path": os.path.join(self.eval_inference_comp.output_dir, "inference_result.jsonl"),
+                    "format": ".jsonl",
+                    "transform": SequenceTransform([ColumnRename(name_mapping={"model_output": "extraction"})]),
+                },
+            ),
+            prompt_template_path=os.path.join(
+                os.path.dirname(__file__), "../prompt_templates/mathvision_templates/scoring_prompt.jinja"
+            ),
+            output_dir=os.path.join(self.log_dir, "eval_data_pre_processing_output_two"),
+        )
+
+        # Eval Inference component round 2 (LLM scoring).
+        self.eval_inference_comp_two = InferenceConfig(
+            component_type=Inference,
+            model_config=PERSONAL_GPT4O,
+            data_loader_config=DataSetConfig(
+                MMDataLoader,
+                {"path": os.path.join(self.eval_data_pre_processing_two.output_dir, "transformed_data.jsonl"), "load_images":False},
+            ),
+            output_dir=os.path.join(self.log_dir, "eval_inference_result_two"),
         )
 
         self.evalreporting_comp = EvalReportingConfig(
             component_type=EvalReporting,
             data_reader_config=DataSetConfig(
                 DataReader,
                 {
-                    "path": os.path.join(self.data_post_processing.output_dir, "transformed_data.jsonl"),
+                    "path": os.path.join(self.eval_inference_comp_two.output_dir, "inference_result.jsonl"),
                     "format": ".jsonl",
+                    "transform": ColumnRename(name_mapping={"model_output": "score"}),
                 },
             ),
             aggregator_configs=[
@@ -122,8 +161,16 @@ def configure_pipeline(
                     AverageAggregator,
                     {
                         "column_names": ["score"],
-                        "filename_base": "MathVision_Score_By_Type",
-                        "group_by": ["level", "subject"],
+                        "filename_base": "MathVision_Score_By_Subect",
+                        "group_by": ["subject"],
+                    },
+                ),
+                AggregatorConfig(
+                    AverageAggregator,
+                    {
+                        "column_names": ["score"],
+                        "filename_base": "MathVision_Score_By_SubectLevel",
+                        "group_by": ["subject", "level"],
                     },
                 ),
             ],
@@ -134,7 +181,10 @@ def configure_pipeline(
             [
                 self.data_processing_comp,
                 self.inference_comp,
-                self.data_post_processing,
+                self.eval_data_pre_processing,
+                self.eval_inference_comp,
+                self.eval_data_pre_processing_two,
+                self.eval_inference_comp_two,
                 self.evalreporting_comp,
             ],
             self.log_dir,
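
Taken together, the four new components implement a two-round judge: round 1 extracts an answer from the raw response, round 2 scores it against the gold answer. A hypothetical, self-contained sketch of that flow (call_judge stands in for the judge-model call configured as PERSONAL_GPT4O; the helper names and direct template paths are illustrative, not the repo's API):

from jinja2 import Template

def render(path, **variables):
    # Fill a prompt template with the given variables.
    with open(path) as f:
        return Template(f.read()).render(**variables)

def judge_one(question, gold_answer, model_response, call_judge):
    # Round 1: extract the final answer from the raw model response.
    extraction = call_judge(render("answer_extraction_prompt.jinja", response=model_response))
    # Round 2: ask for a 0/1 consistency judgement against the gold answer.
    judgement = call_judge(render(
        "scoring_prompt.jinja",
        original_question=question,
        answer=gold_answer,
        extraction=extraction,
    ))
    # Downstream, these 0/1 scores are averaged overall, by subject, and by
    # subject+level via the AverageAggregator configs above.
    return float(judgement.strip())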

0 commit comments