ARC-AGI-v1 Benchmark Pipeline #169
Conversation
| AzureOpenAIModel,
| {
| "model_name": "gpt-4o",
| "url": "https://eurekaevals.openai.azure.com/",
Let's replace this with a placeholder URL.
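As a sketch, the endpoint could come from an environment variable with a generic placeholder as the fallback (the variable name here is an assumption, not an established convention in this repo):

```python
import os

# Hypothetical config fragment: avoid hard-coding the team endpoint.
model_config = {
    "model_name": "gpt-4o",
    # AZURE_OPENAI_ENDPOINT is an assumed variable name; the fallback is a placeholder.
    "url": os.environ.get(
        "AZURE_OPENAI_ENDPOINT", "https://<your-resource>.openai.azure.com/"
    ),
}
```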
| @dataclass
| class ARCAGI_CleanCOTAnswer(DFTransformBase): |
Can we put this in the general transforms so others can reuse it for other benchmarks if they need to? Cleaning COTs is not necessarily ARC-AGI specific. Also, here is another way we did it for other benchmarks, if useful:
| self.evalreporting_comp.data_reader_config.init_args["transform"].transforms.append(
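For illustration, a framework-agnostic sketch of what a shared COT cleaner could look like (class and field names here are hypothetical, not the actual repo API):

```python
from dataclasses import dataclass


@dataclass
class CleanCOTAnswer:
    """Sketch of a generic chain-of-thought cleaner: keeps only the text
    after the closing reasoning tag, so it is not tied to one benchmark."""

    think_tag: str = "think"  # hypothetical argument name

    def clean(self, text: str) -> str:
        closing = f"</{self.think_tag}>"
        if closing in text:
            # Drop the reasoning block, keep only the final answer.
            return text.split(closing, 1)[1].strip()
        # Conventional models with no thinking block pass through unchanged.
        return text.strip()
```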
| With that in mind, do your best to solve the question below.
| {{ prompt }}
So the prompt itself does not ask the model to format the answer in `<output>` tags; however, the answer extraction sort of assumes this. Is this because reasoning models are expected to produce this format? What if the model uses other tags, or no tags at all (e.g., a conventional model with no thinking block)? For other reasoning tasks, we ask the model to clearly mark the final answer according to some format, for example:
Final Answer:
and then extract what comes after that.
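A hedged sketch of that extraction order (explicit marker first, tags as a fallback, then the raw response; the exact marker and tag names are assumptions):

```python
import re


def extract_final_answer(response: str) -> str:
    """Prefer an explicit 'Final Answer:' marker, fall back to <output> tags,
    and finally return the stripped response so untagged outputs still parse."""
    m = re.search(r"Final Answer:\s*(.+)", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    m = re.search(r"<output>(.*?)</output>", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    return response.strip()
```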
| max_concurrent=1,
| )
| if resume_logdir: |
Is this to overwrite results in the same directory? If so, this is somewhat unexpected behavior, because the other pipelines do not do this by default.
| "path": os.path.join(self.inference_comp.output_dir, "inference_result.jsonl"),
| "format": ".jsonl",
| "transform": SequenceTransform(
| []
Was there supposed to be a transformation here? If not, maybe remove the whole component altogether?
I read your comment about why you have an empty post-processing step here, and would like to point out that this is exactly why we have another data reader in the eval reporting component; that would be where these potential transforms go.
| return pipeline
| class COT_ARC_AGI_v1_PIPELINE_5Run(ARC_AGI_v1_PIPELINE): |
Should this inherit from COT_ARC_AGI_v1_PIPELINE instead?
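To make the suggestion concrete, a minimal sketch (class bodies are stand-ins; only the inheritance relationship is the point):

```python
# Stand-in classes: the attributes are illustrative, not the real configs.
class ARC_AGI_v1_PIPELINE:
    use_cot_prompt = False
    n_repeats = 1


class COT_ARC_AGI_v1_PIPELINE(ARC_AGI_v1_PIPELINE):
    use_cot_prompt = True


# Inheriting from the COT pipeline (not the base) keeps the COT prompt setup
# and only overrides the repeat count.
class COT_ARC_AGI_v1_PIPELINE_5Run(COT_ARC_AGI_v1_PIPELINE):
    n_repeats = 5
```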
| class ARC_AGI_v1_PIPELINE_5Run(ARC_AGI_v1_PIPELINE):
| """This class specifies the config for running the GPQA benchmark 5 repeated times"""
Stale docstring; maybe search the whole file for "gpqa".
| "path": os.path.join(self.inference_comp.output_dir, "inference_result.jsonl"),
| "format": ".jsonl",
| "transform": SequenceTransform(
| []
Empty transform here as well.
| data_reader_config=DataSetConfig(
| HFDataReader,
| {
| "path": "pxferna/ARC-AGI-v1-5050",
Is this pipeline the same as the previous one, except that the HF file changes? If so, it is possible to just inherit from the previous one and only change the path of the HF data reader. I did notice, however, that this one also has majority vote and worst-of-n, while the other one does not.
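A sketch of the subclass-and-override pattern (class names, the base dataset path, and the method here are illustrative stand-ins, not the actual pipeline API):

```python
class BasePipeline:
    """Stand-in for the previous pipeline; this base path is a placeholder."""

    hf_path = "org/original-dataset"

    def data_reader_args(self) -> dict:
        # Everything else (format, transforms, ...) stays shared here.
        return {"path": self.hf_path, "format": ".jsonl"}


class FiftyFiftyPipeline(BasePipeline):
    # Only the HF dataset path changes; the rest is inherited.
    hf_path = "pxferna/ARC-AGI-v1-5050"
```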
| @@ -0,0 +1,347 @@
| import os
Please add a pipeline test for this new pipeline under the tests folder.
| Parameters:
| response (str): Input string containing answer X in the form of "<output>final answer string</output>".
| Returns:
| answer (str): The final answer string with leading and training spaces stripped.
Typo: training -> trailing.
| """
| model_output_column: str
| model_answer_column: str |
Could you please make the think tag an argument, since this is a general transform now?
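For instance, the tag could become a dataclass field with a default (field and method names here are hypothetical):

```python
import re
from dataclasses import dataclass


@dataclass
class CleanCOTAnswerTransform:
    """Sketch: the reasoning tag is a parameter, so other benchmarks can
    pass e.g. "reasoning" instead of the default "think"."""

    model_output_column: str
    model_answer_column: str
    think_tag: str = "think"  # hypothetical argument name

    def clean(self, text: str) -> str:
        # Remove the whole <tag>...</tag> block and strip whitespace.
        pattern = rf"<{self.think_tag}>.*?</{self.think_tag}>"
        return re.sub(pattern, "", text, flags=re.DOTALL).strip()
```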
| ColumnRename,
| CopyColumn,
| ExtractUsageTransform,
| MajorityVoteTransform,
It looks like there are unused imports here. Please run our formatting commands to clean up the code; see "How to contribute" in the README. TL;DR: make format-inplace
Force-pushed from a5fcc6f to 4198c42
No description provided.