ARC-AGI-v1 Benchmark Pipeline #169
Changes from all commits: caa4570, c6e6efd, a8f0e4b, 4c7aa34, 6e49d40, c347ff3, 97301d3, 4198c42
New file (@@ -0,0 +1,69 @@):

```python
import re
from dataclasses import dataclass

import pandas as pd

from .transform import DFTransformBase


@dataclass
class ARCAGI_ExtractAnswer(DFTransformBase):
    model_output_column: str
    model_answer_column: str

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.model_answer_column] = df[self.model_output_column].apply(self.parse_output_answer)
        return df

    @staticmethod
    def parse_output_answer(response):
        """
        Parse the input string to extract the answer of a given ARC-AGI question.
        Parameters:
            response (str): Input string containing the answer in the form "<output>final answer string</output>".
        Returns:
            answer (str): The final answer string with leading and training spaces stripped.
        """
        if response is None:
            return ""
        elif response.find("<output>") == -1 or response.find("</output>") == -1:
            return ""

        start_index = response.find("<output>") + len("<output>")
        end_index = response.find("</output>")

        answer = response[start_index:end_index].strip()

        return answer


@dataclass
class ARCAGI_CleanCOTAnswer(DFTransformBase):
    model_output_column: str
    model_answer_column: str

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.model_answer_column] = df[self.model_output_column].apply(self.parse_output_answer)
        return df

    @staticmethod
    def parse_output_answer(response):
        """
        Strip the chain-of-thought block and replace None responses with an empty string.
        Parameters:
            response (str): Possibly None response string.
        Returns:
            answer (str): Response text following the "</think>" tag, the full response
                if no tag is present, or "" if the response is None.
        """
        if response is None:
            return ""

        tag_index = response.find("</think>")
        if tag_index == -1:
            return response

        return response[tag_index + len("</think>"):]
```

Review comment (on the `parse_output_answer` docstring): training -> trailing

Review comment (on `ARCAGI_CleanCOTAnswer`): Can we put this in the general transforms so others can use the same if they need to for other benchmarks? Cleaning COTs is not necessarily ARC-AGI specific. Also, here is another way we did it for other benchmarks, if useful.
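The tag-based extraction can be exercised on its own. The sketch below inlines the same parsing logic as `ARCAGI_ExtractAnswer.parse_output_answer` (the real class depends on the repo's `DFTransformBase`, so the class wrapper is omitted here; the column names are illustrative):

```python
import pandas as pd

def parse_output_answer(response):
    # Return "" for missing responses or missing <output>...</output> tags,
    # otherwise the stripped text between the first pair of tags.
    if response is None:
        return ""
    if response.find("<output>") == -1 or response.find("</output>") == -1:
        return ""
    start = response.find("<output>") + len("<output>")
    end = response.find("</output>")
    return response[start:end].strip()

df = pd.DataFrame({"model_output": [
    "Reasoning... <output> 1 2\n3 4 </output>",
    "no tags at all",
    None,
]})
df["model_answer"] = df["model_output"].apply(parse_output_answer)
print(df["model_answer"].tolist())  # → ['1 2\n3 4', '', '']
```

Note that malformed or unclosed tags simply yield the empty string, which is counted as an incorrect answer downstream.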
New file (@@ -0,0 +1,3 @@):

```
You are an intelligent assistant who is very good at answering test questions accurately.

{{ prompt }}
```
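The `{{ prompt }}` placeholder suggests Jinja-style substitution, though the pipeline's actual templating engine is not shown in this diff. A minimal stand-in using plain string replacement illustrates how the question text would be spliced into the template:

```python
# Minimal stand-in for the pipeline's templating step; the real engine is
# likely Jinja2, and plain string replacement is used here only for illustration.
TEMPLATE_TEXT = (
    "You are an intelligent assistant who is very good at answering "
    "test questions accurately.\n\n{{ prompt }}"
)

def render(template: str, prompt: str) -> str:
    # Substitute the question text for the placeholder.
    return template.replace("{{ prompt }}", prompt)

print(render(TEMPLATE_TEXT, "What is 2 + 2?"))
```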
New file (@@ -0,0 +1,21 @@):

```
You are an intelligent assistant who is very good at answering test questions accurately.
In the examples that follow you will be shown grids of numbers.
The numbers in the grids range from 0 through 9.
Each grid can be rendered as a grid of squares.
Each square in the grid is rendered as a colored square where the color of the square is derived from the number.
The colors are decided as follows:

0 - black
1 - blue
2 - red
3 - green
4 - yellow
5 - grey
6 - magenta
7 - brown
8 - cyan
9 - maroon

With that in mind, do your best to solve the question below.

{{ prompt }}
```

Review comment (on the prompt): So the prompt itself does not ask the model to format the answer in `<output>` tags; however, the answer extraction assumes this format. Is this because reasoning models are expected to produce it? What if the model uses other tags, or no tags at all (e.g. if it is a conventional model with no thinking block)? For other reasoning tasks, we ask the model to clearly mark the final answer according to some format, for example "Final Answer: ...", and then extract what comes after that.

Review comment: let's replace this with a placeholder url
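The reviewer's suggestion of a more forgiving extractor could look roughly like the sketch below: prefer the `<output>` tags the current code expects, then fall back to a `Final Answer:` marker. Both the function name and the fallback marker are assumptions for illustration, not part of the PR:

```python
import re

def extract_final_answer(response):
    # Hypothetical extractor per the review suggestion: try <output>...</output>
    # first, fall back to text after the last "Final Answer:" marker, and
    # return "" when neither convention is present.
    if response is None:
        return ""
    match = re.search(r"<output>(.*?)</output>", response, flags=re.DOTALL)
    if match:
        return match.group(1).strip()
    marker = "Final Answer:"
    idx = response.rfind(marker)
    if idx != -1:
        return response[idx + len(marker):].strip()
    return ""

print(extract_final_answer("thoughts...\nFinal Answer: 3 1\n4 1"))
```

Using `rfind` takes the last occurrence of the marker, which helps when a chain-of-thought model restates "Final Answer:" while reasoning before committing to one at the end.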