ARC-AGI-v1 Benchmark Pipeline #169
Conversation
| AzureOpenAIModel,
| {
| "model_name": "gpt-4o",
| "url": "https://eurekaevals.openai.azure.com/",
Let's replace this with a placeholder URL.
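As a sketch, the endpoint could come from an environment variable with a generic placeholder as the fallback (the variable name here is an assumption, not an established convention in this repo):

```python
import os

# Hypothetical config fragment: avoid hard-coding the team endpoint.
model_config = {
    "model_name": "gpt-4o",
    # AZURE_OPENAI_ENDPOINT is an assumed variable name; the fallback is a placeholder.
    "url": os.environ.get(
        "AZURE_OPENAI_ENDPOINT", "https://<your-resource>.openai.azure.com/"
    ),
}
```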
| @dataclass
| class ARCAGI_CleanCOTAnswer(DFTransformBase): |
Can we put this in the general transforms so others can reuse it for other benchmarks if they need to? Cleaning COTs is not necessarily ARC-AGI specific. Also, here is another way we did it for other benchmarks, if useful:
| self.evalreporting_comp.data_reader_config.init_args["transform"].transforms.append(
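For illustration, a framework-agnostic sketch of what a shared COT cleaner could look like (class and field names here are hypothetical, not the actual repo API):

```python
from dataclasses import dataclass


@dataclass
class CleanCOTAnswer:
    """Sketch of a generic chain-of-thought cleaner: keeps only the text
    after the closing reasoning tag, so it is not tied to one benchmark."""

    think_tag: str = "think"  # hypothetical argument name

    def clean(self, text: str) -> str:
        closing = f"</{self.think_tag}>"
        if closing in text:
            # Drop the reasoning block, keep only the final answer.
            return text.split(closing, 1)[1].strip()
        # Conventional models with no thinking block pass through unchanged.
        return text.strip()
```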
| With that in mind, do your best to solve the question below.
| {{ prompt }}
So the prompt itself does not ask the model to format the answer in `<output>` tags; however, the answer extraction sort of assumes this. Is this because reasoning models are expected to produce this format? What if the model uses other tags, or no tags at all (e.g., a conventional model with no thinking block)? For other reasoning tasks, we ask the model to clearly mark the final answer according to some format, for example:
Final Answer:
and then extract what comes after that.
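A hedged sketch of that extraction order (explicit marker first, tags as a fallback, then the raw response; the exact marker and tag names are assumptions):

```python
import re


def extract_final_answer(response: str) -> str:
    """Prefer an explicit 'Final Answer:' marker, fall back to <output> tags,
    and finally return the stripped response so untagged outputs still parse."""
    m = re.search(r"Final Answer:\s*(.+)", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    m = re.search(r"<output>(.*?)</output>", response, re.DOTALL)
    if m:
        return m.group(1).strip()
    return response.strip()
```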
| max_concurrent=1,
| )
| if resume_logdir: |
Is this to overwrite results in the same directory? If so, this is somewhat unexpected behavior, because the other pipelines do not do this by default.
| "path": os.path.join(self.inference_comp.output_dir, "inference_result.jsonl"),
| "format": ".jsonl",
| "transform": SequenceTransform(
| []
Was there supposed to be a transformation here? If not, maybe remove the whole component altogether?
I read your comment about why you have an empty post-processing step here, and would like to point out that this is exactly why we have another data reader in the eval reporting component; that would be where these potential transforms go.
| return pipeline
| class COT_ARC_AGI_v1_PIPELINE_5Run(ARC_AGI_v1_PIPELINE): |
Should this inherit from COT_ARC_AGI_v1_PIPELINE instead?
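To make the suggestion concrete, a minimal sketch (class bodies are stand-ins; only the inheritance relationship is the point):

```python
# Stand-in classes: the attributes are illustrative, not the real configs.
class ARC_AGI_v1_PIPELINE:
    use_cot_prompt = False
    n_repeats = 1


class COT_ARC_AGI_v1_PIPELINE(ARC_AGI_v1_PIPELINE):
    use_cot_prompt = True


# Inheriting from the COT pipeline (not the base) keeps the COT prompt setup
# and only overrides the repeat count.
class COT_ARC_AGI_v1_PIPELINE_5Run(COT_ARC_AGI_v1_PIPELINE):
    n_repeats = 5
```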
| class ARC_AGI_v1_PIPELINE_5Run(ARC_AGI_v1_PIPELINE):
| """This class specifies the config for running the GPQA benchmark 5 repeated times"""
Stale docstring; maybe search the whole file for "gpqa".
| "path": os.path.join(self.inference_comp.output_dir, "inference_result.jsonl"),
| "format": ".jsonl",
| "transform": SequenceTransform(
| []
Empty transform here as well.
| data_reader_config=DataSetConfig(
| HFDataReader,
| {
| "path": "pxferna/ARC-AGI-v1-5050",
Is this pipeline the same as the previous one, except that the HF file changes? If so, it is possible to just inherit from the previous one and only change the path of the HF data reader. I did notice, however, that this one also has majority vote and worst-of-n, while the other one does not.
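A sketch of the subclass-and-override pattern (class names, the base dataset path, and the method here are illustrative stand-ins, not the actual pipeline API):

```python
class BasePipeline:
    """Stand-in for the previous pipeline; this base path is a placeholder."""

    hf_path = "org/original-dataset"

    def data_reader_args(self) -> dict:
        # Everything else (format, transforms, ...) stays shared here.
        return {"path": self.hf_path, "format": ".jsonl"}


class FiftyFiftyPipeline(BasePipeline):
    # Only the HF dataset path changes; the rest is inherited.
    hf_path = "pxferna/ARC-AGI-v1-5050"
```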
| @@ -0,0 +1,347 @@
| import os
Please add a pipeline test for this new pipeline under the tests folder.
| Parameters:
| response (str): Input string containing answer X in the form of "<output>final answer string</output>".
| Returns:
| answer (str): The final answer string with leading and training spaces stripped.
Typo: training -> trailing.
| """
| model_output_column: str
| model_answer_column: str |
Could you please make the think tag an argument, since this is a general transform now?
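For instance, the tag could become a dataclass field with a default (field and method names here are hypothetical):

```python
import re
from dataclasses import dataclass


@dataclass
class CleanCOTAnswerTransform:
    """Sketch: the reasoning tag is a parameter, so other benchmarks can
    pass e.g. "reasoning" instead of the default "think"."""

    model_output_column: str
    model_answer_column: str
    think_tag: str = "think"  # hypothetical argument name

    def clean(self, text: str) -> str:
        # Remove the whole <tag>...</tag> block and strip whitespace.
        pattern = rf"<{self.think_tag}>.*?</{self.think_tag}>"
        return re.sub(pattern, "", text, flags=re.DOTALL).strip()
```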
| ColumnRename,
| CopyColumn,
| ExtractUsageTransform,
| MajorityVoteTransform,
It looks like there are unused imports here. Please run our formatting commands to clean up the code; see "How to contribute" in the README. TL;DR: make format-inplace
Force-pushed from a5fcc6f to 4198c42
No description provided.