
Add submission metadata field for original result path#57

Open
ca16 wants to merge 2 commits into main from chloea-add-submission-metadata-field-for-original-result-path

Conversation


@ca16 ca16 commented Aug 13, 2025

Related to https://github.com/allenai/astabench-issues/issues/307 and https://github.com/allenai/astabench-issues/issues/199.

My understanding is that we want the results we show in the eventually public leaderboard to be based on the new config. But some (if not all) of these results will have been created using the old config when we launch. I'm making versions of those results that are compatible with the new config.

To make it easier for people to understand how we got to those results, I was thinking of adding this `original_results_url` field to the submission metadata, for cases where a given results file wasn't generated directly from a submission directory, but from another results file.

What I was picturing for the state of things when we release: the public leaderboard points to the results files compatible with the new config (which live in the public results repo). Those result files have pointers to the original result files (copies of which are also in the public results repo) via `original_results_url`, as well as to the original submission entries (copies of which are also, ideally, in a gated submissions repo) via the existing submission metadata fields.
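For illustration, here's a minimal sketch of what the metadata shapes could look like. Only `original_results_url` (the proposed field) and `logs_url` (an existing field, per the discussion below) come from this PR; the other field names and all URLs are hypothetical placeholders:

```python
# Sketch of submission metadata for a results file that was CONVERTED
# from an old-config results file. Field names other than
# original_results_url and logs_url are hypothetical.
converted = {
    "agent_name": "example-agent",  # hypothetical field
    "logs_url": "https://example.org/logs/run-1",  # existing field
    # Proposed field: points at the old-config results file this
    # results file was derived from.
    "original_results_url": "https://example.org/results/old-config/run-1",
}

# A results file generated directly from a submission directory would
# simply omit the new field.
direct = {
    "agent_name": "example-agent",
    "logs_url": "https://example.org/logs/run-2",
}

def was_converted(metadata: dict) -> bool:
    """True if this results file was derived from another results file."""
    return "original_results_url" in metadata

print(was_converted(converted))  # True
print(was_converted(direct))     # False
```

The idea being that the field's mere presence signals a converted result, so nothing changes for results produced the normal way.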

For the reviewers:

@ca16 ca16 requested review from jbragg and rodneykinney August 13, 2025 15:26

jbragg commented Aug 13, 2025

> What I was picturing for the state of things when we release: the public leaderboard points to the results files compatible with the new config (which live in the public results repo). Those result files have pointers to the original result files (copies of which are also in the public results repo) via `original_results_url`, as well as to the original submission entries (copies of which are also, ideally, in a gated submissions repo) via the existing submission metadata fields.

IIUC, won't this proposal result in some confusion where `original_results_url` and `logs_url` both point to original versions but only one of them has the prefix `original_`?


ca16 commented Aug 13, 2025

> IIUC, won't this proposal result in some confusion where `original_results_url` and `logs_url` both point to original versions but only one of them has the prefix `original_`?

maybe! thinking of a better name... any thoughts on your end?


jbragg commented Aug 13, 2025

@ca16 how will re-scoring work under this plan? Re-scoring will be needed periodically to normalize costs across submissions. I'm a little confused about how the transformed results file that points to the original results file will get updated. Maybe this will help me understand what a good name might be.


ca16 commented Aug 13, 2025

Are we intending that converting results produced under one config into results that work with another config will be commonplace? I thought that was kind of a band-aid for right now, because we wanted a different config for what we launch compared to what we ran our evaluations with. So for this particular situation, my plan is to make new versions of everything that needs one as part of the curation ticket... I wasn't picturing an automated process to make and push a new version of any result that appears under the old config.

Long term, if people need to do something similar, I imagine it would be more of a one-off thing? Like they accidentally ran with the wrong config and therefore would themselves explicitly decide to create a new version of their results.

Or do you mean sometime in the future we might want to rescore the results that we have right now (that I'm going to convert as part of the curation ticket), and if that happens we want to automatically also have updated versions that work with the new config?


jbragg commented Aug 13, 2025

> Or do you mean sometime in the future we might want to rescore the results that we have right now (that I'm going to convert as part of the curation ticket), and if that happens we want to automatically also have updated versions that work with the new config?

Yes, I'm talking about the current submissions, not some future set of submissions. I want to make sure that we can periodically re-score all submissions in the leaderboard, including these, and have the new costs appear.


ca16 commented Aug 13, 2025

> > Or do you mean sometime in the future we might want to rescore the results that we have right now (that I'm going to convert as part of the curation ticket), and if that happens we want to automatically also have updated versions that work with the new config?
>
> Yes, I'm talking about the current submissions, not some future set of submissions. I want to make sure that we can periodically re-score all submissions in the leaderboard, including these, and have the new costs appear.

Got it! My plan did not account for this. I'm going to set this requirement aside for now until there's something ready for Ruben for leaderboard screenshots, and circle back to it after.

