
Adjust openness and tool usage values#70

Merged: ca16 merged 8 commits into main from chloea-openness-tooling-values on Aug 28, 2025

Conversation


@ca16 ca16 commented Aug 22, 2025

Related to https://github.com/allenai/astabench-issues/issues/393.

To align with what's in https://docs.google.com/document/d/1wmFGspHFyDMdnTDmnyHhdHUEpzI9FX2hdf-ULJix6SQ/edit?tab=t.12qqxw3r8fhi#heading=h.njenuc1q8cq2.

The leaderboard can currently handle either the new or old values (https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py), and I've updated the results that we want to back the public leaderboard to have the new values. So I think this is mostly about getting new submissions to use the new values from the start.

@ca16 ca16 requested a review from mdarcy220 August 22, 2025 21:18
"s": "Standard",
"css": "Custom with Standard Search",
"c": "Fully Custom",
"css": "Custom interface",
ca16 (Collaborator, Author):

Should I change the key for this one too? Or will it cause confusion for anyone used to 'css'?

Collaborator:

I don't think ai2 internal familiarity with css should factor into the decision


ca16 commented Aug 22, 2025

Running publish --help:

# agenteval publish --help
Usage: agenteval publish [OPTIONS] LOG_DIR

  Upload Inspect logs to HuggingFace for official scoring

Options:
  --submissions-repo-id TEXT    HF repo id for submissions. Defaults to
                                SUBMISSIONS_REPO_ID env var.
  -o, --openness [c|api|os|ow]  Level of openness for the agent. Options: c (
                                Closed source & UI only), api (Closed source &
                                API available), os (Open source & closed
                                weights), ow (Open source & open weights)
                                [required]
  -t, --tool-usage [s|css|c]    Tool choices available to the agent. Options:
                                s (Standard), css (Custom interface), c (Fully
                                custom)  [required]
  --username TEXT               HF username/org for submission. Defaults to
                                your HF account name.
  --agent-name TEXT             Descriptive agent name for submission.
                                [required]
  --agent-description TEXT      Description of the agent being submitted.
  --agent-url TEXT              URL to the agent's repository or
                                documentation.
  --help                        Show this message and exit.
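The short keys accepted by `-o/--openness` and `-t/--tool-usage` map onto the full values shown in the help text. A minimal sketch of that mapping (names are illustrative assumptions; the real definitions live in agent-eval):

```python
# Hypothetical mapping from the CLI's short keys to the full values,
# mirroring the `agenteval publish --help` output above.
OPENNESS_CHOICES = {
    "c": "Closed source & UI only",
    "api": "Closed source & API available",
    "os": "Open source & closed weights",
    "ow": "Open source & open weights",
}

TOOL_USAGE_CHOICES = {
    "s": "Standard",
    "css": "Custom interface",
    "c": "Fully custom",
}


def resolve_choice(key: str, choices: dict[str, str]) -> str:
    """Translate a short CLI key into its full value, rejecting unknown keys."""
    if key not in choices:
        raise ValueError(f"unknown key {key!r}; expected one of {sorted(choices)}")
    return choices[key]
```

With a table like this, the CLI only ever stores the full value, so the short keys can change without affecting recorded submissions.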

@ca16
Copy link
Copy Markdown
Collaborator Author

ca16 commented Aug 22, 2025

Also, let me know if you think we shouldn't rock the boat with this before launch, since everything in the leaderboard looks fine...

@mdarcy220 (Contributor) left a comment:


If @jbragg has no objections, I'm good with it; per side-channel, it sounds like the leaderboard already has to handle both. In general, though, I think we should aim to separate display names from internal names at some point.

"api": "API Available",
"os": "Open Source",
"ow": "Open Source + Open Weights",
"c": " Closed source & UI only",
Contributor:

Suggested change:
- "c": " Closed source & UI only",
+ "c": "Closed source & UI only",

(may be important to check that all values are exactly identical to what the lb currently knows how to handle? not sure if the leading space is there too)

ca16 (Collaborator, Author):
thanks for catching!


ca16 commented Aug 23, 2025

> in general I think we should aim to at some point separate display name from internal name though

Makes sense to me, and I think it's also relevant here: another thing to maybe consider before merging is capitalization... There was some conversation about this, I believe, but I'm not sure whether it was specific to how the leaderboard shows things or was supposed to be global... Maybe worth figuring that out before this merges?

@mdarcy220 (Contributor):

hmm yeah it looks like there has been some attempt to define consistent names: https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py

@mdarcy220 (Contributor):

It looks like your changes here match the casing of aliases in that file though


ca16 commented Aug 23, 2025

Yeah the aliases file handles making the versions defined there work for the plot and table, but I think there might be some capitalization inconsistencies elsewhere, e.g.
https://github.com/allenai/asta-bench-leaderboard/blob/dbeca22ffc01b26b835f1e3c95f2bbcacbb28e1e/ui_components.py#L99
vs
https://github.com/allenai/asta-bench-leaderboard/blob/dbeca22ffc01b26b835f1e3c95f2bbcacbb28e1e/ui_components.py#L227
which makes me question whether it is totally resolved?


ca16 commented Aug 23, 2025

Also just remembered, external submissions won't be using the CLI, so another reason this isn't essential for launch.

import yaml
from pydantic import BaseModel, ValidationError

OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS = "Open source & open weights"
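The diff above defines canonical value constants that submissions are validated against. A dependency-free sketch of that idea (the actual code uses pydantic; only `OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS` appears in the diff, so the remaining spellings are inferred from the `--help` output and should be treated as assumptions):

```python
# Canonical metadata values. Exact-match validation catches stale or
# misspelled values, including stray whitespace, at submission time.
OPENNESS_VALUES = frozenset({
    "Closed source & UI only",
    "Closed source & API available",
    "Open source & closed weights",
    "Open source & open weights",
})
TOOL_USAGE_VALUES = frozenset({"Standard", "Custom interface", "Fully custom"})


def validate_submission(openness: str, tool_usage: str) -> None:
    """Raise ValueError unless both values are exactly canonical."""
    if openness not in OPENNESS_VALUES:
        raise ValueError(f"invalid openness value: {openness!r}")
    if tool_usage not in TOOL_USAGE_VALUES:
        raise ValueError(f"invalid tool usage value: {tool_usage!r}")
```

Note that exact matching would reject the leading-space variant `" Closed source & UI only"` caught in review above.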
ca16 (Collaborator, Author):

Let me know if you think there's a better place to define these...

ca16 (Collaborator, Author):

Also, do you think it's worth adding comments here about how to handle it if these values ever change, with respect to the leaderboard? Or is that weird, because the leaderboard depends on this and not the other way around?

Contributor:

the names here should never ever be updated right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names but I thought the plan was to replace those defs with imports from agent-eval)

If true, I think it's worth adding a note to that effect. If not true, then my preference is to make it true; but otherwise, in the short term, I do think we need to clearly explain whatever implications arise from updating a name here. (For now I guess we could explicitly mention the leaderboard repo, but consider that at some point this lib could be used by third parties, at which point we will basically be locked into not changing the names anyway.)

ca16 (Collaborator, Author):

> the names here should never ever be updated right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names but I thought the plan was to replace those defs with imports from agent-eval)

I was thinking more if we change what we want the exact names of these categories to be again. I do intend to replace the defs in the leaderboard code with imports from agent-eval like we talked about.

The scenario I'm thinking about is something like: 'Standard' in agent-eval becomes 'Standard2', because we decided that's better/clearer/preferred. This change would propagate to the leaderboard the next time it bumps to the latest version. The leaderboard would then be able to handle results with old versions of the standard name and the new version, but not the version we just replaced, until it's added to the aliases.py file. (Which makes me think the aliases logic might be better off in the view logic in agent-eval, where we normalize LLM names, so that the leaderboard-specific code can just assume results have had their openness and tool usage values normalized. For now, though, I'll go with adding a comment.)
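The normalization step described here could be sketched roughly as follows (the alias table below is hypothetical; the real mapping lives in the leaderboard's aliases.py):

```python
# Hypothetical alias table mapping retired names to their canonical
# replacements; canonical values pass through unchanged.
TOOL_USAGE_ALIASES = {
    "Custom with Standard Search": "Custom interface",  # pre-change value
    "Fully Custom": "Fully custom",                     # old capitalization
}


def normalize_tool_usage(value: str) -> str:
    """Map an old tool-usage name to its canonical form."""
    return TOOL_USAGE_ALIASES.get(value, value)
```

Running this once when results are loaded would let downstream leaderboard code assume every value is already canonical.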

Contributor:

> change what we want the exact names of these categories to be again

I think for a change that is truly just about the name, separating internal from display names is the way (e.g. imagine we want to translate the leaderboard into 100 languages); but, if we're talking about changing the nature of the categories themselves (e.g. splitting "Closed & UI Only" to two separate levels) then it does seem more complicated.

In any case, I agree that canonicalization/aliasing should ideally be handled in agent-eval, but comment for now sounds good.
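The internal-vs-display separation discussed above could look something like this sketch (names and structure are assumptions, not the agent-eval API):

```python
from enum import Enum


class Openness(Enum):
    """Stable internal identifiers, stored with submissions; never renamed."""
    CLOSED_UI_ONLY = "c"
    CLOSED_API_AVAILABLE = "api"
    OPEN_SOURCE = "os"
    OPEN_WEIGHTS = "ow"


# Display names live in a separate table that can change (or be
# translated) freely without touching stored submissions.
OPENNESS_DISPLAY = {
    Openness.CLOSED_UI_ONLY: "Closed source & UI only",
    Openness.CLOSED_API_AVAILABLE: "Closed source & API available",
    Openness.OPEN_SOURCE: "Open source & closed weights",
    Openness.OPEN_WEIGHTS: "Open source & open weights",
}
```

Renaming (or localizing) a category then only touches `OPENNESS_DISPLAY`, while stored results keep the stable enum value.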

@ca16 ca16 merged commit b3a7b49 into main on Aug 28, 2025
4 checks passed
@ca16 ca16 deleted the chloea-openness-tooling-values branch August 28, 2025 16:33

ca16 commented Aug 28, 2025

Published a new library version...
