
Adjust openness and tool usage values#70

Merged: ca16 merged 8 commits into main from chloea-openness-tooling-values on Aug 28, 2025

Conversation


@ca16 ca16 commented Aug 22, 2025

Related to https://github.com/allenai/astabench-issues/issues/393.

To align with what's in https://docs.google.com/document/d/1wmFGspHFyDMdnTDmnyHhdHUEpzI9FX2hdf-ULJix6SQ/edit?tab=t.12qqxw3r8fhi#heading=h.njenuc1q8cq2.

The leaderboard can currently handle either the new or old values (https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py), and I've updated the results that we want to back the public leaderboard to have the new values. So I think this is mostly about getting new submissions to use the new values from the start.

@ca16 ca16 requested a review from mdarcy220 August 22, 2025 21:18
"s": "Standard",
"css": "Custom with Standard Search",
"c": "Fully Custom",
"css": "Custom interface",
ca16 (Collaborator, Author):

Should I change the key for this one too? Or will it cause confusion for anyone used to 'css'?

Collaborator:

I don't think ai2 internal familiarity with css should factor into the decision


ca16 commented Aug 22, 2025

Running publish --help:

# agenteval publish --help
Usage: agenteval publish [OPTIONS] LOG_DIR

  Upload Inspect logs to HuggingFace for official scoring

Options:
  --submissions-repo-id TEXT    HF repo id for submissions. Defaults to
                                SUBMISSIONS_REPO_ID env var.
  -o, --openness [c|api|os|ow]  Level of openness for the agent. Options: c (
                                Closed source & UI only), api (Closed source &
                                API available), os (Open source & closed
                                weights), ow (Open source & open weights)
                                [required]
  -t, --tool-usage [s|css|c]    Tool choices available to the agent. Options:
                                s (Standard), css (Custom interface), c (Fully
                                custom)  [required]
  --username TEXT               HF username/org for submission. Defaults to
                                your HF account name.
  --agent-name TEXT             Descriptive agent name for submission.
                                [required]
  --agent-description TEXT      Description of the agent being submitted.
  --agent-url TEXT              URL to the agent's repository or
                                documentation.
  --help                        Show this message and exit.
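The short keys accepted by `-o/--openness` and `-t/--tool-usage` map onto the full values shown in the help text. A minimal sketch of that mapping (names are illustrative assumptions; the real definitions live in agent-eval):

```python
# Hypothetical mapping from the CLI's short keys to the full values,
# mirroring the `agenteval publish --help` output above.
OPENNESS_CHOICES = {
    "c": "Closed source & UI only",
    "api": "Closed source & API available",
    "os": "Open source & closed weights",
    "ow": "Open source & open weights",
}

TOOL_USAGE_CHOICES = {
    "s": "Standard",
    "css": "Custom interface",
    "c": "Fully custom",
}


def resolve_choice(key: str, choices: dict[str, str]) -> str:
    """Translate a short CLI key into its full value, rejecting unknown keys."""
    if key not in choices:
        raise ValueError(f"unknown key {key!r}; expected one of {sorted(choices)}")
    return choices[key]
```

With a table like this, the CLI only ever stores the full value, so the short keys can change without affecting recorded submissions.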

@ca16
Copy link
Copy Markdown
Collaborator Author

ca16 commented Aug 22, 2025

Also, let me know if you think we shouldn't rock the boat with this before launch, since everything in the leaderboard looks fine...

@mdarcy220 (Contributor) left a comment:


If @jbragg has no objections, I'm good with it; per side-channel, it sounds like the leaderboard already has to handle both. In general, though, I think we should aim to separate display names from internal names at some point.

"api": "API Available",
"os": "Open Source",
"ow": "Open Source + Open Weights",
"c": " Closed source & UI only",
Contributor:

Suggested change:
- "c": " Closed source & UI only",
+ "c": "Closed source & UI only",

(may be important to check that all values are exactly identical to what the lb currently knows how to handle? not sure if the leading space is there too)

ca16 (Collaborator, Author):
thanks for catching!


ca16 commented Aug 23, 2025

> in general I think we should aim to at some point separate display name from internal name though

Makes sense to me, and I think it's also relevant here: another thing to maybe consider before merging is capitalization... There was some conversation about this, I believe, but I'm not sure whether it was specific to how the leaderboard shows things or was supposed to be global... Maybe worth figuring that out before this merges?

@mdarcy220 (Contributor):

hmm yeah it looks like there has been some attempt to define consistent names: https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py

@mdarcy220 (Contributor):

It looks like your changes here match the casing of aliases in that file though


ca16 commented Aug 23, 2025

Yeah the aliases file handles making the versions defined there work for the plot and table, but I think there might be some capitalization inconsistencies elsewhere, e.g.
https://github.com/allenai/asta-bench-leaderboard/blob/dbeca22ffc01b26b835f1e3c95f2bbcacbb28e1e/ui_components.py#L99
vs
https://github.com/allenai/asta-bench-leaderboard/blob/dbeca22ffc01b26b835f1e3c95f2bbcacbb28e1e/ui_components.py#L227
which makes me question whether it is totally resolved?


ca16 commented Aug 23, 2025

Also just remembered, external submissions won't be using the CLI, so another reason this isn't essential for launch.

import yaml
from pydantic import BaseModel, ValidationError

OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS = "Open source & open weights"
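The diff above defines canonical value constants that submissions are validated against. A dependency-free sketch of that idea (the actual code uses pydantic; only `OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS` appears in the diff, so the remaining spellings are inferred from the `--help` output and should be treated as assumptions):

```python
# Canonical metadata values. Exact-match validation catches stale or
# misspelled values, including stray whitespace, at submission time.
OPENNESS_VALUES = frozenset({
    "Closed source & UI only",
    "Closed source & API available",
    "Open source & closed weights",
    "Open source & open weights",
})
TOOL_USAGE_VALUES = frozenset({"Standard", "Custom interface", "Fully custom"})


def validate_submission(openness: str, tool_usage: str) -> None:
    """Raise ValueError unless both values are exactly canonical."""
    if openness not in OPENNESS_VALUES:
        raise ValueError(f"invalid openness value: {openness!r}")
    if tool_usage not in TOOL_USAGE_VALUES:
        raise ValueError(f"invalid tool usage value: {tool_usage!r}")
```

Note that exact matching would reject the leading-space variant `" Closed source & UI only"` caught in review above.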
ca16 (Collaborator, Author):

Let me know if you think there's a better place to define these...

ca16 (Collaborator, Author):

Also, do you think it's worth adding comments here about how to handle it if these values ever change, with respect to the leaderboard? Or is that weird, because the leaderboard depends on this and not the other way around?

Contributor:

the names here should never ever be updated right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names but I thought the plan was to replace those defs with imports from agent-eval)

If true, I think it's worth adding a note to that effect. If not true, then my preference is to make it true; but otherwise, in the short term, I do think we need to clearly explain whatever implications arise from updating a name here. (For now I guess we could explicitly mention the leaderboard repo, but consider that at some point this lib could be used by third parties, at which point we will basically be locked into not changing the names anyway.)

ca16 (Collaborator, Author):

> the names here should never ever be updated right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names but I thought the plan was to replace those defs with imports from agent-eval)

I was thinking more if we change what we want the exact names of these categories to be again. I do intend to replace the defs in the leaderboard code with imports from agent-eval like we talked about.

The scenario I'm thinking about is something like: 'Standard' in agent-eval becomes 'Standard2', because we decided that's better/clearer/preferred. This change would propagate to the leaderboard the next time it bumps to the latest version. The leaderboard would then be able to handle results with old versions of the standard name and the new version, but not the version we just replaced, until it's added to the aliases.py file. (Which makes me think the aliases logic might be better off in the view logic in agent-eval, where we normalize LLM names, so that the leaderboard-specific code can just assume results have had their openness and tool usage values normalized. For now, though, I'll go with adding a comment.)
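The normalization step described here could be sketched roughly as follows (the alias table below is hypothetical; the real mapping lives in the leaderboard's aliases.py):

```python
# Hypothetical alias table mapping retired names to their canonical
# replacements; canonical values pass through unchanged.
TOOL_USAGE_ALIASES = {
    "Custom with Standard Search": "Custom interface",  # pre-change value
    "Fully Custom": "Fully custom",                     # old capitalization
}


def normalize_tool_usage(value: str) -> str:
    """Map an old tool-usage name to its canonical form."""
    return TOOL_USAGE_ALIASES.get(value, value)
```

Running this once when results are loaded would let downstream leaderboard code assume every value is already canonical.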

Contributor:

> change what we want the exact names of these categories to be again

I think for a change that is truly just about the name, separating internal from display names is the way (e.g. imagine we want to translate the leaderboard into 100 languages); but, if we're talking about changing the nature of the categories themselves (e.g. splitting "Closed & UI Only" to two separate levels) then it does seem more complicated.

In any case, I agree that canonicalization/aliasing should ideally be handled in agent-eval, but comment for now sounds good.
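The internal-vs-display separation discussed above could look something like this sketch (names and structure are assumptions, not the agent-eval API):

```python
from enum import Enum


class Openness(Enum):
    """Stable internal identifiers, stored with submissions; never renamed."""
    CLOSED_UI_ONLY = "c"
    CLOSED_API_AVAILABLE = "api"
    OPEN_SOURCE = "os"
    OPEN_WEIGHTS = "ow"


# Display names live in a separate table that can change (or be
# translated) freely without touching stored submissions.
OPENNESS_DISPLAY = {
    Openness.CLOSED_UI_ONLY: "Closed source & UI only",
    Openness.CLOSED_API_AVAILABLE: "Closed source & API available",
    Openness.OPEN_SOURCE: "Open source & closed weights",
    Openness.OPEN_WEIGHTS: "Open source & open weights",
}
```

Renaming (or localizing) a category then only touches `OPENNESS_DISPLAY`, while stored results keep the stable enum value.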

@ca16 ca16 merged commit b3a7b49 into main on Aug 28, 2025
4 checks passed
@ca16 ca16 deleted the chloea-openness-tooling-values branch August 28, 2025 16:33

ca16 commented Aug 28, 2025

Published a new library version...
