Conversation
src/agenteval/cli.py
Outdated
```
"s": "Standard",
"css": "Custom with Standard Search",
"c": "Fully Custom",
"css": "Custom interface",
```
Should I change the key for this one too? Or will it cause confusion for anyone used to 'css'?
I don't think ai2-internal familiarity with css should factor into the decision.
Also let me know if you think we shouldn't rock the boat with this before launch, since everything in the leaderboard looks fine...
src/agenteval/cli.py
Outdated
```
"api": "API Available",
"os": "Open Source",
"ow": "Open Source + Open Weights",
"c": " Closed source & UI only",
```
```diff
-"c": " Closed source & UI only",
+"c": "Closed source & UI only",
```
(may be important to check that all values are exactly identical to what the lb currently knows how to handle? not sure if the leading space is there too)
makes sense to me, and I also think it's relevant to this: another thing to maybe consider before merging is capitalization... There was some conversation about this, I believe, but I'm not sure if it was specific to how the leaderboard shows stuff or if it was supposed to be global... Maybe worth figuring that out before this merges?
hmm yeah, it looks like there has been some attempt to define consistent names: https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py
It looks like your changes here match the casing of aliases in that file though |
Yeah, the aliases file handles making the versions defined there work for the plot and table, but I think there might be some capitalization inconsistencies elsewhere, e.g.
Also just remembered, external submissions won't be using the CLI, so another reason this isn't essential for launch. |
```python
import yaml
from pydantic import BaseModel, ValidationError

OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS = "Open source & open weights"
```
Let me know if you think there's a better place to define these...
Also, do you think it's worth adding comments here about how to handle it if these ever update, with respect to the leaderboard? Or is that weird, because the leaderboard depends on this, not the other way around?
The names here should never ever be updated, right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names, but I thought the plan was to replace those defs with imports from agent-eval.)
If true, I think it's worth adding a note to that effect. If not true, then my preference is to make it true; but otherwise, in the short term, I do think we need to clearly explain whatever implications arise from updating a name here. (For now I guess we could explicitly mention the leaderboard repo, but consider that at some point this lib could be used by third parties, at which point we'll basically be locked into not changing the names anyway.)
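A sketch of the kind of note that could accompany these constants (the wording and placement are illustrative suggestions, not the actual agent-eval source):

```python
# NOTE (illustrative wording): these are internal identifiers, not display
# strings. Treat them as frozen once published -- the leaderboard (and
# potentially third-party consumers) store results keyed on these exact
# values. To change what users see, update a display-name mapping on the
# leaderboard side; to retire a spelling, add it to the leaderboard's
# aliases.py rather than editing the value here.
OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS = "Open source & open weights"
```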
> the names here should never ever be updated right? Any "update" the lb wants to make would just be to change a display value that is mapped from these internal names? (I know right now the lb redefines these internal names but I thought the plan was to replace those defs with imports from agent-eval)
I was thinking more about if we change what we want the exact names of these categories to be again. I do intend to replace the defs in the leaderboard code with imports from agent-eval, like we talked about.

The scenario I'm thinking about is something like: 'Standard' in agent-eval becomes 'Standard2', because we decided that's better/clearer/preferred. This change would propagate to the leaderboard the next time it bumps to the latest version. But then in the leaderboard we'd be able to handle any results with old versions of the standard name, and the new version of the standard name, but not the version we just replaced, until it's added to the aliases.py file.

(Which is making me think maybe the aliases logic would be better off in the view logic in agent-eval, where we normalize LLM names, so that the leaderboard-specific code can just assume results have had their openness and tool-usage values normalized. But for now I'll go with adding a comment.)
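The normalization being described could look something like this sketch (a hedged illustration of the aliasing idea; the category names and alias entries are made up for the example, not the real aliases.py contents):

```python
# Canonical category names (illustrative; mirrors the idea of frozen
# internal names defined in agent-eval).
CANONICAL_TOOL_USAGE = {"Standard", "Custom with Standard Search", "Fully Custom"}

# Retired spellings -> current canonical name. New entries get appended
# here whenever a name is replaced, so old results keep resolving.
TOOL_USAGE_ALIASES = {
    "Custom interface": "Fully Custom",
}

def normalize_tool_usage(value: str) -> str:
    """Return the canonical category name for a possibly-outdated value."""
    if value in CANONICAL_TOOL_USAGE:
        return value
    try:
        return TOOL_USAGE_ALIASES[value]
    except KeyError:
        raise ValueError(f"Unknown tool-usage value: {value!r}")
```

If this lived in agent-eval's view logic (next to the LLM-name normalization), the leaderboard could assume every result it loads has already passed through it.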
> change what we want the exact names of these categories to be again
I think for a change that is truly just about the name, separating internal from display names is the way (e.g. imagine we want to translate the leaderboard into 100 languages); but if we're talking about changing the nature of the categories themselves (e.g. splitting "Closed & UI Only" into two separate levels), then it does seem more complicated.
In any case, I agree that canonicalization/aliasing should ideally be handled in agent-eval, but comment for now sounds good.
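A minimal sketch of the internal-name vs. display-name separation being discussed (the display strings and mapping are illustrative assumptions, not the real agent-eval or leaderboard code):

```python
# Internal names: append-only identifiers, never renamed once published.
OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS = "Open source & open weights"
OPENNESS_CLOSED_UI_ONLY = "Closed source & UI only"

# Display names are a presentation layer and can change freely (or be
# translated) without touching stored results keyed on internal names.
DISPLAY_NAMES_EN = {
    OPENNESS_OPEN_SOURCE_OPEN_WEIGHTS: "Open Source + Open Weights",
    OPENNESS_CLOSED_UI_ONLY: "Closed Source (UI only)",
}

def display_name(internal: str, table: dict = DISPLAY_NAMES_EN) -> str:
    # Fall back to the internal name when no display override exists.
    return table.get(internal, internal)
```

Under this split, "translate the leaderboard into 100 languages" is just 100 `table` dicts; the internal names stay fixed.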
Published a new library version...
Related to https://github.com/allenai/astabench-issues/issues/393.
To align with what's in https://docs.google.com/document/d/1wmFGspHFyDMdnTDmnyHhdHUEpzI9FX2hdf-ULJix6SQ/edit?tab=t.12qqxw3r8fhi#heading=h.njenuc1q8cq2.
The leaderboard can currently handle either the new or the old values: https://github.com/allenai/asta-bench-leaderboard/blob/main/aliases.py. And I've updated the results that we want to back the public leaderboard to have the new values. So I think this is mostly about getting new submissions to have the new values from the start.