Score Aggregation for Multilingual benchmark (and in general) #752
-
One point I always bring up, and I think it fits well here, is that we have to be careful about how we weight languages.
I was thinking we could take inspiration from voting systems, which face a similar balancing problem: the number of seats in a parliament has to give each constituency adequate representation, while the number of voters in each constituency also has to be accounted for.
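To make the voting analogy concrete, here is a minimal sketch of one such scheme: square-root weighting, in the spirit of the Penrose method from voting theory. It is only an illustration, not a proposal agreed on in this thread. The function names, the language sizes (which could stand for number of speakers or number of available tasks), and the scores are all hypothetical.

```python
import math


def sqrt_weights(sizes):
    """Weights proportional to sqrt(size), normalized to sum to 1.

    `sizes` maps language -> some size measure (an assumption here:
    e.g. number of speakers or number of tasks for that language).
    """
    roots = {lang: math.sqrt(n) for lang, n in sizes.items()}
    total = sum(roots.values())
    return {lang: r / total for lang, r in roots.items()}


def weighted_score(scores, weights):
    """Weighted mean of per-language scores."""
    return sum(weights[lang] * scores[lang] for lang in scores)


# Hypothetical numbers, for illustration only.
sizes = {"eng": 100, "dan": 4, "swa": 1}
scores = {"eng": 0.80, "dan": 0.60, "swa": 0.40}

weights = sqrt_weights(sizes)
overall = weighted_score(scores, weights)
```

With these numbers, square-root weighting sits between the two extremes: a plain mean over languages ignores size entirely, while weights proportional to size let the largest language dominate almost completely.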
-
Vague suggestions for desiderata (in need of formalization), with score := aggregated performance
-
Thanks for starting this discussion. The desired criteria for metrics are as follows:
@vaibhavad and I discussed some ideas for how we might come up with an aggregate score for each language (based on a few concrete systems). More to add later.
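As a point of reference for what a per-language aggregate could look like, here is a minimal sketch. It is not the scheme discussed above, which is not spelled out in this thread. It assumes, hypothetically, that results come as `(language, task_type, score)` triples, and it averages within each task type before averaging across task types, so a task type with many datasets does not dominate a language's score.

```python
from collections import defaultdict
from statistics import mean


def language_scores(results):
    """Aggregate (language, task_type, score) triples into one score per language.

    Scores are first averaged within each task type, then across task
    types. This two-level mean is an assumption made for illustration.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for lang, task_type, score in results:
        grouped[lang][task_type].append(score)
    return {
        lang: mean(mean(task_scores) for task_scores in by_type.values())
        for lang, by_type in grouped.items()
    }


# Hypothetical results, for illustration only.
results = [
    ("dan", "Classification", 0.6),
    ("dan", "Classification", 0.8),
    ("dan", "Retrieval", 0.4),
    ("eng", "Retrieval", 0.5),
]
per_language = language_scores(results)
```

In this example the two Danish classification results count as a single task type, so Danish ends up at the midpoint between its classification mean and its retrieval score rather than being pulled towards classification.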
-
This discussion is to figure out how scores should be aggregated and how benchmarks should be constructed. As an example, I am thinking of an overall multilingual benchmark, but the approach should be generalizable. For this, I propose the following:
Selecting a representative task
Selecting a representative task is quite hard; however, I believe we can greatly simplify it using some pragmatic assumptions:
This is not intended to be the final version of the aggregation, but I believe it is doable within our time frame.