Great work on this! I created an issue because there isn't a discussion tab in this repo. Feel free to close if irrelevant.
It seems like the framework described here can be generalized to the following steps:
- You define a task with the prompt(s) for it, and a test set with a few "gold standard positive and negative examples".
- You run n generations
- You score the generations (either human in the loop or grader prompt or a combination of both)
- You tweak the task models / prompts / hyperparams and repeat (see the sketch after this list)
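
To make that loop concrete, here's a minimal sketch of what I have in mind in plain Python. None of these names (`Example`, `run_eval`, `generate`, `grade`) come from auto-evaluator; they're just placeholders for "the thing that generates" and "the thing that scores":

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    """A gold-standard test case: an input plus the expected (or forbidden) output."""
    input: str
    expected: str
    positive: bool = True  # False for negative examples the model should steer away from

def run_eval(
    task_prompt: str,
    test_set: list[Example],
    generate: Callable[[str, str], str],     # (task_prompt, input) -> generation
    grade: Callable[[Example, str], float],  # (example, generation) -> score in [0, 1]
    n_generations: int = 3,
) -> float:
    """Run n generations per example, grade each one, and return the mean score."""
    scores = []
    for example in test_set:
        for _ in range(n_generations):
            generation = generate(task_prompt, example.input)
            scores.append(grade(example, generation))
    return sum(scores) / len(scores)

# Tweak the task prompt / model / hyperparams, re-run, and compare mean scores.
```

The point being: nothing in this loop is specific to QA with retrieval; only `generate` and `grade` change per task.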
Right now, auto-evaluator is focused on QA with retrieval, but I can imagine that you could run the same process with:
- Classification tasks (i.e. grader prompt: how well is the classifier doing?)
- Summarization tasks (i.e. grader prompt: how accurately does the summary capture the target text?); a sketch of this grader follows the list
- Style transfer tasks (i.e. grader prompt: how much does this sound like X?)
- Ranking tasks (i.e. grader prompt: how much does this order capture Y?)
- ...
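
As an example of what one of those grader prompts could look like, here's a rough sketch for the summarization case. The template and the 1-5 scale are just my assumptions, not anything from this repo:

```python
# Hypothetical grader prompt for the summarization task above.
SUMMARY_GRADER_PROMPT = """You are grading a summary against its source text.

Source text:
{source}

Summary:
{summary}

How accurately does the summary capture the source text?
Reply with a score from 1 (misses or distorts the content) to 5 (faithful and complete),
followed by a one-sentence justification."""

def build_grader_prompt(source: str, summary: str) -> str:
    """Fill the grader template; the result is what gets sent to the grading LLM."""
    return SUMMARY_GRADER_PROMPT.format(source=source, summary=summary)
```

Swap the question in the template and you get the classification / style transfer / ranking graders from the list above.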
Maybe I'm thinking about it too broadly, but I'm curious whether that's where y'all see this project headed!