
How generalizable is this framework? #97

@samching

Description


Great work on this! I created an issue because there isn't a discussion tab in this repo. Feel free to close if irrelevant.

It seems like the framework described here can be generalized to the following steps:

  1. You define a task with its prompt(s), plus a test set with a few gold-standard positive and negative examples.
  2. You run n generations.
  3. You score the generations (human-in-the-loop, a grader prompt, or a combination of both).
  4. You tweak the task's models / prompts / hyperparams and repeat (rough sketch of the loop below).
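To make that concrete, here's a rough sketch of what I mean by the loop in plain Python. Everything here (the `Example` type, the `generate`/`grade` callables, the 0-1 score) is made up for illustration and isn't anything in auto-evaluator today:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types/functions just to illustrate the shape of the loop.

@dataclass
class Example:
    input: str
    label: bool  # gold-standard positive / negative


def run_eval(
    task_prompt: str,
    test_set: list[Example],
    generate: Callable[[str, str], str],      # (task_prompt, input) -> generation
    grade: Callable[[str, Example], float],   # (generation, example) -> score in [0, 1]
    n: int = 3,
) -> float:
    """Steps 2-3 above: run n generations per example and score them."""
    scores = []
    for example in test_set:
        for _ in range(n):
            generation = generate(task_prompt, example.input)
            scores.append(grade(generation, example))
    return sum(scores) / len(scores)

# Step 4 is the outer loop: tweak task_prompt / model / hyperparams,
# call run_eval again, and compare the aggregate scores.
```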

Right now, auto-evaluator is focused on QA with retrieval, but I can imagine you could run the same process with:

  • Classification tasks (e.g. grader prompt: how well is the classifier doing?)
  • Summarization tasks (e.g. grader prompt: how accurately does the summary capture the target text? see the sketch below)
  • Style transfer tasks (e.g. grader prompt: how much does this sound like X?)
  • Ranking tasks (e.g. grader prompt: how well does this ordering capture Y?)
  • ...
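For the summarization case, for example, the grader prompt could be as simple as something like this (wording entirely made up, not taken from auto-evaluator):

```python
# Hypothetical grader prompt for the summarization case.
SUMMARY_GRADER_PROMPT = """\
You are grading a summary against its source text.

Source text:
{source_text}

Candidate summary:
{summary}

How accurately does the summary capture the source text?
Answer with a single score from 1 (inaccurate) to 5 (fully accurate),
followed by one sentence of justification.
"""


def build_grader_input(source_text: str, summary: str) -> str:
    """Fill the grader prompt for one (source, summary) pair."""
    return SUMMARY_GRADER_PROMPT.format(source_text=source_text, summary=summary)
```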

Maybe I'm thinking about it too broadly, but I'm curious whether that's the direction y'all see this project headed!
