
How generalizable is this framework? #97

@samching

Description


Great work on this! I created an issue because there isn't a discussion tab in this repo. Feel free to close if irrelevant.

It seems like the framework described here can be generalized to the following steps:

  1. You define a task with its prompt(s), plus a test set with a few gold-standard positive and negative examples.
  2. You run n generations.
  3. You score the generations (human-in-the-loop, a grader prompt, or a combination of both).
  4. You tweak the task's models / prompts / hyperparams and repeat (rough sketch of the loop below).
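To make that concrete, here's a rough sketch of what I mean by the loop in plain Python. Everything here (the `Example` type, the `generate`/`grade` callables, the 0-1 score) is made up for illustration and isn't anything in auto-evaluator today:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types/functions just to illustrate the shape of the loop.

@dataclass
class Example:
    input: str
    label: bool  # gold-standard positive / negative


def run_eval(
    task_prompt: str,
    test_set: list[Example],
    generate: Callable[[str, str], str],      # (task_prompt, input) -> generation
    grade: Callable[[str, Example], float],   # (generation, example) -> score in [0, 1]
    n: int = 3,
) -> float:
    """Steps 2-3 above: run n generations per example and score them."""
    scores = []
    for example in test_set:
        for _ in range(n):
            generation = generate(task_prompt, example.input)
            scores.append(grade(generation, example))
    return sum(scores) / len(scores)

# Step 4 is the outer loop: tweak task_prompt / model / hyperparams,
# call run_eval again, and compare the aggregate scores.
```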

Right now, auto-evaluator is focused on QA with retrieval, but I can imagine you could run the same process with:

  • Classification tasks (e.g. grader prompt: how well is the classifier doing?)
  • Summarization tasks (e.g. grader prompt: how accurately does the summary capture the target text? see the sketch below)
  • Style transfer tasks (e.g. grader prompt: how much does this sound like X?)
  • Ranking tasks (e.g. grader prompt: how well does this ordering capture Y?)
  • ...
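For the summarization case, for example, the grader prompt could be as simple as something like this (wording entirely made up, not taken from auto-evaluator):

```python
# Hypothetical grader prompt for the summarization case.
SUMMARY_GRADER_PROMPT = """\
You are grading a summary against its source text.

Source text:
{source_text}

Candidate summary:
{summary}

How accurately does the summary capture the source text?
Answer with a single score from 1 (inaccurate) to 5 (fully accurate),
followed by one sentence of justification.
"""


def build_grader_input(source_text: str, summary: str) -> str:
    """Fill the grader prompt for one (source, summary) pair."""
    return SUMMARY_GRADER_PROMPT.format(source_text=source_text, summary=summary)
```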

Maybe I'm thinking about it too broadly, but I'm curious whether that's the direction y'all see this project headed!
