
Follow up on "Stronger Models are NOT Stronger Teachers for Instruction Tuning" #40

@johnr14

Hi, I'm not sure how else to reach you. There are no email links on your website, and I'm not a member of arXiv.

OK, I really like your papers and the work you are doing!

I was writing my own pipeline for IFD scoring this week, and I just read your Stronger Models are NOT Stronger Teachers for Instruction Tuning. It's right in line with what I was looking for.
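
In case it helps frame the questions below, my pipeline follows the usual IFD definition, IFD(Q, A) = loss(A | Q) / loss(A). Here's a minimal sketch of what I mean, assuming any HuggingFace causal LM; the model name and the "\n" separator are my placeholders, not your paper's setup:

```python
# Minimal IFD sketch: IFD(Q, A) = loss(A | Q) / loss(A), the ratio of the
# model's mean per-token loss on the answer with and without the instruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b"  # assumption: any HF causal LM should work here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Mean cross-entropy over answer tokens only (prompt tokens are masked)."""
    full = tok(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    labels = full.clone()
    if prompt:  # mask the prompt so the loss covers only the answer
        n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
        labels[:, :n_prompt] = -100
    return model(full, labels=labels).loss.item()

def ifd(instruction: str, answer: str) -> float:
    # > 1 means the instruction made the answer harder to predict
    return answer_loss(instruction + "\n", answer) / answer_loss("", answer)
```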

You conclude with:

"We will explore several promising directions. First, efficiently transforming existing datasets to achieve better compatibility can lead to more effective use of available instruction tuning datasets. Second, investigating theoretical foundations of compatibility would enhance our understanding of the underlying mechanisms of instruction tuning. Lastly, studying the impact of different response generators for preference tuning may help aligning LLMs to better reflect human values."

Regarding efficiently transforming existing datasets, here are my questions; they could make a great next paper for you (I don't work in IT and won't go for a PhD at this point). I didn't see any follow-up papers?

My questions:

  • What are the impacts of mixing instructions with vastly different CAR, or of training on datasets with too wide a distribution of CAR? (See the sketch just after this list.)
  • Can rewriting lower-CAR answers from larger, less compatible generators (LLCG) with more compatible generators positively impact the performance of the SFT model?
  • Can we reproduce the output quality of an LLCG by using a more compatible small generator (like Gemma 2 9B)?
  • How much of the CAR score is influenced by the content vs. the structure of the output, relative to the performance of the trained model?
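
For the first question, here is a rough way I'd eyeball the spread across a mixed dataset. I'm sketching from memory: I use the base (student) model's perplexity on each response as a stand-in for the compatibility term in CAR, and I omit the reward score your full CAR involves. It reuses `answer_loss()` from the IFD sketch above:

```python
# Rough sketch: per-source spread of a compatibility proxy in a mixed dataset.
# Perplexity under the base model stands in for CAR's compatibility term;
# the reward component is omitted.
import math
from collections import defaultdict

def compatibility_ppl(instruction: str, response: str) -> float:
    """Perplexity of the response under the base model, given the instruction."""
    return math.exp(answer_loss(instruction + "\n", response))

def ppl_spread_by_source(mixed_dataset):
    """mixed_dataset: iterable of dicts with 'instruction', 'response', 'source'."""
    by_source = defaultdict(list)
    for ex in mixed_dataset:
        by_source[ex["source"]].append(
            compatibility_ppl(ex["instruction"], ex["response"])
        )
    for source, ppls in sorted(by_source.items()):
        mean = sum(ppls) / len(ppls)
        std = (sum((p - mean) ** 2 for p in ppls) / len(ppls)) ** 0.5
        print(f"{source}: mean PPL {mean:.2f}, std {std:.2f}, n={len(ppls)}")
```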

To verify these:

  1. Use a pipeline to identify all the elements present in the LLCG's output (e.g., a 405B model).
  2. Create a complex prompt for Gemma 2 9B to output all of its internal knowledge about the elements in (1); we could call this inherent knowledge. Evolve the prompt until it outputs everything it knows about the subject, iterating as long as each new version scores better against the LLCG's output (relative to the elements present). The prompt must contain no new knowledge, only directions on what to output and in what way. It could be multiple prompts, one per part of the LLCG's output. This could also measure the inherent knowledge of a smaller model vs. a larger one. (See the sketch after this list.)
  3. Compare, via SFT, datasets of:
    A) Gemma 2 9B's evolved output generated by the complex (or multiple) prompts, while keeping the original prompt for training (inherent knowledge).
    B) Gemma 2 9B's evolved output, then rewritten by the model itself to match the template of its original output but with more inherent knowledge forcefully extracted.
    C) Gemma 2 9B's reformulation of the LLCG's output (new knowledge).
    D) The original dataset with LLCG output (new knowledge).
    E) A dataset mixing outputs from many generators (mixed knowledge) and structures (as is currently done when mixing datasets without care).
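
For step (2), the evolution loop I have in mind looks roughly like this. `coverage`, `mutate_prompt`, and `generate` are hypothetical helpers I'm naming for illustration: `coverage(text, elements)` would score how many of the step-(1) elements appear in a text, `mutate_prompt` would rewrite the prompt (directions only, no new knowledge, possibly using the small model itself), and `generate` wraps ordinary sampling with Gemma 2 9B:

```python
# Hedged sketch of step (2): evolve a knowledge-extraction prompt for the
# small model until its output stops gaining ground on the LLCG reference.
# coverage(), mutate_prompt(), and generate() are hypothetical helpers.

def evolve_prompt(seed_prompt, elements, llcg_output,
                  generate, coverage, mutate_prompt, max_rounds=10):
    target = coverage(llcg_output, elements)        # what the LLCG output covers
    best_prompt = seed_prompt
    best_score = coverage(generate(best_prompt), elements)
    for _ in range(max_rounds):
        candidate = mutate_prompt(best_prompt)      # directions only, no new knowledge
        score = coverage(generate(candidate), elements)
        if score <= best_score:
            break                                   # no longer improving: stop evolving
        best_prompt, best_score = candidate, score
        if best_score >= target:
            break                                   # matched the LLCG output's coverage
    return best_prompt, best_score
```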

I believe you have most of the tools in place to produce this research in a timely manner.
Point (B) is touched on in another paper, where original (not synthetic) knowledge is rewritten by LLMs to lower its perplexity, which helps with training... I forget which paper. But it would be interesting to compare its effect here in (B).
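
If you try (B), a quick sanity check along those lines, again reusing `answer_loss()` from the IFD sketch. This is my guess at how to measure it, not that paper's method:

```python
# Quick check for (B): does the self-rewrite actually have lower perplexity
# under the student model than the original answer?
import math

def rewrite_lowers_ppl(instruction: str, original: str, rewritten: str) -> bool:
    ppl_orig = math.exp(answer_loss(instruction + "\n", original))
    ppl_new = math.exp(answer_loss(instruction + "\n", rewritten))
    print(f"original PPL {ppl_orig:.2f} -> rewritten PPL {ppl_new:.2f}")
    return ppl_new < ppl_orig
```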

Thanks, and keep up the good work!
Feel free to implement any of these ideas, as long as you contribute your findings back to the community.
