
Follow up on "Stronger Models are NOT Stronger Teachers for Instruction Tuning" #40

@johnr14

Hi, I'm not sure how else to reach you. There are no email links on your website, and I'm not a member of arXiv.

OK, I really like your papers and the work you are doing!

I was writing my own pipeline for IFD scoring this week, and I just read your Stronger Models are NOT Stronger Teachers for Instruction Tuning. It's right in line with what I was looking for.
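
In case it helps frame the questions below, my pipeline follows the usual IFD definition, IFD(Q, A) = loss(A | Q) / loss(A). Here's a minimal sketch of what I mean, assuming any HuggingFace causal LM; the model name and the "\n" separator are my placeholders, not your paper's setup:

```python
# Minimal IFD sketch: IFD(Q, A) = loss(A | Q) / loss(A), the ratio of the
# model's mean per-token loss on the answer with and without the instruction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-9b"  # assumption: any HF causal LM should work here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def answer_loss(prompt: str, answer: str) -> float:
    """Mean cross-entropy over answer tokens only (prompt tokens are masked)."""
    full = tok(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    labels = full.clone()
    if prompt:  # mask the prompt so the loss covers only the answer
        n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
        labels[:, :n_prompt] = -100
    return model(full, labels=labels).loss.item()

def ifd(instruction: str, answer: str) -> float:
    # > 1 means the instruction made the answer harder to predict
    return answer_loss(instruction + "\n", answer) / answer_loss("", answer)
```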

You conclude with:

"We will explore several promising directions. First, efficiently transforming existing datasets to achieve better compatibility can lead to more effective use of available instruction tuning datasets. Second, investigating theoretical foundations of compatibility would enhance our understanding of the underlying mechanisms of instruction tuning. Lastly, studying the impact of different response generators for preference tuning may help aligning LLMs to better reflect human values."

Regarding efficiently transforming existing datasets, here are my questions; they could make a great next paper for you (I don't work in IT and won't go for a PhD at this point). I didn't see any follow-up papers?

My questions:

  • What are the impacts of mixing instructions with vastly different CAR, or of training on datasets with too wide a distribution of CAR? (See the sketch just after this list.)
  • Can rewriting lower-CAR answers from larger, less compatible generators (LLCG) with more compatible generators positively impact the performance of the SFT model?
  • Can we reproduce the output quality of an LLCG by using a more compatible small generator (like Gemma 2 9B)?
  • How much of the CAR score is influenced by the content vs. the structure of the output, relative to the performance of the trained model?
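
For the first question, here is a rough way I'd eyeball the spread across a mixed dataset. I'm sketching from memory: I use the base (student) model's perplexity on each response as a stand-in for the compatibility term in CAR, and I omit the reward score your full CAR involves. It reuses `answer_loss()` from the IFD sketch above:

```python
# Rough sketch: per-source spread of a compatibility proxy in a mixed dataset.
# Perplexity under the base model stands in for CAR's compatibility term;
# the reward component is omitted.
import math
from collections import defaultdict

def compatibility_ppl(instruction: str, response: str) -> float:
    """Perplexity of the response under the base model, given the instruction."""
    return math.exp(answer_loss(instruction + "\n", response))

def ppl_spread_by_source(mixed_dataset):
    """mixed_dataset: iterable of dicts with 'instruction', 'response', 'source'."""
    by_source = defaultdict(list)
    for ex in mixed_dataset:
        by_source[ex["source"]].append(
            compatibility_ppl(ex["instruction"], ex["response"])
        )
    for source, ppls in sorted(by_source.items()):
        mean = sum(ppls) / len(ppls)
        std = (sum((p - mean) ** 2 for p in ppls) / len(ppls)) ** 0.5
        print(f"{source}: mean PPL {mean:.2f}, std {std:.2f}, n={len(ppls)}")
```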

To verify these:

  1. Use a pipeline to identify all the elements present in the LLCG's output (e.g., a 405B model).
  2. Create a complex prompt for Gemma 2 9B to output all of its internal knowledge about the elements in (1); we could call this inherent knowledge. Evolve the prompt until it outputs everything it knows about the subject, iterating as long as each new version scores better against the LLCG's output (relative to the elements present). The prompt must contain no new knowledge, only directions on what to output and in what way. It could be multiple prompts, one per part of the LLCG's output. This could also measure the inherent knowledge of a smaller model vs. a larger one. (See the sketch after this list.)
  3. Compare, via SFT, datasets of:
    A) Gemma 2 9B's evolved output generated by the complex (or multiple) prompts, while keeping the original prompt for training (inherent knowledge).
    B) Gemma 2 9B's evolved output, then rewritten by the model itself to match the template of its original output but with more inherent knowledge forcefully extracted.
    C) Gemma 2 9B's reformulation of the LLCG's output (new knowledge).
    D) The original dataset with LLCG output (new knowledge).
    E) A dataset mixing outputs from many generators (mixed knowledge) and structures (as is currently done when mixing datasets without care).
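
For step (2), the evolution loop I have in mind looks roughly like this. `coverage`, `mutate_prompt`, and `generate` are hypothetical helpers I'm naming for illustration: `coverage(text, elements)` would score how many of the step-(1) elements appear in a text, `mutate_prompt` would rewrite the prompt (directions only, no new knowledge, possibly using the small model itself), and `generate` wraps ordinary sampling with Gemma 2 9B:

```python
# Hedged sketch of step (2): evolve a knowledge-extraction prompt for the
# small model until its output stops gaining ground on the LLCG reference.
# coverage(), mutate_prompt(), and generate() are hypothetical helpers.

def evolve_prompt(seed_prompt, elements, llcg_output,
                  generate, coverage, mutate_prompt, max_rounds=10):
    target = coverage(llcg_output, elements)        # what the LLCG output covers
    best_prompt = seed_prompt
    best_score = coverage(generate(best_prompt), elements)
    for _ in range(max_rounds):
        candidate = mutate_prompt(best_prompt)      # directions only, no new knowledge
        score = coverage(generate(candidate), elements)
        if score <= best_score:
            break                                   # no longer improving: stop evolving
        best_prompt, best_score = candidate, score
        if best_score >= target:
            break                                   # matched the LLCG output's coverage
    return best_prompt, best_score
```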

I believe you have most of the tools in place to produce this research in a timely manner.
Point (B) is touched on in another paper, where original (not synthetic) knowledge is rewritten by LLMs to lower its perplexity, which helps with training... I forget which paper. But it would be interesting to compare its effect here in (B).
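
If you try (B), a quick sanity check along those lines, again reusing `answer_loss()` from the IFD sketch. This is my guess at how to measure it, not that paper's method:

```python
# Quick check for (B): does the self-rewrite actually have lower perplexity
# under the student model than the original answer?
import math

def rewrite_lowers_ppl(instruction: str, original: str, rewritten: str) -> bool:
    ppl_orig = math.exp(answer_loss(instruction + "\n", original))
    ppl_new = math.exp(answer_loss(instruction + "\n", rewritten))
    print(f"original PPL {ppl_orig:.2f} -> rewritten PPL {ppl_new:.2f}")
    return ppl_new < ppl_orig
```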

Thanks, and keep up the good work!
Feel free to implement any of these ideas, as long as you contribute your findings back to the community.
