Chain-of-thought finetuning Proposal #2023

@MattAlexMiracle

Description

Background and Rationale

Recent work on LLMs has shown large performance gains on prompted tasks when reasoning is externalised into the token stream (see https://arxiv.org/pdf/2205.11916.pdf, https://arxiv.org/pdf/2301.13379.pdf, https://arxiv.org/abs/2201.11903).
This process helps the model ground its answers by first drawing relevant facts into the foreground and then continuing with the answer:
Consider the prompt

"A dozen apples cost €6, what does a single apple cost?"
Model: €0.5

While contemporary models may attempt to answer this question directly, a chain-of-thought model will first retrieve relevant information and then answer the task:

"A dozen apples cost €6, what does a single apple cost?"
Model: A dozen apples are 12 apples. 6/12 = 0.5. Therefore an apple costs €0.5.

While the externalisation of such chains-of-thought may not always be necessary or appropriate, empirically it does seem to improve the overall performance of LLMs by giving models a temporary write-once-read-many memory.
It is unclear whether existing chat models such as ChatGPT are trained on such tasks explicitly, but the overall structure of "summarize the task appropriately -> answer the task" seems to be ingrained into ChatGPT.

Research Question

Because RLHF reinforces existing behaviour, it seems logical that the initial state of the model strongly determines downstream behaviour: after all, the reward model can only boost or penalize signals that are already present, as nonexistent signals will not get direct feedback.
The hypothesis is that a model finetuned on chain-of-thought data will carry chain-of-thought-style answers forward through RLHF. Such answers seem desirable both from a performance point of view (through the write-once-read-many memory) and from an interpretability point of view (externalised reasoning can be checked more easily than internalised reasoning).

Research Methodology

Gather appropriate explicit-reasoning datasets and finetune on them jointly with the assistant feedback data.
These datasets should contain correct and explicit reasoning chains.
Once this is done, existing reward models could be used to perform standard RLHF training.
During RLHF training, we may also be able to increase the use of explicit reasoning by using common seed prompts like "Let's think step by step" to encourage the model to produce explicit argument chains.
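As a rough illustration, the sketch below renders explicit-reasoning samples into an assistant dialogue format, optionally appends a seed prompt, and mixes them with ordinary assistant conversations for joint finetuning. The dialogue delimiter tokens, the sample fields, and the mixing ratio are assumptions chosen for illustration, not the project's actual data format.

from dataclasses import dataclass
import random

PROMPTER_TOKEN = "<|prompter|>"   # assumed dialogue delimiters, not the real ones
ASSISTANT_TOKEN = "<|assistant|>"
SEED_PROMPT = "Let's think step by step."

@dataclass
class ReasoningSample:
    question: str
    reasoning: str   # explicit chain of thought from a reasoning dataset
    answer: str

def to_dialogue(sample: ReasoningSample, use_seed_prompt: bool = True) -> str:
    """Render a reasoning sample in the same format as assistant conversations,
    keeping the chain of thought inside the assistant turn."""
    prompt = sample.question
    if use_seed_prompt:
        prompt = f"{prompt} {SEED_PROMPT}"
    completion = f"{sample.reasoning} Therefore, {sample.answer}"
    return f"{PROMPTER_TOKEN}{prompt}{ASSISTANT_TOKEN}{completion}"

def mix(assistant_dialogues: list[str],
        reasoning_samples: list[ReasoningSample],
        reasoning_fraction: float = 0.3) -> list[str]:
    """Interleave reasoning data with ordinary assistant dialogues for joint finetuning."""
    rendered = [to_dialogue(s) for s in reasoning_samples]
    n = int(reasoning_fraction * len(assistant_dialogues))
    mixed = assistant_dialogues + random.sample(rendered, min(n, len(rendered)))
    random.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    sample = ReasoningSample(
        question="A dozen apples cost €6, what does a single apple cost?",
        reasoning="A dozen apples are 12 apples. 6 / 12 = 0.5.",
        answer="an apple costs €0.5.",
    )
    print(to_dialogue(sample))

The mixing ratio and seed-prompt usage would of course need to be tuned empirically; the point of the sketch is only that reasoning data can be folded into the same dialogue format the assistant already trains on.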

Possible extensions

In the long run it might also be interesting to have these "additional information" memory components be delimited and written in an efficient way, e.g.

<scratch>
- a dozen = 12
- apples isa fruit
- 6/12 = 0.5
- ...
</scratch>

This would allow these memory components to be filtered out of the user-facing output stream and, should the scratch be sufficiently well formatted, could also allow retrieval-based information or explicit computation to be added (e.g. the model might write 6/12 = 0.7 and an explicit validator could first run through the <scratch> block to fix math mistakes).
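A minimal sketch of how such a filter and validator could look, assuming a literal <scratch>...</scratch> delimiter and a simple "- expression = value" line format (both are assumptions for illustration, not a fixed spec):

import re

SCRATCH_RE = re.compile(r"<scratch>(.*?)</scratch>", re.DOTALL)
EQUATION_RE = re.compile(r"^-\s*([\d\s+\-*/().]+)=\s*([\d.]+)\s*$")

def strip_scratch(model_output: str) -> str:
    """Remove scratch blocks before showing the answer to the user."""
    return SCRATCH_RE.sub("", model_output).strip()

def validate_scratch(model_output: str) -> list[str]:
    """Check arithmetic lines such as '- 6/12 = 0.5' inside scratch blocks
    and report any that do not hold."""
    errors = []
    for block in SCRATCH_RE.findall(model_output):
        for line in block.strip().splitlines():
            match = EQUATION_RE.match(line.strip())
            if not match:
                continue  # non-arithmetic notes, e.g. "- apples isa fruit"
            expression, claimed = match.groups()
            # eval is tolerable here because the regex only admits digits and operators
            actual = eval(expression)
            if abs(actual - float(claimed)) > 1e-9:
                errors.append(f"{expression.strip()} = {claimed} is wrong "
                              f"(expected {actual})")
    return errors

if __name__ == "__main__":
    output = ("<scratch>\n- 6/12 = 0.7\n- apples isa fruit\n</scratch>\n"
              "Therefore an apple costs €0.7.")
    print(validate_scratch(output))  # flags 6/12 = 0.7
    print(strip_scratch(output))     # user only sees the final sentence

A real validator would presumably rewrite the wrong value and re-prompt or re-decode the final answer rather than just report the mistake, but the same delimiting makes both filtering and correction straightforward.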
