Chain-of-thought finetuning Proposal #2023

@MattAlexMiracle

Description

Background and Rationale

Recent work on LLMs has shown large performance gains on prompted tasks when reasoning is externalised into the token stream (see https://arxiv.org/pdf/2205.11916.pdf, https://arxiv.org/pdf/2301.13379.pdf, https://arxiv.org/abs/2201.11903).
This process helps the model ground its answers by first drawing relevant facts into the foreground and then continuing with the answer:
Consider the prompt

"A dozen apples cost €6, what does a single apple cost?"
Model: €0.5

While contemporary models may attempt to answer this question directly, a chain-of-thought model will first retrieve relevant information and then answer the task:

"A dozen apples cost €6, what does a single apple cost?"
Model: A dozen apples are 12 apples. 6/12 = 0.5. Therefore an apple costs €0.5.

While the externalisation of such chains-of-thought may not always be necessary or appropriate, empirically it does seem to improve the overall performance of LLMs by giving models a temporary write-once-read-many memory.
It is unclear whether existing chat models such as ChatGPT are trained on such tasks explicitly, but the overall structure of "summarize the task appropriately -> answer the task" seems to be ingrained into ChatGPT.

Research Question

Because RLHF reinforces existing behaviour, it seems logical that the initial state of the model strongly determines downstream behaviour: after all, the reward model can only boost or penalize signals that are already present, as nonexistent signals will not get direct feedback.
The hypothesis is that a model finetuned on chain-of-thought data will carry chain-of-thought-style answers forward through RLHF. Such answers seem desirable both from a performance point of view (through the write-once-read-many memory) and from an interpretability point of view (externalised reasoning can be checked more easily than internalised reasoning).

Research Methodology

Gather appropriate explicit-reasoning datasets and finetune on them jointly with the assistant feedback data.
These datasets should contain correct and explicit reasoning chains.
Once this is done, existing reward models could be used to perform standard RLHF training.
During RLHF training, we may also be able to increase the use of explicit reasoning by using common seed prompts like "Let's think step by step" to encourage the model to produce explicit argument chains.
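As a rough illustration, the sketch below renders explicit-reasoning samples into an assistant dialogue format, optionally appends a seed prompt, and mixes them with ordinary assistant conversations for joint finetuning. The dialogue delimiter tokens, the sample fields, and the mixing ratio are assumptions chosen for illustration, not the project's actual data format.

from dataclasses import dataclass
import random

PROMPTER_TOKEN = "<|prompter|>"   # assumed dialogue delimiters, not the real ones
ASSISTANT_TOKEN = "<|assistant|>"
SEED_PROMPT = "Let's think step by step."

@dataclass
class ReasoningSample:
    question: str
    reasoning: str   # explicit chain of thought from a reasoning dataset
    answer: str

def to_dialogue(sample: ReasoningSample, use_seed_prompt: bool = True) -> str:
    """Render a reasoning sample in the same format as assistant conversations,
    keeping the chain of thought inside the assistant turn."""
    prompt = sample.question
    if use_seed_prompt:
        prompt = f"{prompt} {SEED_PROMPT}"
    completion = f"{sample.reasoning} Therefore, {sample.answer}"
    return f"{PROMPTER_TOKEN}{prompt}{ASSISTANT_TOKEN}{completion}"

def mix(assistant_dialogues: list[str],
        reasoning_samples: list[ReasoningSample],
        reasoning_fraction: float = 0.3) -> list[str]:
    """Interleave reasoning data with ordinary assistant dialogues for joint finetuning."""
    rendered = [to_dialogue(s) for s in reasoning_samples]
    n = int(reasoning_fraction * len(assistant_dialogues))
    mixed = assistant_dialogues + random.sample(rendered, min(n, len(rendered)))
    random.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    sample = ReasoningSample(
        question="A dozen apples cost €6, what does a single apple cost?",
        reasoning="A dozen apples are 12 apples. 6 / 12 = 0.5.",
        answer="an apple costs €0.5.",
    )
    print(to_dialogue(sample))

The mixing ratio and seed-prompt usage would of course need to be tuned empirically; the point of the sketch is only that reasoning data can be folded into the same dialogue format the assistant already trains on.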

Possible extensions

In the long run it might also be interesting to have these "additional information" memory components be delimited and written in an efficient way, e.g.

<scratch>
- a dozen = 12
- apples isa fruit
- 6/12 = 0.5
- ...
</scratch>

This would allow these memory components to be filtered out of the user-facing output stream and, should the scratch be sufficiently well formatted, could also allow retrieval-based information or explicit computation to be added (e.g. the model might write 6/12 = 0.7 and an explicit validator could first run through the <scratch> block to fix math mistakes).
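A minimal sketch of how such a filter and validator could look, assuming a literal <scratch>...</scratch> delimiter and a simple "- expression = value" line format (both are assumptions for illustration, not a fixed spec):

import re

SCRATCH_RE = re.compile(r"<scratch>(.*?)</scratch>", re.DOTALL)
EQUATION_RE = re.compile(r"^-\s*([\d\s+\-*/().]+)=\s*([\d.]+)\s*$")

def strip_scratch(model_output: str) -> str:
    """Remove scratch blocks before showing the answer to the user."""
    return SCRATCH_RE.sub("", model_output).strip()

def validate_scratch(model_output: str) -> list[str]:
    """Check arithmetic lines such as '- 6/12 = 0.5' inside scratch blocks
    and report any that do not hold."""
    errors = []
    for block in SCRATCH_RE.findall(model_output):
        for line in block.strip().splitlines():
            match = EQUATION_RE.match(line.strip())
            if not match:
                continue  # non-arithmetic notes, e.g. "- apples isa fruit"
            expression, claimed = match.groups()
            # eval is tolerable here because the regex only admits digits and operators
            actual = eval(expression)
            if abs(actual - float(claimed)) > 1e-9:
                errors.append(f"{expression.strip()} = {claimed} is wrong "
                              f"(expected {actual})")
    return errors

if __name__ == "__main__":
    output = ("<scratch>\n- 6/12 = 0.7\n- apples isa fruit\n</scratch>\n"
              "Therefore an apple costs €0.7.")
    print(validate_scratch(output))  # flags 6/12 = 0.7
    print(strip_scratch(output))     # user only sees the final sentence

A real validator would presumably rewrite the wrong value and re-prompt or re-decode the final answer rather than just report the mistake, but the same delimiting makes both filtering and correction straightforward.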
