Commit b0bb643

Simplify message and token format doc (#3324)
1 parent 4a397e0 commit b0bb643

File tree: 1 file changed (+15 −69 lines)

model/MESSAGE_AND_TOKEN_FORMAT.md

Lines changed: 15 additions & 69 deletions
```diff
@@ -1,13 +1,8 @@
 # Token Format
 
 When feeding text, a prompt, and answer, and so on to the model, it's not fed as
-individual letters in the way that us humans see it.
-
-Instead the input is broken into around 50,000 tokens (depending on the model).
-For example, in the `galactica-125m` model, the word "thermodynamical" is token
-49970,
-
-Each model has its own set of tokens ('vocab'), and each token has an id number.
+individual letters. Instead the input is broken into tokens. Each model has its
+own set of tokens ('vocab'), and each token has an id number.
 
 Note: If you look in the json file alongside a model and look at the vocab,
 you'll notice the strange letter Ġ. This is because the `bytes_to_unicode`
```
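The Ġ behaviour mentioned in the note can be illustrated with a minimal re-implementation of a GPT-2-style `bytes_to_unicode` (a sketch for illustration only; the byte ranges follow the GPT-2 tokenizer convention, not Open Assistant code verbatim):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a single visible unicode character, GPT-2 style."""
    # Bytes that already print nicely keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("\xa1"), ord("\xac") + 1))
        + list(range(ord("\xae"), 256))
    )
    mapping = {}
    shift = 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            # Everything else (space, control bytes, ...) is shifted past 255
            # so it still gets a distinct, visible character in the vocab file.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

table = bytes_to_unicode()
print(table[ord(" ")])  # the space byte becomes 'Ġ' (U+0120)
```

This is why tokens that begin a new word show up in the vocab json with a leading Ġ: it is the shifted representation of the space byte.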
```diff
@@ -23,8 +18,6 @@ produces a LLM model like gpt-3, galactica, etc.
 
 ## Step 1. Supervised Fine Tuning
 
-`/model/model_training`
-
 Using a pretrained LLM, we use Supervised Fine Tuning (SFT). We take
 demonstration data, in our case the Open Assistant dataset (oasst dataset)
 created by volunteers, to learn a supervised policy (the SFT model) that
```
```diff
@@ -42,8 +35,6 @@ process. With the two steps being done for more ongoing training.
 
 ## Step 2. Reward model (RM)
 
-`/model/reward`
-
 Volunteers vote on SFT model outputs, creating a new dataset consisting of
 comparison data. A new model is trained on this dataset. This is the reward
 model (RM).
```
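Comparison data of this kind is typically trained with a pairwise ranking loss. As a hedged sketch (the exact loss used in the Open Assistant reward code is not shown in this doc, and `pairwise_rm_loss` is my own name), a common Bradley-Terry-style formulation is:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    # Push the reward model's score for the preferred answer above the
    # rejected one: loss = -log(sigmoid(r_chosen - r_rejected)).
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# When the model already ranks the chosen answer higher, the loss is small:
print(pairwise_rm_loss(2.0, -1.0))  # ≈ 0.049
```

Minimizing this over many volunteer comparisons teaches the RM to assign higher scalar scores to answers humans prefer.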
````diff
@@ -65,82 +56,37 @@ The most up-to-date place to look for code is here:
 
 https://github.com/Open-Assistant/oasst-model-eval/blob/main/model_eval/manual/sampling_report.py
 
-## Message Format v1
-
-Format:
-
-```
-{pre_text}
-{human_name}: {prompt}
-
-{bot_name}:
-```
-
-Example:
-
-```
-You are a helpful assistant called Joi trained by OpenAssistant on large corpus of data, you will now help user to answer the question as concise as possible
-User: hello!
-
-Joi:
-```
-
 ## Message Format v2
 
-**Note:** There is a variation where the `<prefix>` and `</prefix>` tags are
-omitted, leaving only the `pre_text`. This will be specified in the
-`config.add_prefix_tokens`.
-
-**Note:** That `<prefix>` refers to a specific token that the model knows about.
-If the user literally typed `<prefix>` then that would be tokenized to
-completely different non-special tokens, reducing possible attacks.
+This is used by most Open Assistant models.
 
 Format:
 
-```xml
-<prefix>{pre_text}</prefix><human>{prompt}<bot>
 ```
-
-Example:
-
-```xml
-<prefix>You are a helpful assistant called Joi trained by OpenAssistant on large corpus of data, you will now help user to answer the question as concise as possible</prefix><human>hello!<bot>
+<|prompter|>{prompt}<|endoftext|><|assistant|>
 ```
 
-Model will then reply, padding the ending with **zero or more** `<|endoftext|>`.
-This is just to make entries in a batch the same size.
-
-## Message Format v2.5 old
-
-**Note:** There is a variation where the `<|prefix_begin|>` and `<|prefix_end|>`
-tags are omitted, leaving only the `pre_text`. This will be specified in the
-`config.add_prefix_tokens`.
-
-**Note:** That `<|prefix_begin|>` etc are special tokens, and refers to a
-specific special token that the model knows about. If the user literally typed
-`<|prefix_begin|>` then that would be tokenized to completely different
-non-special tokens, reducing possible attacks.
-
-Format:
-
-```
-<|prefix_begin|>{pre_text}<|prefix_end|><|prompter|>{prompt}<|endoftext|><|assistant|>
-```
+There is no specific prefix tag. A prefix could be placed before the first
+prompter message or in the first prompter message.
 
 Example:
 
 ```
-<|prefix_begin|>You are a large language model that wants to be helpful<|prefix_end|>
+You are a large language model that wants to be helpful<|prompter|>Hello!<|endoftext|><|assistant|>
 ```
 
 Model will then reply, padding the ending with **zero or more** `<|endoftext|>`.
 This is just to make entries in a batch the same size.
 
-## Message Format v2.5-new (Not currently committed at time of writing)
+## Message Format v2-new
+
+Experiments are ongoing with new Open Assistant models using this format.
 
 **Note:** The `<|system|>{prefix}<|endoftext|>` is omitted entirely if `prefix`
-is empty. **Note:** I've added newlines and comment just for readability here.
-They aren't in the format.
+is empty.
+
+**Note:** I've added newlines and comment just for readability here. They aren't
+in the format.
 
 Format:
 
````
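The v2 format can be assembled with a small helper (a sketch; `build_v2_prompt` is a hypothetical name, not a function from the Open Assistant codebase):

```python
def build_v2_prompt(prompt: str, prefix: str = "") -> str:
    # Message Format v2: no dedicated prefix tag; an optional prefix is
    # simply placed before the first prompter message.
    return f"{prefix}<|prompter|>{prompt}<|endoftext|><|assistant|>"

print(build_v2_prompt(
    "Hello!",
    "You are a large language model that wants to be helpful",
))
```

The model then generates its reply after the trailing `<|assistant|>` marker, padding the end with zero or more `<|endoftext|>` tokens for batching.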

````diff
@@ -157,7 +103,7 @@ Format:
 Example (newlines added for readability):
 
 ```
-<|prefix_begin|>You are a large language model that wants to be helpful<|prefix_end|>
+<|system|>You are a large language model that wants to be helpful<|system|>
 <|prompter|>What is red and round?<|endoftext|><|assistant|>Hmm, a red balloon?<|endoftext|>
 <|prompter|>No, smaller<|endoftext|><|assistant|>
 ```
````
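Assembling a multi-turn v2-new prompt can be sketched as follows. This is a hypothetical helper, not code from the repository; it follows the `<|system|>{prefix}<|endoftext|>` form from the note above, while the example block shows a `<|system|>...<|system|>` variant, so treat the exact system-tag layout as an assumption:

```python
def build_v2_new_prompt(turns, prefix=""):
    # turns: list of (prompter_text, assistant_text_or_None); a None
    # assistant slot leaves the prompt open for the model to complete.
    parts = []
    if prefix:
        # The system block is omitted entirely when the prefix is empty.
        parts.append(f"<|system|>{prefix}<|endoftext|>")
    for user_text, assistant_text in turns:
        parts.append(f"<|prompter|>{user_text}<|endoftext|><|assistant|>")
        if assistant_text is not None:
            parts.append(f"{assistant_text}<|endoftext|>")
    return "".join(parts)

print(build_v2_new_prompt(
    [("What is red and round?", "Hmm, a red balloon?"), ("No, smaller", None)],
    prefix="You are a large language model that wants to be helpful",
))
```

Note the output is a single line; the newlines in the example above exist only for readability, as the doc's note says.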
