Prepares tokenized input sequences and corresponding labels for training the Cerebros
[not so] large language model.

This function takes raw text data, tokenizes it, and applies a sliding window approach to
generate input-label pairs for next-token prediction tasks. It assumes that each sample may
contain a special token `</prompt>` which separates the prompt from the completion. If this
token is not present, the sample is treated as a non-instruct example and a default prompt
length (1 token) is used.

For each token after the prompt (up to the first padding token), it creates an input sequence
consisting of all tokens up to (but not including) that token, and sets the label as a one-hot
encoded vector of the target token. A final sample is added where the label is the pad token,
indicating the end of the sequence. (A sketch of this expansion appears after the docstring.)

Parameters:
-----------
...

Returns:
--------
tuple:
    - all_input_ids (List[List[int]]): Token IDs for each input sequence, shaped
      [num_samples, max_seq_length].
    - all_labels (List[List[int]]): One-hot encoded labels for next-token prediction,
      shaped [num_samples, vocab_size].

Notes
------
- Special tokens like `</prompt>` are handled manually; there is no automatic special token insertion.
- Padding is done to MAX_SEQ_LENGTH using the tokenizer's pad token ID.
- The function assumes the global variables `tokenizer`, `MAX_SEQ_LENGTH`, `PROMPT_LENGTH`, and
  `vocab_size` are defined in the scope where this function is called.
"""
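
To make the expansion concrete, the following is a minimal, self-contained sketch of the sliding-window
scheme the docstring describes; it is not the actual prepare_data() implementation, and the names
expand_sample, sample_tokens, and pad_token_id are illustrative assumptions (the real function also
performs tokenization and the </prompt> lookup).

def expand_sample(sample_tokens, end_prompt_index, pad_token_id,
                  vocab_size, max_seq_length):
    """Fan one tokenized sample out into (input, one-hot label) pairs."""
    def pad(seq):
        # Right-pad (or truncate) a context window to max_seq_length.
        return (seq + [pad_token_id] * max_seq_length)[:max_seq_length]

    def one_hot(token_id):
        # One-hot encode a single token ID over the vocabulary.
        vec = [0] * vocab_size
        vec[token_id] = 1
        return vec

    inputs, labels = [], []
    i = end_prompt_index + 1
    # One pair per completion token, stopping at the first pad token.
    while i < len(sample_tokens) and sample_tokens[i] != pad_token_id:
        inputs.append(pad(sample_tokens[:i]))     # everything before token i
        labels.append(one_hot(sample_tokens[i]))  # the token to predict
        i += 1
    # Terminal pair: the label is the pad token, marking the end of the sequence.
    inputs.append(pad(sample_tokens[:i]))
    labels.append(one_hot(pad_token_id))
    return inputs, labels

With MAX_SEQ_LENGTH and vocab_size supplied from the surrounding globals, the returned lists match the
[num_samples, max_seq_length] and [num_samples, vocab_size] shapes documented above.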
        # ... later in prepare_data(), when locating the end of the prompt ...
        except ValueError:
            # If </prompt> is not found, treat the sample as a non-instruct sample
            end_prompt_index = (
                prompt_length - 1
            )  # int(np.ceil(len(sample_tokens) * (1/3)))  # 0
            # 1. Give it a fair starting place to predict the next word.
            # 2. Reduce the number of expanded samples.
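
For context, here is a hedged sketch of this branch's surroundings, assuming the prompt boundary is
located with list.index() on the token IDs, which raises ValueError when </prompt> is absent; the
function name find_prompt_end and the argument prompt_end_token_id are illustrative, not taken from
the actual source.

def find_prompt_end(sample_tokens, prompt_end_token_id, prompt_length=1):
    # Return the index of the last prompt token for one tokenized sample.
    try:
        # list.index() raises ValueError when </prompt> does not occur.
        return sample_tokens.index(prompt_end_token_id)
    except ValueError:
        # Non-instruct sample: fall back to the default 1-token prompt,
        # giving the model a fair starting token while limiting the number
        # of expanded samples.
        return prompt_length - 1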