Commit 7f074ac

feat: Make tokenizer add_special_tokens option configurable
In particular, so that it can be disabled for chat/instruct models where an explicit prompt template is used that already includes these tokens (for example, the leading <s> token added by the Llama and Mixtral tokenizers).

Signed-off-by: Nick Hill <[email protected]>
1 parent 08573c0
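For context, a minimal sketch of the behavior this commit makes configurable, assuming a Hugging Face transformers Llama-family tokenizer (the checkpoint name below is illustrative, not taken from this commit):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any Llama/Mixtral-style tokenizer behaves this way.
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Default behavior: the tokenizer prepends the BOS token <s>, so
# "Once upon a time," encodes to 6 ids (<s>, Once, upon, a, time, ",").
with_bos = tok("Once upon a time,").input_ids

# Disabled: the right choice when a chat/instruct template has already
# placed <s> in the prompt text, which would otherwise be duplicated.
without_bos = tok("Once upon a time,", add_special_tokens=False).input_ids

print(len(with_bos), len(without_bos))  # 6 5
```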

File tree

11 files changed: +528 -35 lines

integration_tests/test_cases_tinyllama.yaml

Lines changed: 53 additions & 1 deletion
@@ -28,14 +28,66 @@
       method: GREEDY
       stopping: {"maxNewTokens": 17}
     requests:
-    - {"text": "Once upon a time,"}
+      - {"text": "Once upon a time,"}
   response:
     responses:
     - generatedTokenCount: 17
       inputTokenCount: 6
       stopReason: MAX_TOKENS
       text: ' there was a little girl named Lily. She loved to play with her toy car.'
 
+
+# Input tokens
+- name: Return input tokens
+  request:
+    params:
+      method: GREEDY
+      stopping: {"maxNewTokens": 8}
+      response:
+        inputTokens: true
+    requests:
+      - {"text": "Once upon a time,"}
+  response:
+    responses:
+    - generatedTokenCount: 8
+      inputTokenCount: 6
+      inputTokens:
+      - text: <s>
+        #TODO this is wrong token: https://github.com/huggingface/transformers/issues/28622
+      - text: "\u2581Once"
+      - text: "\u2581upon"
+      - text: "\u2581a"
+      - text: "\u2581time"
+      - text: ','
+      stopReason: MAX_TOKENS
+      text: ' there was a little girl named Lily.'
+
+
+# Tokenize with tokens
+- name: Tokenize with tokens
+  request_type: tokenize
+  request:
+    return_tokens: true
+    requests:
+      - {"text": "The very long story is written by a very long story"}
+  response:
+    responses:
+    - tokenCount: 12
+      tokens:
+      - <s>
+      - "\u2581The"
+      - "\u2581very"
+      - "\u2581long"
+      - "\u2581story"
+      - "\u2581is"
+      - "\u2581written"
+      - "\u2581by"
+      - "\u2581a"
+      - "\u2581very"
+      - "\u2581long"
+      - "\u2581story"
+
+
 - name: Long input with tokens truncated
   request:
     params:
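A note on the "\u2581" escapes in the expected tokens above: this is U+2581 ("▁"), the marker SentencePiece-based tokenizers use for a word-initial space. A small self-contained sketch (plain Python, no tokenizer required) showing how the expected pieces map back to the input text and to the expected tokenCount of 12:

```python
# The pieces expected by the "Tokenize with tokens" case above, minus <s>.
pieces = ["\u2581The", "\u2581very", "\u2581long", "\u2581story", "\u2581is",
          "\u2581written", "\u2581by", "\u2581a", "\u2581very", "\u2581long",
          "\u2581story"]

# U+2581 marks a word-initial space, so replacing it recovers the text.
text = "".join(pieces).replace("\u2581", " ").lstrip()
assert text == "The very long story is written by a very long story"

# 11 word pieces plus the <s> BOS token gives the expected tokenCount of 12.
print(1 + len(pieces))  # 12
```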
