Commit d4f7cd5
Add support for chat templates (#408)
* Add basic support for chat templates
* Cleanup
* JSDoc improvements
* Support conversion of user-defined functions
* Cleanup
* Fix function creation
* Add unit tests for templates
* Cleanup
* Improve JSDoc
* Add missing return types
* Add chat templates docs to table of contents
* Add support for logical negation
* Fix nested logical negation
* Add unit tests for logical operators
* Add loop variables
* Add support for `RuntimeValue` built-in functions
* Add unit tests for string instance methods
* Fix conversion of normal function to `FunctionValue`
* Update object method unit tests
* Save chat template to tokenizer_config.json during conversion
* Fix `raise_exception` error
* Add `!=` operator for booleans
* Remember to increment loop index
* Cleanup for loop evaluator
* Use `is` helper function
* Add support for text nodes, i.e., non-Jinja statements/expressions
* Add auto-generated templating tests
* Update unit tests
* Remove unused function
* Add default chat templates
* Use repo with up-to-date tokenizer config
* Temporarily disable zephyr test
* Delete templates.test.js
* Move Jinja functionality to `@huggingface/jinja`
* Fix template cache type
* Update chat template unit tests
* Update `@huggingface/jinja` version
* Fix default llama2 system prompt usage
* Add unit test for llama2 w/o chat template set
* Update jinja version
* Update jinja version
* Add unit test for user-defined chat templates. Example from https://discuss.huggingface.co/t/issue-with-llama-2-chat-template-and-out-of-date-documentation/61645/3
* Add `AddedToken` for improved tokenization
* Add example usage for chat templates
* Add 'first' Metaspace pretokenizer prepend scheme
* Formatting
* Update wav2vec2 converter special tokens whitespace split
* Fix Metaspace pretokenizer split criteria
* Update inputs of `PreTokenizerSequence`
* Improve Metaspace pretokenizer
* Update llama tokenizer tests
* Improve handling of legacy llama tokenizer
* Re-enable SPM tests
* Add static tokenizer test cases
* Add llama2 static tests
* Allow user to override legacy tokenizer behaviour in `.from_pretrained`
* Add legacy tokenizer unit tests
* Bump jinja version to 0.1.0
1 parent 6129e45 commit d4f7cd5
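
One of the bullets above is "Add example usage for chat templates". As a rough sketch of how the new feature is meant to be used from JavaScript (assuming the method mirrors the Python library's `apply_chat_template` API, and that the package is imported under its then-current name `@xenova/transformers`):

```js
import { AutoTokenizer } from "@xenova/transformers";

// Load a tokenizer whose tokenizer_config.json includes a chat_template
const tokenizer = await AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1");

const chat = [
  { role: "user", content: "Hello, how are you?" },
  { role: "assistant", content: "I'm doing great. How can I help you today?" },
  { role: "user", content: "I'd like to show off how chat templating works!" },
];

// Render the conversation into a single prompt string via the Jinja chat template
const prompt = tokenizer.apply_chat_template(chat, { tokenize: false });
console.log(prompt);
```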

File tree

8 files changed (+733, -111 lines)


package-lock.json

Lines changed: 12 additions & 0 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 3 additions & 0 deletions
@@ -44,6 +44,9 @@
   "optionalDependencies": {
     "onnxruntime-node": "1.14.0"
   },
+  "peerDependencies": {
+    "@huggingface/jinja": "^0.1.0"
+  },
   "devDependencies": {
     "@types/jest": "^29.5.1",
     "catharsis": "github:xenova/catharsis",

scripts/convert.py

Lines changed: 7 additions & 0 deletions
@@ -283,6 +283,13 @@ def main():
         # Load tokenizer
         tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
 
+        # To avoid inserting all chat templates into tokenizers.js, we save the chat template
+        # to the tokenizer_config.json file, and load it when the tokenizer is loaded.
+        if getattr(tokenizer, 'chat_template', None) is None and \
+           getattr(tokenizer, 'use_default_system_prompt', False):
+            # No chat template specified, and we use the default
+            setattr(tokenizer, 'chat_template', tokenizer.default_chat_template)
+
     except KeyError:
         pass # No Tokenizer
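
The change above persists the chat template (explicit or default) into tokenizer_config.json instead of hard-coding per-model templates in tokenizers.js. A hedged sketch of how that saved field could be consumed on the JavaScript side; the local path, context keys, and token fallbacks are illustrative assumptions, not part of this commit:

```js
import { readFileSync } from "node:fs";
import { Template } from "@huggingface/jinja";

// Illustrative path to a locally converted model directory
const config = JSON.parse(
  readFileSync("./models/my-converted-model/tokenizer_config.json", "utf-8")
);

if (config.chat_template) {
  // The field saved by convert.py is a plain Jinja template string
  const template = new Template(config.chat_template);
  const prompt = template.render({
    messages: [{ role: "user", content: "Hi there!" }],
    // bos_token/eos_token may be strings or AddedToken objects; fall back to "" here
    bos_token: typeof config.bos_token === "string" ? config.bos_token : "",
    eos_token: typeof config.eos_token === "string" ? config.eos_token : "",
    add_generation_prompt: true,
  });
  console.log(prompt);
}
```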

scripts/extra/wav2vec2.py

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ def generate_tokenizer_json(tokenizer):
             "id": v,
             "content": k,
             "single_word": False,
-            "lstrip": False,
-            "rstrip": False,
+            "lstrip": True,
+            "rstrip": True,
             "normalized": False,
             "special": True
         }

0 commit comments