
Conversation

@ggerganov (Member)

fix #9606

During vocab construction, iterate over all tokens and store every token that looks like it might cause an "end of generation" event (e.g. <EOT>, <endoftext>, <im_end>, etc.). llama_token_is_eog will now check this set of tokens to determine the EOG status.
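The idea can be sketched roughly as follows. This is a minimal, hypothetical illustration of the heuristic, not llama.cpp's actual implementation: the struct and member names are invented for the example, and the substring list is an assumption based on the token names mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: scan the vocab once at load time and remember every
// token whose text looks like an end-of-generation marker.
struct vocab_sketch {
    std::vector<std::string> id_to_text; // token id -> token text
    std::set<int>            eog_ids;    // ids detected as EOG tokens

    void detect_eog_tokens() {
        for (int id = 0; id < (int) id_to_text.size(); ++id) {
            std::string text = id_to_text[id];
            // normalize case so "<EOT>" and "<eot>" match the same pattern
            std::transform(text.begin(), text.end(), text.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            // substrings that commonly appear in end-of-generation tokens
            // (illustrative list, not the exact one used by the PR)
            if (text.find("eot")       != std::string::npos ||
                text.find("endoftext") != std::string::npos ||
                text.find("im_end")    != std::string::npos) {
                eog_ids.insert(id);
            }
        }
    }

    // llama_token_is_eog-style check: membership in the detected set
    bool token_is_eog(int id) const {
        return eog_ids.count(id) > 0;
    }
};
```

With this approach, a model whose tokenizer metadata marks only one of its stop tokens as EOS (as with Qwen2.5-Coder's <|endoftext|> vs. <|im_end|>) still stops correctly, because all EOG-looking tokens are caught by the scan.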

Detected EOG tokens are printed like this (Qwen2.5-Coder):

0.00.190.685 I llm_load_print_meta: model size       = 14.19 GiB (16.00 BPW) 
0.00.190.685 I llm_load_print_meta: general.name     = Qwen2.5 Coder 7B Instruct
0.00.190.685 I llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
0.00.190.686 I llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
0.00.190.686 I llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
0.00.190.686 I llm_load_print_meta: LF token         = 148848 'ÄĬ'
0.00.190.696 I llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
0.00.190.697 I llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
0.00.190.697 I llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
0.00.190.698 I llm_load_print_meta: max token length = 256
0.00.190.734 I llm_load_tensors: ggml ctx size =    0.30 MiB

This is yet another hack for handling end-of-... tokens. The best way to fix this is to have proper tokenizer configurations, but as discussed in #9606, this is unlikely to happen.

@tristandruyen (Contributor)

Seems to fix Qwen2.5-Coder at least #9606 (comment)

@ggerganov ggerganov merged commit 31ac583 into master Sep 24, 2024
59 checks passed
@ggerganov ggerganov deleted the gg/eog-ids branch September 24, 2024 07:16
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

Development

Successfully merging this pull request may close these issues.

Bug: Qwen2.5-Coder variants do not properly stop in FIM mode
