As of today, we support 5 tokenizer implementations:

```cpp
LLAMA_VOCAB_TYPE_SPM  = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback
LLAMA_VOCAB_TYPE_BPE  = 2, // GPT-2 tokenizer based on byte-level BPE
LLAMA_VOCAB_TYPE_WPM  = 3, // BERT tokenizer based on WordPiece
LLAMA_VOCAB_TYPE_UGM  = 4, // T5 tokenizer based on Unigram
LLAMA_VOCAB_TYPE_RWKV = 5, // RWKV tokenizer based on greedy tokenization
```

The function `llama_tokenize_internal` in `llama-vocab.cpp` currently constructs a tokenizer instance on every call, which for some of the tokenizers incurs significant overhead. This should be avoided by pre-constructing the tokenizer object upon `llama_vocab` creation and abstracting the objects (e.g. `llm_tokenizer_spm`, `llm_tokenizer_bpe`, etc.) behind a common interface.
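For illustration, a minimal sketch of what such a common interface could look like (the class layout here is hypothetical, loosely modeled on the existing `llm_tokenizer_spm`):

```cpp
#include <memory>

struct llama_vocab; // forward declaration

// Common base class so llama_vocab can own a single pre-constructed
// tokenizer instead of building one on every llama_tokenize_internal call.
struct llm_tokenizer {
    virtual ~llm_tokenizer() = default;
};

// Example concrete tokenizer: the constructor does the expensive one-time
// work (building tries, merge ranks, lookup tables) at vocab-load time.
struct llm_tokenizer_spm : llm_tokenizer {
    explicit llm_tokenizer_spm(const llama_vocab & /*vocab*/) {
        // pre-compute immutable data here
    }
};

struct llama_vocab {
    // ... existing vocab members ...
    std::unique_ptr<llm_tokenizer> tokenizer; // constructed once, reused by all calls
};
```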
However, we want `llama_tokenize_internal` to remain thread-safe, as it currently is (I think). Therefore, the tokenizer objects would likely need to be split into two parts:
- immutable pre-computed data (such as tries and lookup tables)
- mutable work data
The former will be initialized once, upon `llama_vocab` creation. The latter will be created on each call to `llama_tokenize_internal` and will store fleeting data while tokenizing.
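Continuing the sketch above, the per-call part could look roughly like this (again, the names are hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t; // as in llama.h

// Per-call session: keeps a read-only reference to the shared tokenizer plus
// its own scratch buffers, so concurrent callers never touch shared mutable state.
struct llm_tokenizer_spm_session {
    explicit llm_tokenizer_spm_session(const llm_tokenizer_spm & tokenizer)
        : tokenizer(tokenizer) {}

    void tokenize(const std::string & text, std::vector<llama_token> & output) {
        // walk the pre-computed tries/tables of `tokenizer`,
        // using only the local work buffers below
        (void) text; (void) output;
    }

private:
    const llm_tokenizer_spm & tokenizer; // immutable, shared across threads
    // fleeting work data (symbol buffer, bigram queue, ...) lives here
};

// llama_tokenize_internal would then construct a cheap session per call:
//   llm_tokenizer_spm_session session(spm_tokenizer);
//   session.tokenize(raw_text, output);
```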
A test that checks thread-safety for all tokenizers using a thread sanitizer would be useful.
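For example, a test along these lines (the `tokenize()` wrapper is a hypothetical helper around the public tokenization API), built with `-fsanitize=thread`, could exercise the tokenizers from multiple threads:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

using llama_token = int32_t; // as in llama.h

// hypothetical wrapper that tokenizes `text` with a loaded model's vocab
std::vector<llama_token> tokenize(const std::string & text);

// Spawn several threads that tokenize the same text through the shared vocab
// and check that every thread gets the same result; under TSan this would
// also surface any data races in the tokenizer objects.
void check_tokenizer_thread_safety(const std::string & text, int n_threads) {
    const std::vector<llama_token> expected = tokenize(text);

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&] {
            for (int iter = 0; iter < 1000; ++iter) {
                assert(tokenize(text) == expected);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```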
This should resolve #9180 and also help multi-thread the tokenization process in `llama-server`.
While working on this, `llama-vocab.cpp` could use various simplifications and improvements as well.