As of today, we support 5 tokenizer implementations:

```cpp
LLAMA_VOCAB_TYPE_SPM  = 1, // LLaMA tokenizer based on byte-level BPE with byte fallback
LLAMA_VOCAB_TYPE_BPE  = 2, // GPT-2 tokenizer based on byte-level BPE
LLAMA_VOCAB_TYPE_WPM  = 3, // BERT tokenizer based on WordPiece
LLAMA_VOCAB_TYPE_UGM  = 4, // T5 tokenizer based on Unigram
LLAMA_VOCAB_TYPE_RWKV = 5, // RWKV tokenizer based on greedy tokenization
```

The function `llama_tokenize_internal` in `llama-vocab.cpp` currently constructs a tokenizer instance on every call, which for some of the tokenizers incurs significant overhead. This should be avoided by pre-constructing the tokenizer object upon `llama_vocab` creation and abstracting the objects (e.g. `llm_tokenizer_spm`, `llm_tokenizer_bpe`, etc.) behind a common interface.
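For illustration, a minimal sketch of what such a common interface could look like (the class layout here is hypothetical, loosely modeled on the existing `llm_tokenizer_spm`):

```cpp
#include <memory>

struct llama_vocab; // forward declaration

// Common base class so llama_vocab can own a single pre-constructed
// tokenizer instead of building one on every llama_tokenize_internal call.
struct llm_tokenizer {
    virtual ~llm_tokenizer() = default;
};

// Example concrete tokenizer: the constructor does the expensive one-time
// work (building tries, merge ranks, lookup tables) at vocab-load time.
struct llm_tokenizer_spm : llm_tokenizer {
    explicit llm_tokenizer_spm(const llama_vocab & /*vocab*/) {
        // pre-compute immutable data here
    }
};

struct llama_vocab {
    // ... existing vocab members ...
    std::unique_ptr<llm_tokenizer> tokenizer; // constructed once, reused by all calls
};
```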
However, we want `llama_tokenize_internal` to remain thread-safe, as it currently is (I think). Therefore, the tokenizer objects would likely need to be split into two parts:
- immutable pre-computed data (such as tries and lookup tables)
- mutable work data
The former will be initialized once, upon `llama_vocab` creation. The latter will be created on each call to `llama_tokenize_internal` and will store fleeting data while tokenizing.
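Continuing the sketch above, the per-call part could look roughly like this (again, the names are hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using llama_token = int32_t; // as in llama.h

// Per-call session: keeps a read-only reference to the shared tokenizer plus
// its own scratch buffers, so concurrent callers never touch shared mutable state.
struct llm_tokenizer_spm_session {
    explicit llm_tokenizer_spm_session(const llm_tokenizer_spm & tokenizer)
        : tokenizer(tokenizer) {}

    void tokenize(const std::string & text, std::vector<llama_token> & output) {
        // walk the pre-computed tries/tables of `tokenizer`,
        // using only the local work buffers below
        (void) text; (void) output;
    }

private:
    const llm_tokenizer_spm & tokenizer; // immutable, shared across threads
    // fleeting work data (symbol buffer, bigram queue, ...) lives here
};

// llama_tokenize_internal would then construct a cheap session per call:
//   llm_tokenizer_spm_session session(spm_tokenizer);
//   session.tokenize(raw_text, output);
```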
A test that checks thread-safety for all tokenizers using a thread sanitizer would be useful.
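For example, a test along these lines (the `tokenize()` wrapper is a hypothetical helper around the public tokenization API), built with `-fsanitize=thread`, could exercise the tokenizers from multiple threads:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

using llama_token = int32_t; // as in llama.h

// hypothetical wrapper that tokenizes `text` with a loaded model's vocab
std::vector<llama_token> tokenize(const std::string & text);

// Spawn several threads that tokenize the same text through the shared vocab
// and check that every thread gets the same result; under TSan this would
// also surface any data races in the tokenizer objects.
void check_tokenizer_thread_safety(const std::string & text, int n_threads) {
    const std::vector<llama_token> expected = tokenize(text);

    std::vector<std::thread> workers;
    for (int i = 0; i < n_threads; ++i) {
        workers.emplace_back([&] {
            for (int iter = 0; iter < 1000; ++iter) {
                assert(tokenize(text) == expected);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
}
```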
This should resolve #9180 and also help multi-thread the tokenization process in `llama-server`.
While working on this, `llama-vocab.cpp` could use various simplifications and improvements as well.