
Commit 41a1733

Add comments for tokenizers
1 parent 726eec2 commit 41a1733

File tree

2 files changed: +28 −3 lines changed

src/main/java/com/example/tokenizer/impl/LlamaTokenizer.java

Lines changed: 15 additions & 0 deletions

@@ -10,6 +10,21 @@
 import java.util.stream.Collectors;
 import java.util.stream.IntStream;
 
+/**
+ * GPT-2-style BPE tokenizer (even though it is called "llama") with an explicit merges list.
+ * <p>
+ * BPE (Byte Pair Encoding):
+ * A sub-word tokenization algorithm that iteratively merges the most frequent pairs of symbols in a corpus to build a vocabulary of common character sequences.
+ * <p>
+ * GPT-2-style tokenization:
+ * Applies BPE at the byte level, ensuring all UTF-8 inputs are representable, and uses tokens that preserve leading spaces (e.g. 'Ġthe').
+ * <p>
+ * Explicit merges list:
+ * A fixed sequence of learned merge rules that deterministically reconstructs the tokenizer's vocabulary during inference without retraining.
+ * <p>
+ * Based on <a href="https://github.com/karpathy/minbpe">minbpe</a>; algorithmically follows the
+ * <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">GPT-2 tokenizer</a>.
+ */
 public class LlamaTokenizer implements Tokenizer {
     // general fields
     private final Pattern compiledPattern;
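The merge step that the new Javadoc describes can be sketched in a few lines of Java. This is a hypothetical minimal example (the class and method names are invented; it is not the repo's LlamaTokenizer): it counts adjacent token-id pairs and applies one merge rule, the core operation that an explicit merges list replays deterministically at inference time.

```java
import java.util.*;

// Minimal BPE sketch (hypothetical, for illustration only):
// find the most frequent adjacent pair of token ids and replace
// every occurrence of it with a newly allocated id.
public class BpeSketch {
    // Count how often each adjacent pair of ids occurs in the sequence.
    static Map<List<Integer>, Integer> pairCounts(List<Integer> tokens) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(List.of(tokens.get(i), tokens.get(i + 1)), 1, Integer::sum);
        }
        return counts;
    }

    // Replace every occurrence of `pair` with the single id `newId`.
    static List<Integer> merge(List<Integer> tokens, List<Integer> pair, int newId) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()
                    && tokens.get(i).equals(pair.get(0))
                    && tokens.get(i + 1).equals(pair.get(1))) {
                out.add(newId);
                i += 2; // skip both members of the merged pair
            } else {
                out.add(tokens.get(i));
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "aaab" as raw byte values: 97, 97, 97, 98
        List<Integer> tokens = new ArrayList<>(List.of(97, 97, 97, 98));
        List<Integer> top = Collections.max(
                pairCounts(tokens).entrySet(), Map.Entry.comparingByValue()).getKey();
        // Byte-level BPE reserves ids 0..255 for raw bytes, so the
        // first merged token gets id 256.
        tokens = merge(tokens, top, 256);
        System.out.println(tokens); // prints [256, 97, 98]
    }
}
```

Training repeats this pick-and-merge loop on a corpus and records each chosen pair in order; that recorded sequence is the "explicit merges list" the Javadoc refers to.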

src/main/java/com/example/tokenizer/impl/MistralTokenizer.java

Lines changed: 13 additions & 3 deletions

@@ -7,10 +7,20 @@
 import java.util.regex.Pattern;
 
 /**
- * Byte Pair Encoding tokenizer.
+ * TikToken-style BPE tokenizer with byte fallback.
  * <p>
- * Based on <a href="https://github.com/karpathy/minbpe">minbpe</a>, algorithmically follows along the
- * <a href="https://github.com/openai/gpt-2/blob/master/src/encoder.py">GPT 2 tokenizer</a>
+ * TikToken-style:
+ * A Byte Pair Encoding (BPE) strategy that converts text to UTF-8 bytes.
+ * Frequent pairs of bytes (or tokens) are merged according to a learned vocabulary.
+ * This reduces long words to common subwords or whole-word tokens.
+ * If a word or character is not found, it falls back to byte-level tokens.
+ * <p>
+ * Byte fallback:
+ * A fail-safe mechanism.
+ * It ensures every byte has a token, so any input (even unknown words, misspellings, foreign languages, emojis, or binary) can be tokenized.
+ * If a token is not found in the merges or vocabulary, it falls back to the individual bytes.
+ * Each byte is wrapped as a special token like <0xF0>; these tokens are part of the tokenizer's extended vocabulary.
+ * This guarantees reversibility: every string can be tokenized and decoded back exactly.
  */
 public class MistralTokenizer implements Tokenizer {
     // general fields
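The byte-fallback behaviour described in this Javadoc can be illustrated with a small sketch. This is a hypothetical example, not the repo's MistralTokenizer: a piece that is not in the (toy) vocabulary is split into one `<0xNN>` token per UTF-8 byte, so every input stays encodable.

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical byte-fallback sketch: look a piece up in a toy vocabulary,
// and emit one <0xNN> byte token per UTF-8 byte when it is unknown.
public class ByteFallbackSketch {
    static List<String> encode(String piece, Set<String> vocab) {
        if (vocab.contains(piece)) {
            return List.of(piece); // known piece: one token
        }
        // Unknown piece: fall back to one special token per UTF-8 byte.
        List<String> out = new ArrayList<>();
        for (byte b : piece.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF because Java bytes are signed.
            out.add(String.format("<0x%02X>", b & 0xFF));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("hello", "world");
        System.out.println(encode("hello", vocab)); // prints [hello]
        // U+1F642 is the 4-byte UTF-8 sequence F0 9F 99 82.
        System.out.println(encode("\uD83D\uDE42", vocab)); // prints [<0xF0>, <0x9F>, <0x99>, <0x82>]
    }
}
```

Because each `<0xNN>` token maps back to exactly one byte, decoding simply concatenates the bytes and re-interprets them as UTF-8, which is what makes the round trip lossless.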
