This is an extended version of the Moonlight-16B-A3B-Instruct tokenizer that supports dynamically adding new special tokens.
The original Moonlight tokenizer uses tiktoken under the hood, and tiktoken.Encoding objects are immutable once created. As a result, special tokens added after initialization are never integrated into the encoding: the tokenizer treats them as ordinary text and splits them into sub-tokens.
This extended tokenizer overrides the add_special_tokens() method to recreate the underlying tiktoken.Encoding object when new tokens are added, properly integrating them into the vocabulary.
```python
from transformers import AutoTokenizer

# Load the extended tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/vde/workspace/osilkin/experiment-repos/moonlight-extended-tokenizer",
    trust_remote_code=True,
)
print(f"Initial vocab size: {tokenizer.vocab_size}")  # 163842

# Add new special tokens - this actually works now!
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|RICK_ROSS|>"]
})
print(f"New vocab size: {tokenizer.vocab_size}")  # 163843

# Test encoding
text = "Hello <|RICK_ROSS|> world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
# The token <|RICK_ROSS|> is now a single token in the vocabulary!
```

- ✅ Vocab size increases correctly when adding tokens
- ✅ New tokens are encoded as single token IDs (not split into sub-tokens)
- ✅ Full encode/decode compatibility maintained
- ✅ Can add multiple tokens in multiple calls
- ✅ Drop-in replacement for the original tokenizer
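One way to spot-check the single-token and round-trip claims for any token you add is a small helper like the one below. It is a sketch, not part of this repository: the function name is hypothetical, and it assumes the tokenizer's encode() accepts the standard Hugging Face add_special_tokens keyword and that decode() keeps special tokens in its output by default.

```python
def encodes_as_single_token(tokenizer, token: str) -> bool:
    """Return True if `token` maps to exactly one id and decodes back unchanged.

    Hypothetical helper: `tokenizer` is any object exposing
    encode(text, add_special_tokens=...) and decode(ids).
    """
    # Disable automatic BOS/EOS insertion so only the token itself is counted.
    ids = tokenizer.encode(token, add_special_tokens=False)
    return len(ids) == 1 and tokenizer.decode(ids) == token
```

After calling tokenizer.add_special_tokens(...), encodes_as_single_token(tokenizer, "<|RICK_ROSS|>") should return True if the token was integrated as described.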
This tokenizer extends the original TikTokenTokenizer class and overrides the add_special_tokens() method by:
- Calling the parent class method to update metadata
- Recreating the tiktoken.Encoding object with all current special tokens
- Rebuilding internal mappings (encoder, decoder, special_tokens dict)
The key insight is that while individual tiktoken.Encoding objects are immutable, we can create a new one with the expanded token set and replace the old one.
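The rebuild-and-swap pattern behind these steps can be sketched without tiktoken. Everything in the example below is a toy stand-in: FrozenEncoding and ToyTokenizer are hypothetical names, the attribute layout is not taken from the real TikTokenTokenizer, and new IDs are assumed to be assigned sequentially after the existing vocabulary.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping


@dataclass(frozen=True)
class FrozenEncoding:
    """Toy stand-in for an immutable encoding object (not tiktoken itself)."""
    n_base: int                        # number of ordinary (non-special) tokens
    special_tokens: Mapping[str, int]  # read-only special-token table


class ToyTokenizer:
    """Hypothetical wrapper that swaps in a new encoding when tokens are added."""

    def __init__(self, n_base: int, special_tokens: Dict[str, int]):
        self.special_tokens = dict(special_tokens)
        self.model = self._build(n_base)

    def _build(self, n_base: int) -> FrozenEncoding:
        # Snapshot the current table into a fresh, read-only encoding object.
        return FrozenEncoding(n_base, MappingProxyType(dict(self.special_tokens)))

    @property
    def vocab_size(self) -> int:
        return self.model.n_base + len(self.model.special_tokens)

    def add_special_tokens(self, tokens: List[str]) -> int:
        added = 0
        for tok in tokens:
            if tok in self.special_tokens:
                continue  # already known, nothing to do
            # Step 1: update the wrapper's metadata with the next free id.
            self.special_tokens[tok] = self.model.n_base + len(self.special_tokens)
            added += 1
        # Steps 2 and 3: rebuild the immutable object and replace the old one.
        self.model = self._build(self.model.n_base)
        return added


tok = ToyTokenizer(n_base=10, special_tokens={"<|eos|>": 10})
old_model = tok.model
num_added = tok.add_special_tokens(["<|a|>", "<|b|>", "<|a|>"])
print(num_added, tok.vocab_size, tok.model is old_model)
```

The old encoding object is never modified; the wrapper simply points at a new one, so any code holding a reference to the old object still sees the original token set.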
Based on: moonshotai/Moonlight-16B-A3B-Instruct