Red-Hat-AI-Innovation-Team/moonlight-extended-tokenizer

Moonlight Extended Tokenizer

This is an extended version of the Moonlight-16B-A3B-Instruct tokenizer that supports dynamically adding new special tokens.

The Problem

The original Moonlight tokenizer uses tiktoken under the hood, and tiktoken.Encoding objects are immutable once created. As a result, special tokens added after initialization are never actually integrated into the underlying encoding: instead of mapping to a single ID, they get split into ordinary sub-tokens.
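The shape of the problem can be illustrated without tiktoken itself, using a hypothetical read-only encoding (FrozenEncoding is a stand-in for illustration, not tiktoken's actual API): once the token table is frozen, in-place additions are rejected, so any wrapper that only updates its own bookkeeping leaves the real encoder unchanged.

```python
from types import MappingProxyType

# Hypothetical stand-in for an immutable tiktoken.Encoding: the
# token -> id table is exposed read-only, so it cannot be patched in place.
class FrozenEncoding:
    def __init__(self, vocab):
        self._vocab = MappingProxyType(dict(vocab))

    @property
    def vocab(self):
        return self._vocab

enc = FrozenEncoding({"hello": 0, "world": 1})
try:
    enc.vocab["<|NEW|>"] = 2   # mutation is rejected
except TypeError as e:
    print("cannot mutate:", e)
```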

The Solution

This extended tokenizer overrides the add_special_tokens() method to recreate the underlying tiktoken.Encoding object when new tokens are added, properly integrating them into the vocabulary.

Usage

```python
from transformers import AutoTokenizer

# Load the extended tokenizer (replace the path with wherever the
# repository is checked out locally)
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/vde/workspace/osilkin/experiment-repos/moonlight-extended-tokenizer",
    trust_remote_code=True,
)

print(f"Initial vocab size: {tokenizer.vocab_size}")  # 163842

# Add new special tokens - this actually works now!
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|RICK_ROSS|>"]
})

print(f"New vocab size: {tokenizer.vocab_size}")  # 163843

# Test encoding
text = "Hello <|RICK_ROSS|> world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
# The token <|RICK_ROSS|> is now a single token in the vocabulary!
```

Features

  • ✅ Vocab size increases correctly when adding tokens
  • ✅ New tokens are encoded as single token IDs (not split into sub-tokens)
  • ✅ Full encode/decode compatibility maintained
  • ✅ Can add multiple tokens in multiple calls
  • ✅ Drop-in replacement for the original tokenizer

Technical Details

This tokenizer extends the original TikTokenTokenizer class and overrides the add_special_tokens() method by:

  1. Calling the parent class method to update metadata
  2. Recreating the tiktoken.Encoding object with all current special tokens
  3. Rebuilding internal mappings (encoder, decoder, special_tokens dict)

The key insight is that while individual tiktoken.Encoding objects are immutable, we can create a new one with the expanded token set and replace the old one.
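The steps above can be sketched in pure Python, with a plain dict standing in for the immutable tiktoken.Encoding (the class and method bodies here are illustrative, not the repo's actual internals):

```python
# Sketch of the recreate-and-swap pattern: update bookkeeping, build a
# brand-new encoding that includes the new tokens, then swap it in.
class ExtensibleTokenizer:
    def __init__(self, vocab):
        self._encoding = dict(vocab)   # stands in for tiktoken.Encoding
        self.special_tokens = {}

    def add_special_tokens(self, tokens):
        # 1. Update metadata: assign fresh IDs past the current vocab.
        next_id = len(self._encoding)
        for tok in tokens:
            if tok not in self._encoding and tok not in self.special_tokens:
                self.special_tokens[tok] = next_id
                next_id += 1
        # 2. Recreate the encoding with all current special tokens included.
        new_encoding = dict(self._encoding)
        new_encoding.update(self.special_tokens)
        # 3. Swap it in; the old immutable object is simply discarded.
        self._encoding = new_encoding
```

Because each call rebuilds the table from scratch, multiple calls compose naturally: every call sees the vocabulary produced by the previous swap.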

Original Model

Based on: moonshotai/Moonlight-16B-A3B-Instruct
