Red-Hat-AI-Innovation-Team/moonlight-extended-tokenizer

Moonlight Extended Tokenizer

This is an extended version of the Moonlight-16B-A3B-Instruct tokenizer that supports dynamically adding new special tokens.

The Problem

The original Moonlight tokenizer uses tiktoken under the hood, and tiktoken.Encoding objects are immutable once created. As a result, special tokens added after initialization are never actually integrated into the underlying encoding: instead of mapping to a single ID, they get split into ordinary sub-tokens.
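The shape of the problem can be illustrated without tiktoken itself, using a hypothetical read-only encoding (FrozenEncoding is a stand-in for illustration, not tiktoken's actual API): once the token table is frozen, in-place additions are rejected, so any wrapper that only updates its own bookkeeping leaves the real encoder unchanged.

```python
from types import MappingProxyType

# Hypothetical stand-in for an immutable tiktoken.Encoding: the
# token -> id table is exposed read-only, so it cannot be patched in place.
class FrozenEncoding:
    def __init__(self, vocab):
        self._vocab = MappingProxyType(dict(vocab))

    @property
    def vocab(self):
        return self._vocab

enc = FrozenEncoding({"hello": 0, "world": 1})
try:
    enc.vocab["<|NEW|>"] = 2   # mutation is rejected
except TypeError as e:
    print("cannot mutate:", e)
```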

The Solution

This extended tokenizer overrides the add_special_tokens() method to recreate the underlying tiktoken.Encoding object when new tokens are added, properly integrating them into the vocabulary.

Usage

```python
from transformers import AutoTokenizer

# Load the extended tokenizer (replace the path with wherever the
# repository is checked out locally)
tokenizer = AutoTokenizer.from_pretrained(
    "/mnt/vde/workspace/osilkin/experiment-repos/moonlight-extended-tokenizer",
    trust_remote_code=True,
)

print(f"Initial vocab size: {tokenizer.vocab_size}")  # 163842

# Add new special tokens - this actually works now!
tokenizer.add_special_tokens({
    "additional_special_tokens": ["<|RICK_ROSS|>"]
})

print(f"New vocab size: {tokenizer.vocab_size}")  # 163843

# Test encoding
text = "Hello <|RICK_ROSS|> world!"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)

print(f"Original: {text}")
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
# The token <|RICK_ROSS|> is now a single token in the vocabulary!
```

Features

  • ✅ Vocab size increases correctly when adding tokens
  • ✅ New tokens are encoded as single token IDs (not split into sub-tokens)
  • ✅ Full encode/decode compatibility maintained
  • ✅ Can add multiple tokens in multiple calls
  • ✅ Drop-in replacement for the original tokenizer

Technical Details

This tokenizer extends the original TikTokenTokenizer class and overrides the add_special_tokens() method by:

  1. Calling the parent class method to update metadata
  2. Recreating the tiktoken.Encoding object with all current special tokens
  3. Rebuilding internal mappings (encoder, decoder, special_tokens dict)

The key insight is that while individual tiktoken.Encoding objects are immutable, we can create a new one with the expanded token set and replace the old one.
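The steps above can be sketched in pure Python, with a plain dict standing in for the immutable tiktoken.Encoding (the class and method bodies here are illustrative, not the repo's actual internals):

```python
# Sketch of the recreate-and-swap pattern: update bookkeeping, build a
# brand-new encoding that includes the new tokens, then swap it in.
class ExtensibleTokenizer:
    def __init__(self, vocab):
        self._encoding = dict(vocab)   # stands in for tiktoken.Encoding
        self.special_tokens = {}

    def add_special_tokens(self, tokens):
        # 1. Update metadata: assign fresh IDs past the current vocab.
        next_id = len(self._encoding)
        for tok in tokens:
            if tok not in self._encoding and tok not in self.special_tokens:
                self.special_tokens[tok] = next_id
                next_id += 1
        # 2. Recreate the encoding with all current special tokens included.
        new_encoding = dict(self._encoding)
        new_encoding.update(self.special_tokens)
        # 3. Swap it in; the old immutable object is simply discarded.
        self._encoding = new_encoding
```

Because each call rebuilds the table from scratch, multiple calls compose naturally: every call sees the vocabulary produced by the previous swap.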

Original Model

Based on: moonshotai/Moonlight-16B-A3B-Instruct
