
Conversation

@ooples (Owner) commented Nov 8, 2025

This commit implements a comprehensive tokenization framework for AiDotNet, replacing the naive whitespace tokenization with state-of-the-art subword tokenization algorithms required by modern NLP systems.

Core Tokenizers Implemented:

  • BPE (Byte-Pair Encoding) for GPT models (see the merge-loop sketch after this list)
  • WordPiece for BERT-family models
  • SentencePiece (Unigram) for multilingual models
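To make the BPE item above concrete, here is a minimal merge-loop sketch. This is hypothetical, self-contained illustration code, not the AiDotNet BpeTokenizer implementation:

using System;
using System.Collections.Generic;
using System.Linq;

class BpeSketch
{
    static void Main()
    {
        // Words are symbol sequences; training repeatedly merges the most
        // frequent adjacent pair until a vocabulary budget is reached.
        var corpus = new List<List<string>>
        {
            new() { "l", "o", "w", "e", "r" },
            new() { "l", "o", "w" },
        };

        for (int step = 0; step < 3; step++)
        {
            var pairCounts = new Dictionary<(string, string), int>();
            foreach (var word in corpus)
                for (int i = 0; i < word.Count - 1; i++)
                {
                    var pair = (word[i], word[i + 1]);
                    pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
                }
            if (pairCounts.Count == 0) break;

            var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
            foreach (var word in corpus)              // apply the merge in place
                for (int i = 0; i < word.Count - 1; i++)
                    if (word[i] == best.Item1 && word[i + 1] == best.Item2)
                    {
                        word[i] = best.Item1 + best.Item2;
                        word.RemoveAt(i + 1);
                    }
            Console.WriteLine($"merge {step + 1}: {best.Item1} + {best.Item2}");
        }
    }
}

At inference time a trained BPE tokenizer replays these learned merges in order, which is what the merge-rank lookup reviewed later in this PR implements.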

Key Features:

  • Vocabulary training from corpus
  • Special tokens management ([CLS], [SEP], [PAD], [UNK], [MASK], etc.)
  • Encoding/decoding with padding and truncation (usage sketch after this list)
  • Attention mask generation
  • HuggingFace pretrained tokenizer compatibility
  • Load/save tokenizers in HuggingFace format
  • Batch encoding/decoding support
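The encode/decode features above might be exercised roughly as follows. This is a hedged usage sketch: EncodingOptions, TokenizationResult, and BpeTokenizer are file names from this PR, but every member and constructor shown is an assumption about the API, not its actual signature:

// Hypothetical usage; the real AiDotNet surface may differ.
var options = new EncodingOptions
{
    Padding = true,       // pad shorter inputs up to MaxLength (assumed property)
    Truncation = true,    // cut longer inputs down to MaxLength (assumed property)
    MaxLength = 128
};
TokenizationResult result = tokenizer.Encode("Hello world!", options); // assumed method
Console.WriteLine(string.Join(" ", result.Tokens));         // subword pieces
Console.WriteLine(string.Join(" ", result.AttentionMask));  // 1 = real token, 0 = padding
string roundTrip = tokenizer.Decode(result.TokenIds);       // assumed inverse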

Code Tokenization:

  • Language-aware tokenization (C#, Python, Java, JavaScript, TypeScript)
  • Identifier splitting (camelCase, snake_case, PascalCase), sketched after this list
  • Keyword recognition
  • CodeBERT-compatible tokenizer for program synthesis
  • Combined code + natural language encoding
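As referenced in the list above, identifier splitting roughly amounts to the following self-contained sketch (hypothetical code, not the PR's CodeTokenizer):

using System;
using System.Linq;
using System.Text.RegularExpressions;

class IdentifierSplitSketch
{
    // Split snake_case on underscores, then camelCase/PascalCase on
    // lower-to-upper boundaries via a zero-width regex split.
    static string[] SplitIdentifier(string name) =>
        name.Split('_')
            .SelectMany(part => Regex.Split(part, "(?<=[a-z0-9])(?=[A-Z])"))
            .Where(piece => piece.Length > 0)
            .ToArray();

    static void Main()
    {
        Console.WriteLine(string.Join(" ", SplitIdentifier("parseHttpRequest"))); // parse Http Request
        Console.WriteLine(string.Join(" ", SplitIdentifier("snake_case_name")));  // snake case name
        Console.WriteLine(string.Join(" ", SplitIdentifier("PascalCase")));       // Pascal Case
    }
}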

Implementation Details:

  • 16 new source files in src/Tokenization/
  • Complete interfaces (ITokenizer, IVocabulary); a hedged sketch follows this list
  • Abstract base class (TokenizerBase) for common functionality
  • Three algorithm implementations with training support
  • HuggingFace compatibility layer
  • Code-specific tokenization support
  • Comprehensive test suite (4 test files)
  • Full documentation (README.md)
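As a reading aid, the two interfaces named above might look roughly like this. Only the type names ITokenizer, IVocabulary, TokenizationResult, and EncodingOptions come from the PR's file list; all members shown are guesses:

public interface IVocabulary
{
    int Count { get; }
    int GetId(string token);      // assumed: falls back to the [UNK] id
    string GetToken(int id);
}

public interface ITokenizer
{
    IVocabulary Vocabulary { get; }
    List<string> Tokenize(string text);
    TokenizationResult Encode(string text, EncodingOptions options);
    string Decode(IEnumerable<int> tokenIds);
}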

This resolves issue #406 and unblocks:

  • Issue #404: Program Synthesis (CodeBERT tokenizer ready)
  • Issues #269-273: Multimodal systems
  • All BERT/GPT/T5 model implementations

Files created: 20 total

  • 14 implementation files
  • 2 HuggingFace compatibility files
  • 4 test files

User Story / Context

  • Reference: [US-XXX] (if applicable)
  • Base branch: merge-dev2-to-master

Summary

  • What changed and why (scoped strictly to the user story / PR intent)

Verification

  • Builds succeed (scoped to changed projects)
  • Unit tests pass locally
  • Code coverage >= 90% for touched code
  • Codecov upload succeeded (if token configured)
  • TFM verification (net46, net6.0, net8.0) passes (if packaging)
  • No unresolved Copilot comments on HEAD

Copilot Review Loop (Outcome-Based)

Record counts before/after your last push:

  • Comments on HEAD BEFORE: [N]
  • Comments on HEAD AFTER (60s): [M]
  • Final HEAD SHA: [sha]

Files Modified

  • List files changed (must align with scope)

Notes

  • Any follow-ups, caveats, or migration details

Copilot AI review requested due to automatic review settings November 8, 2025 17:59
@coderabbitai bot (Contributor) commented Nov 8, 2025

Warning

Rate limit exceeded

@ooples has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minute and 6 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 82c9b67 and d5203d0.

📒 Files selected for processing (20)
  • TOKENIZATION_IMPLEMENTATION_SUMMARY.md (1 hunks)
  • src/Tokenization/Algorithms/BpeTokenizer.cs (1 hunks)
  • src/Tokenization/Algorithms/SentencePieceTokenizer.cs (1 hunks)
  • src/Tokenization/Algorithms/WordPieceTokenizer.cs (1 hunks)
  • src/Tokenization/CodeTokenization/CodeBertTokenizer.cs (1 hunks)
  • src/Tokenization/CodeTokenization/CodeTokenizer.cs (1 hunks)
  • src/Tokenization/Core/TokenizerBase.cs (1 hunks)
  • src/Tokenization/HuggingFace/HuggingFaceTokenizerLoader.cs (1 hunks)
  • src/Tokenization/HuggingFace/TokenizerConfig.cs (1 hunks)
  • src/Tokenization/Interfaces/ITokenizer.cs (1 hunks)
  • src/Tokenization/Interfaces/IVocabulary.cs (1 hunks)
  • src/Tokenization/Models/EncodingOptions.cs (1 hunks)
  • src/Tokenization/Models/SpecialTokens.cs (1 hunks)
  • src/Tokenization/Models/TokenizationResult.cs (1 hunks)
  • src/Tokenization/README.md (1 hunks)
  • src/Tokenization/Vocabulary/Vocabulary.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/BpeTokenizerTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/CodeTokenizerTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/VocabularyTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/WordPieceTokenizerTests.cs (1 hunks)

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a comprehensive modern tokenization framework for AiDotNet, adding support for state-of-the-art subword tokenization algorithms (BPE, WordPiece, SentencePiece), HuggingFace compatibility, and specialized code tokenization capabilities. This addresses Issue #406 and unblocks multiple downstream features.

  • BPE, WordPiece, and SentencePiece tokenizers with training from corpus
  • HuggingFace pretrained tokenizer loading and saving
  • Language-aware code tokenization with identifier splitting for C#, Python, Java, JavaScript

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 18 comments.

Summary per file:

  • src/Tokenization/Interfaces/ITokenizer.cs: Defines tokenizer interface with encode/decode operations
  • src/Tokenization/Interfaces/IVocabulary.cs: Defines vocabulary management interface
  • src/Tokenization/Models/TokenizationResult.cs: Result model containing tokens, IDs, and attention masks
  • src/Tokenization/Models/EncodingOptions.cs: Configuration model for encoding with padding and truncation options
  • src/Tokenization/Models/SpecialTokens.cs: Special token management with factory methods for BERT/GPT/T5 styles
  • src/Tokenization/Core/TokenizerBase.cs: Abstract base class implementing common tokenization functionality
  • src/Tokenization/Vocabulary/Vocabulary.cs: Token-to-ID mapping implementation with unknown token handling
  • src/Tokenization/Algorithms/BpeTokenizer.cs: Byte-Pair Encoding implementation with merge-based tokenization
  • src/Tokenization/Algorithms/WordPieceTokenizer.cs: WordPiece algorithm with greedy longest-match-first approach
  • src/Tokenization/Algorithms/SentencePieceTokenizer.cs: Unigram language model with Viterbi segmentation (sketched below)
  • src/Tokenization/HuggingFace/TokenizerConfig.cs: HuggingFace configuration format model
  • src/Tokenization/HuggingFace/HuggingFaceTokenizerLoader.cs: Loads and saves pretrained tokenizers in HuggingFace format
  • src/Tokenization/CodeTokenization/CodeTokenizer.cs: Language-aware tokenizer with identifier splitting and keyword recognition
  • src/Tokenization/CodeTokenization/CodeBertTokenizer.cs: CodeBERT-compatible tokenizer for code and natural language
  • tests/AiDotNet.Tests/Tokenization/VocabularyTests.cs: Tests for vocabulary operations
  • tests/AiDotNet.Tests/Tokenization/BpeTokenizerTests.cs: Tests for BPE tokenizer functionality
  • tests/AiDotNet.Tests/Tokenization/WordPieceTokenizerTests.cs: Tests for WordPiece tokenizer
  • tests/AiDotNet.Tests/Tokenization/CodeTokenizerTests.cs: Tests for code tokenization features
  • src/Tokenization/README.md: Comprehensive documentation with usage examples
  • TOKENIZATION_IMPLEMENTATION_SUMMARY.md: Implementation summary and architecture overview
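The "Unigram language model with Viterbi segmentation" noted for SentencePieceTokenizer.cs works by choosing, among all ways to split the input into vocabulary pieces, the one that maximizes the summed log-probabilities. A minimal self-contained sketch with a toy vocabulary (illustration only, not the PR's code):

using System;
using System.Collections.Generic;

class ViterbiSketch
{
    static void Main()
    {
        var logProb = new Dictionary<string, double>   // toy unigram scores
        {
            ["un"] = -2.0, ["ig"] = -3.0, ["ram"] = -2.5, ["unig"] = -6.0,
            ["u"] = -5.0, ["n"] = -5.0, ["i"] = -5.0, ["g"] = -5.0,
            ["r"] = -5.0, ["a"] = -5.0, ["m"] = -5.0,
        };
        string s = "unigram";
        int n = s.Length;
        var best = new double[n + 1];   // best[i] = best score for s[..i]
        var back = new int[n + 1];      // back[i] = split point producing best[i]
        for (int i = 1; i <= n; i++)
        {
            best[i] = double.NegativeInfinity;
            for (int j = 0; j < i; j++)
                if (logProb.TryGetValue(s.Substring(j, i - j), out var lp)
                    && best[j] + lp > best[i])
                {
                    best[i] = best[j] + lp;
                    back[i] = j;
                }
        }
        var pieces = new List<string>();               // walk back pointers
        for (int i = n; i > 0; i = back[i])
            pieces.Insert(0, s.Substring(back[i], i - back[i]));
        Console.WriteLine(string.Join(" ", pieces));   // prints: un ig ram
    }
}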


// Truncate if necessary
if (options.Truncation && options.MaxLength.HasValue && allTokens.Count > options.MaxLength.Value)
{
    allTokens = allTokens.Take(options.MaxLength.Value - 1).ToList();
Copilot AI commented Nov 8, 2025:
Missing using directive for System.Linq. The Take extension method requires using System.Linq; which is not present in the file imports.
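A one-line fix sketch; the file's actual import block isn't shown in this PR view:

using System.Linq;  // brings the Take/Skip/ToList extension methods into scope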

Comment on lines +82 to +83
var attentionMask = new List<int>(new int[tokenIds.Count]);
for (int i = 0; i < attentionMask.Count; i++) attentionMask[i] = 1;
Copilot AI commented Nov 8, 2025:

Inefficient initialization of attention mask. Creating a list from a zero-filled array and then setting all values to 1 is wasteful. Use Enumerable.Repeat(1, tokenIds.Count).ToList() instead for more concise and efficient initialization.
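The comment implies a suggestion like this (reconstructed here, not part of the original review; requires System.Linq):

Suggested change:

  - var attentionMask = new List<int>(new int[tokenIds.Count]);
  - for (int i = 0; i < attentionMask.Count; i++) attentionMask[i] = 1;
  + var attentionMask = Enumerable.Repeat(1, tokenIds.Count).ToList();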

{
    _tokenToId = new Dictionary<string, int>(tokenToId);
    _idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
    _nextId = tokenToId.Values.Max() + 1;
Copilot AI commented Nov 8, 2025:

Calling Max() on an empty collection will throw an InvalidOperationException. If the tokenToId dictionary is empty, this constructor will fail. Add a check: _nextId = tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;

Suggested change:

  - _nextId = tokenToId.Values.Max() + 1;
  + _nextId = tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;

Comment on lines +202 to +207
foreach (Match match in matches)
{
    if (!string.IsNullOrWhiteSpace(match.Value))
    {
        parts.Add(match.Value);
    }
Copilot AI commented Nov 8, 2025:

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change:

  - foreach (Match match in matches)
  - {
  -     if (!string.IsNullOrWhiteSpace(match.Value))
  -     {
  -         parts.Add(match.Value);
  -     }
  + foreach (Match match in matches.Cast<Match>().Where(m => !string.IsNullOrWhiteSpace(m.Value)))
  + {
  +     parts.Add(match.Value);

Comment on lines +100 to +110
foreach (var line in mergeLines)
{
    if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#"))
        continue;

    var parts = line.Split(' ');
    if (parts.Length >= 2)
    {
        merges[(parts[0], parts[1])] = order++;
    }
}
Copilot AI commented Nov 8, 2025:

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
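A rewrite along the lines the reviewer suggests (reconstructed sketch, not part of the original review; assumes mergeLines is an IEnumerable<string> and System.Linq is imported):

Suggested change:

  - foreach (var line in mergeLines)
  - {
  -     if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#"))
  -         continue;
  + foreach (var line in mergeLines.Where(l => !string.IsNullOrWhiteSpace(l) && !l.StartsWith("#")))
  + {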

    _tokenToId = new Dictionary<string, int>(tokenToId);
    _idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
    _nextId = tokenToId.Values.Max() + 1;
    _unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
Copilot AI commented Nov 8, 2025:

Inefficient use of 'ContainsKey' and indexer.

Suggested change:

  - _unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
  + _tokenToId.TryGetValue(unkToken, out _unkTokenId);

Comment on lines +70 to +71
if (_tokenToId.ContainsKey(token))
    return _tokenToId[token];
Copilot AI commented Nov 8, 2025:

Inefficient use of 'ContainsKey' and indexer.

Suggested change:

  - if (_tokenToId.ContainsKey(token))
  -     return _tokenToId[token];
  + if (_tokenToId.TryGetValue(token, out var id))
  +     return id;

ITokenizer baseTokenizer,
ProgrammingLanguage language = ProgrammingLanguage.Generic,
bool splitIdentifiers = true)
: base(baseTokenizer.Vocabulary, baseTokenizer.SpecialTokens)
Copilot AI commented Nov 8, 2025:

Variable baseTokenizer may be null at this access as suggested by this null check.
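Because the base(...) call runs before the constructor body, a body-level null check comes too late to protect the dereference above; one idiomatic guard is a throw expression in the initializer (a hedged sketch, assuming the constructor should reject null):

: base((baseTokenizer ?? throw new ArgumentNullException(nameof(baseTokenizer))).Vocabulary,
       baseTokenizer.SpecialTokens)  // safe: the throw above already rejected null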

Comment on lines +131 to +143
catch
{
    // If JSON parsing fails, try as text file
    vocabDict = new Dictionary<string, int>();
    var lines = File.ReadAllLines(vocabPath);
    for (int i = 0; i < lines.Length; i++)
    {
        if (!string.IsNullOrWhiteSpace(lines[i]))
        {
            vocabDict[lines[i].Trim()] = i;
        }
    }
}
Copilot AI commented Nov 8, 2025:

Generic catch clause.
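A narrower handler, assuming System.Text.Json is the parser in use (hypothetical; the PR's actual JSON library isn't shown here):

catch (JsonException)  // only fall back to the text format on JSON parse errors,
{                      // letting I/O and other unrelated exceptions propagate
    // ... same plain-text fallback body as above ...
}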

Comment on lines +191 to +194
if (side == "left")
    return tokens.Skip(tokens.Count - maxLength).ToList();
else
    return tokens.Take(maxLength).ToList();
Copilot AI commented Nov 8, 2025:

Both branches of this 'if' statement return - consider using '?' to express intent better.

Suggested change:

  - if (side == "left")
  -     return tokens.Skip(tokens.Count - maxLength).ToList();
  - else
  -     return tokens.Take(maxLength).ToList();
  + return side == "left"
  +     ? tokens.Skip(tokens.Count - maxLength).ToList()
  +     : tokens.Take(maxLength).ToList();
