
Conversation

@ooples (Owner) commented Nov 8, 2025

This commit implements a comprehensive tokenization framework for AiDotNet, replacing the naive whitespace tokenization with state-of-the-art subword tokenization algorithms required by modern NLP systems.

Core Tokenizers Implemented:

  • BPE (Byte-Pair Encoding) for GPT models (see the merge-loop sketch after this list)
  • WordPiece for BERT-family models
  • SentencePiece (Unigram) for multilingual models
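To make the BPE item above concrete, here is a minimal merge-loop sketch. This is hypothetical, self-contained illustration code, not the AiDotNet BpeTokenizer implementation:

using System;
using System.Collections.Generic;
using System.Linq;

class BpeSketch
{
    static void Main()
    {
        // Words are symbol sequences; training repeatedly merges the most
        // frequent adjacent pair until a vocabulary budget is reached.
        var corpus = new List<List<string>>
        {
            new() { "l", "o", "w", "e", "r" },
            new() { "l", "o", "w" },
        };

        for (int step = 0; step < 3; step++)
        {
            var pairCounts = new Dictionary<(string, string), int>();
            foreach (var word in corpus)
                for (int i = 0; i < word.Count - 1; i++)
                {
                    var pair = (word[i], word[i + 1]);
                    pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
                }
            if (pairCounts.Count == 0) break;

            var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
            foreach (var word in corpus)              // apply the merge in place
                for (int i = 0; i < word.Count - 1; i++)
                    if (word[i] == best.Item1 && word[i + 1] == best.Item2)
                    {
                        word[i] = best.Item1 + best.Item2;
                        word.RemoveAt(i + 1);
                    }
            Console.WriteLine($"merge {step + 1}: {best.Item1} + {best.Item2}");
        }
    }
}

At inference time a trained BPE tokenizer replays these learned merges in order, which is what the merge-rank lookup reviewed later in this PR implements.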

Key Features:

  • Vocabulary training from corpus
  • Special tokens management ([CLS], [SEP], [PAD], [UNK], [MASK], etc.)
  • Encoding/decoding with padding and truncation (usage sketch after this list)
  • Attention mask generation
  • HuggingFace pretrained tokenizer compatibility
  • Load/save tokenizers in HuggingFace format
  • Batch encoding/decoding support
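The encode/decode features above might be exercised roughly as follows. This is a hedged usage sketch: EncodingOptions, TokenizationResult, and BpeTokenizer are file names from this PR, but every member and constructor shown is an assumption about the API, not its actual signature:

// Hypothetical usage; the real AiDotNet surface may differ.
var options = new EncodingOptions
{
    Padding = true,       // pad shorter inputs up to MaxLength (assumed property)
    Truncation = true,    // cut longer inputs down to MaxLength (assumed property)
    MaxLength = 128
};
TokenizationResult result = tokenizer.Encode("Hello world!", options); // assumed method
Console.WriteLine(string.Join(" ", result.Tokens));         // subword pieces
Console.WriteLine(string.Join(" ", result.AttentionMask));  // 1 = real token, 0 = padding
string roundTrip = tokenizer.Decode(result.TokenIds);       // assumed inverse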

Code Tokenization:

  • Language-aware tokenization (C#, Python, Java, JavaScript, TypeScript)
  • Identifier splitting (camelCase, snake_case, PascalCase), sketched after this list
  • Keyword recognition
  • CodeBERT-compatible tokenizer for program synthesis
  • Combined code + natural language encoding
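As referenced in the list above, identifier splitting roughly amounts to the following self-contained sketch (hypothetical code, not the PR's CodeTokenizer):

using System;
using System.Linq;
using System.Text.RegularExpressions;

class IdentifierSplitSketch
{
    // Split snake_case on underscores, then camelCase/PascalCase on
    // lower-to-upper boundaries via a zero-width regex split.
    static string[] SplitIdentifier(string name) =>
        name.Split('_')
            .SelectMany(part => Regex.Split(part, "(?<=[a-z0-9])(?=[A-Z])"))
            .Where(piece => piece.Length > 0)
            .ToArray();

    static void Main()
    {
        Console.WriteLine(string.Join(" ", SplitIdentifier("parseHttpRequest"))); // parse Http Request
        Console.WriteLine(string.Join(" ", SplitIdentifier("snake_case_name")));  // snake case name
        Console.WriteLine(string.Join(" ", SplitIdentifier("PascalCase")));       // Pascal Case
    }
}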

Implementation Details:

  • 16 new source files in src/Tokenization/
  • Complete interfaces (ITokenizer, IVocabulary); a hedged sketch follows this list
  • Abstract base class (TokenizerBase) for common functionality
  • Three algorithm implementations with training support
  • HuggingFace compatibility layer
  • Code-specific tokenization support
  • Comprehensive test suite (4 test files)
  • Full documentation (README.md)
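As a reading aid, the two interfaces named above might look roughly like this. Only the type names ITokenizer, IVocabulary, TokenizationResult, and EncodingOptions come from the PR's file list; all members shown are guesses:

public interface IVocabulary
{
    int Count { get; }
    int GetId(string token);      // assumed: falls back to the [UNK] id
    string GetToken(int id);
}

public interface ITokenizer
{
    IVocabulary Vocabulary { get; }
    List<string> Tokenize(string text);
    TokenizationResult Encode(string text, EncodingOptions options);
    string Decode(IEnumerable<int> tokenIds);
}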

This resolves issue #406 and unblocks:

  • Issue #404: Program Synthesis (CodeBERT tokenizer ready)
  • Issues #269-273: Multimodal systems
  • All BERT/GPT/T5 model implementations

Files created: 20 total

  • 14 implementation files
  • 2 HuggingFace compatibility files
  • 4 test files

User Story / Context

  • Reference: [US-XXX] (if applicable)
  • Base branch: merge-dev2-to-master

Summary

  • What changed and why (scoped strictly to the user story / PR intent)

Verification

  • Builds succeed (scoped to changed projects)
  • Unit tests pass locally
  • Code coverage >= 90% for touched code
  • Codecov upload succeeded (if token configured)
  • TFM verification (net46, net6.0, net8.0) passes (if packaging)
  • No unresolved Copilot comments on HEAD

Copilot Review Loop (Outcome-Based)

Record counts before/after your last push:

  • Comments on HEAD BEFORE: [N]
  • Comments on HEAD AFTER (60s): [M]
  • Final HEAD SHA: [sha]

Files Modified

  • List files changed (must align with scope)

Notes

  • Any follow-ups, caveats, or migration details

Copilot AI review requested due to automatic review settings November 8, 2025 17:59
@coderabbitai bot (Contributor) commented Nov 8, 2025

Warning

Rate limit exceeded

@ooples has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minute and 6 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 82c9b67 and d5203d0.

📒 Files selected for processing (20)
  • TOKENIZATION_IMPLEMENTATION_SUMMARY.md (1 hunks)
  • src/Tokenization/Algorithms/BpeTokenizer.cs (1 hunks)
  • src/Tokenization/Algorithms/SentencePieceTokenizer.cs (1 hunks)
  • src/Tokenization/Algorithms/WordPieceTokenizer.cs (1 hunks)
  • src/Tokenization/CodeTokenization/CodeBertTokenizer.cs (1 hunks)
  • src/Tokenization/CodeTokenization/CodeTokenizer.cs (1 hunks)
  • src/Tokenization/Core/TokenizerBase.cs (1 hunks)
  • src/Tokenization/HuggingFace/HuggingFaceTokenizerLoader.cs (1 hunks)
  • src/Tokenization/HuggingFace/TokenizerConfig.cs (1 hunks)
  • src/Tokenization/Interfaces/ITokenizer.cs (1 hunks)
  • src/Tokenization/Interfaces/IVocabulary.cs (1 hunks)
  • src/Tokenization/Models/EncodingOptions.cs (1 hunks)
  • src/Tokenization/Models/SpecialTokens.cs (1 hunks)
  • src/Tokenization/Models/TokenizationResult.cs (1 hunks)
  • src/Tokenization/README.md (1 hunks)
  • src/Tokenization/Vocabulary/Vocabulary.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/BpeTokenizerTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/CodeTokenizerTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/VocabularyTests.cs (1 hunks)
  • tests/AiDotNet.Tests/Tokenization/WordPieceTokenizerTests.cs (1 hunks)

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a comprehensive modern tokenization framework for AiDotNet, adding support for state-of-the-art subword tokenization algorithms (BPE, WordPiece, SentencePiece), HuggingFace compatibility, and specialized code tokenization capabilities. This addresses Issue #406 and unblocks multiple downstream features.

  • BPE, WordPiece, and SentencePiece tokenizers with training from corpus
  • HuggingFace pretrained tokenizer loading and saving
  • Language-aware code tokenization with identifier splitting for C#, Python, Java, JavaScript

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 18 comments.

Summary per file:

  • src/Tokenization/Interfaces/ITokenizer.cs: Defines tokenizer interface with encode/decode operations
  • src/Tokenization/Interfaces/IVocabulary.cs: Defines vocabulary management interface
  • src/Tokenization/Models/TokenizationResult.cs: Result model containing tokens, IDs, and attention masks
  • src/Tokenization/Models/EncodingOptions.cs: Configuration model for encoding with padding and truncation options
  • src/Tokenization/Models/SpecialTokens.cs: Special token management with factory methods for BERT/GPT/T5 styles
  • src/Tokenization/Core/TokenizerBase.cs: Abstract base class implementing common tokenization functionality
  • src/Tokenization/Vocabulary/Vocabulary.cs: Token-to-ID mapping implementation with unknown token handling
  • src/Tokenization/Algorithms/BpeTokenizer.cs: Byte-Pair Encoding implementation with merge-based tokenization
  • src/Tokenization/Algorithms/WordPieceTokenizer.cs: WordPiece algorithm with greedy longest-match-first approach
  • src/Tokenization/Algorithms/SentencePieceTokenizer.cs: Unigram language model with Viterbi segmentation (sketched below)
  • src/Tokenization/HuggingFace/TokenizerConfig.cs: HuggingFace configuration format model
  • src/Tokenization/HuggingFace/HuggingFaceTokenizerLoader.cs: Loads and saves pretrained tokenizers in HuggingFace format
  • src/Tokenization/CodeTokenization/CodeTokenizer.cs: Language-aware tokenizer with identifier splitting and keyword recognition
  • src/Tokenization/CodeTokenization/CodeBertTokenizer.cs: CodeBERT-compatible tokenizer for code and natural language
  • tests/AiDotNet.Tests/Tokenization/VocabularyTests.cs: Tests for vocabulary operations
  • tests/AiDotNet.Tests/Tokenization/BpeTokenizerTests.cs: Tests for BPE tokenizer functionality
  • tests/AiDotNet.Tests/Tokenization/WordPieceTokenizerTests.cs: Tests for WordPiece tokenizer
  • tests/AiDotNet.Tests/Tokenization/CodeTokenizerTests.cs: Tests for code tokenization features
  • src/Tokenization/README.md: Comprehensive documentation with usage examples
  • TOKENIZATION_IMPLEMENTATION_SUMMARY.md: Implementation summary and architecture overview
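The "Unigram language model with Viterbi segmentation" noted for SentencePieceTokenizer.cs works by choosing, among all ways to split the input into vocabulary pieces, the one that maximizes the summed log-probabilities. A minimal self-contained sketch with a toy vocabulary (illustration only, not the PR's code):

using System;
using System.Collections.Generic;

class ViterbiSketch
{
    static void Main()
    {
        var logProb = new Dictionary<string, double>   // toy unigram scores
        {
            ["un"] = -2.0, ["ig"] = -3.0, ["ram"] = -2.5, ["unig"] = -6.0,
            ["u"] = -5.0, ["n"] = -5.0, ["i"] = -5.0, ["g"] = -5.0,
            ["r"] = -5.0, ["a"] = -5.0, ["m"] = -5.0,
        };
        string s = "unigram";
        int n = s.Length;
        var best = new double[n + 1];   // best[i] = best score for s[..i]
        var back = new int[n + 1];      // back[i] = split point producing best[i]
        for (int i = 1; i <= n; i++)
        {
            best[i] = double.NegativeInfinity;
            for (int j = 0; j < i; j++)
                if (logProb.TryGetValue(s.Substring(j, i - j), out var lp)
                    && best[j] + lp > best[i])
                {
                    best[i] = best[j] + lp;
                    back[i] = j;
                }
        }
        var pieces = new List<string>();               // walk back pointers
        for (int i = n; i > 0; i = back[i])
            pieces.Insert(0, s.Substring(back[i], i - back[i]));
        Console.WriteLine(string.Join(" ", pieces));   // prints: un ig ram
    }
}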


// Truncate if necessary
if (options.Truncation && options.MaxLength.HasValue && allTokens.Count > options.MaxLength.Value)
{
    allTokens = allTokens.Take(options.MaxLength.Value - 1).ToList();
Copilot AI commented Nov 8, 2025:
Missing using directive for System.Linq. The Take extension method requires using System.Linq; which is not present in the file imports.
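A one-line fix sketch; the file's actual import block isn't shown in this PR view:

using System.Linq;  // brings the Take/Skip/ToList extension methods into scope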

Comment on lines +82 to +83
var attentionMask = new List<int>(new int[tokenIds.Count]);
for (int i = 0; i < attentionMask.Count; i++) attentionMask[i] = 1;
Copilot AI commented Nov 8, 2025:

Inefficient initialization of attention mask. Creating a list from a zero-filled array and then setting all values to 1 is wasteful. Use Enumerable.Repeat(1, tokenIds.Count).ToList() instead for more concise and efficient initialization.
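The comment implies a suggestion like this (reconstructed here, not part of the original review; requires System.Linq):

Suggested change:

  - var attentionMask = new List<int>(new int[tokenIds.Count]);
  - for (int i = 0; i < attentionMask.Count; i++) attentionMask[i] = 1;
  + var attentionMask = Enumerable.Repeat(1, tokenIds.Count).ToList();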

{
    _tokenToId = new Dictionary<string, int>(tokenToId);
    _idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
    _nextId = tokenToId.Values.Max() + 1;
Copilot AI commented Nov 8, 2025:

Calling Max() on an empty collection will throw an InvalidOperationException. If the tokenToId dictionary is empty, this constructor will fail. Add a check: _nextId = tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;

Suggested change:

  - _nextId = tokenToId.Values.Max() + 1;
  + _nextId = tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;

Comment on lines +202 to +207
foreach (Match match in matches)
{
    if (!string.IsNullOrWhiteSpace(match.Value))
    {
        parts.Add(match.Value);
    }
Copilot AI commented Nov 8, 2025:

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.

Suggested change:

  - foreach (Match match in matches)
  - {
  -     if (!string.IsNullOrWhiteSpace(match.Value))
  -     {
  -         parts.Add(match.Value);
  -     }
  + foreach (Match match in matches.Cast<Match>().Where(m => !string.IsNullOrWhiteSpace(m.Value)))
  + {
  +     parts.Add(match.Value);

Comment on lines +100 to +110
foreach (var line in mergeLines)
{
    if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#"))
        continue;

    var parts = line.Split(' ');
    if (parts.Length >= 2)
    {
        merges[(parts[0], parts[1])] = order++;
    }
}
Copilot AI commented Nov 8, 2025:

This foreach loop implicitly filters its target sequence - consider filtering the sequence explicitly using '.Where(...)'.
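A rewrite along the lines the reviewer suggests (reconstructed sketch, not part of the original review; assumes mergeLines is an IEnumerable<string> and System.Linq is imported):

Suggested change:

  - foreach (var line in mergeLines)
  - {
  -     if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#"))
  -         continue;
  + foreach (var line in mergeLines.Where(l => !string.IsNullOrWhiteSpace(l) && !l.StartsWith("#")))
  + {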

    _tokenToId = new Dictionary<string, int>(tokenToId);
    _idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
    _nextId = tokenToId.Values.Max() + 1;
    _unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
Copilot AI commented Nov 8, 2025:

Inefficient use of 'ContainsKey' and indexer.

Suggested change:

  - _unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
  + _tokenToId.TryGetValue(unkToken, out _unkTokenId);

Comment on lines +70 to +71
if (_tokenToId.ContainsKey(token))
    return _tokenToId[token];
Copilot AI commented Nov 8, 2025:

Inefficient use of 'ContainsKey' and indexer.

Suggested change:

  - if (_tokenToId.ContainsKey(token))
  -     return _tokenToId[token];
  + if (_tokenToId.TryGetValue(token, out var id))
  +     return id;

ITokenizer baseTokenizer,
ProgrammingLanguage language = ProgrammingLanguage.Generic,
bool splitIdentifiers = true)
: base(baseTokenizer.Vocabulary, baseTokenizer.SpecialTokens)
Copilot AI commented Nov 8, 2025:

Variable baseTokenizer may be null at this access as suggested by this null check.
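Because the base(...) call runs before the constructor body, a body-level null check comes too late to protect the dereference above; one idiomatic guard is a throw expression in the initializer (a hedged sketch, assuming the constructor should reject null):

: base((baseTokenizer ?? throw new ArgumentNullException(nameof(baseTokenizer))).Vocabulary,
       baseTokenizer.SpecialTokens)  // safe: the throw above already rejected null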

Comment on lines +131 to +143
catch
{
    // If JSON parsing fails, try as text file
    vocabDict = new Dictionary<string, int>();
    var lines = File.ReadAllLines(vocabPath);
    for (int i = 0; i < lines.Length; i++)
    {
        if (!string.IsNullOrWhiteSpace(lines[i]))
        {
            vocabDict[lines[i].Trim()] = i;
        }
    }
}
Copilot AI commented Nov 8, 2025:

Generic catch clause.
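A narrower handler, assuming System.Text.Json is the parser in use (hypothetical; the PR's actual JSON library isn't shown here):

catch (JsonException)  // only fall back to the text format on JSON parse errors,
{                      // letting I/O and other unrelated exceptions propagate
    // ... same plain-text fallback body as above ...
}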

Comment on lines +191 to +194
if (side == "left")
    return tokens.Skip(tokens.Count - maxLength).ToList();
else
    return tokens.Take(maxLength).ToList();
Copilot AI commented Nov 8, 2025:

Both branches of this 'if' statement return - consider using '?' to express intent better.

Suggested change:

  - if (side == "left")
  -     return tokens.Skip(tokens.Count - maxLength).ToList();
  - else
  -     return tokens.Take(maxLength).ToList();
  + return side == "left"
  +     ? tokens.Skip(tokens.Count - maxLength).ToList()
  +     : tokens.Take(maxLength).ToList();
