Fix issue 406 #438

Conversation
This commit implements a comprehensive tokenization framework for AiDotNet, replacing the naive whitespace tokenization with state-of-the-art subword tokenization algorithms required by modern NLP systems.

Core Tokenizers Implemented:
- BPE (Byte-Pair Encoding) for GPT models
- WordPiece for BERT-family models
- SentencePiece (Unigram) for multilingual models

Key Features:
- Vocabulary training from corpus
- Special tokens management ([CLS], [SEP], [PAD], [UNK], [MASK], etc.)
- Encoding/decoding with padding and truncation
- Attention mask generation
- HuggingFace pretrained tokenizer compatibility
- Load/save tokenizers in HuggingFace format
- Batch encoding/decoding support

Code Tokenization:
- Language-aware tokenization (C#, Python, Java, JavaScript, TypeScript)
- Identifier splitting (camelCase, snake_case, PascalCase)
- Keyword recognition
- CodeBERT-compatible tokenizer for program synthesis
- Combined code + natural language encoding

Implementation Details:
- 16 new source files in src/Tokenization/
- Complete interfaces (ITokenizer, IVocabulary)
- Abstract base class (TokenizerBase) for common functionality
- Three algorithm implementations with training support
- HuggingFace compatibility layer
- Code-specific tokenization support
- Comprehensive test suite (4 test files)
- Full documentation (README.md)

This resolves issue #406 and unblocks:
- Issue #404: Program Synthesis (CodeBERT tokenizer ready)
- Issues #269-273: Multimodal systems
- All BERT/GPT/T5 model implementations

Files created: 20 total
- 14 implementation files
- 2 HuggingFace compatibility files
- 4 test files
Pull Request Overview
This PR implements a comprehensive modern tokenization framework for AiDotNet, adding support for state-of-the-art subword tokenization algorithms (BPE, WordPiece, SentencePiece), HuggingFace compatibility, and specialized code tokenization capabilities. This addresses Issue #406 and unblocks multiple downstream features.
- BPE, WordPiece, and SentencePiece tokenizers with training from corpus
- HuggingFace pretrained tokenizer loading and saving
- Language-aware code tokenization with identifier splitting for C#, Python, Java, JavaScript
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| src/Tokenization/Interfaces/ITokenizer.cs | Defines tokenizer interface with encode/decode operations |
| src/Tokenization/Interfaces/IVocabulary.cs | Defines vocabulary management interface |
| src/Tokenization/Models/TokenizationResult.cs | Result model containing tokens, IDs, and attention masks |
| src/Tokenization/Models/EncodingOptions.cs | Configuration model for encoding with padding and truncation options |
| src/Tokenization/Models/SpecialTokens.cs | Special token management with factory methods for BERT/GPT/T5 styles |
| src/Tokenization/Core/TokenizerBase.cs | Abstract base class implementing common tokenization functionality |
| src/Tokenization/Vocabulary/Vocabulary.cs | Token-to-ID mapping implementation with unknown token handling |
| src/Tokenization/Algorithms/BpeTokenizer.cs | Byte-Pair Encoding implementation with merge-based tokenization |
| src/Tokenization/Algorithms/WordPieceTokenizer.cs | WordPiece algorithm with greedy longest-match-first approach |
| src/Tokenization/Algorithms/SentencePieceTokenizer.cs | Unigram language model with Viterbi segmentation |
| src/Tokenization/HuggingFace/TokenizerConfig.cs | HuggingFace configuration format model |
| src/Tokenization/HuggingFace/HuggingFaceTokenizerLoader.cs | Loads and saves pretrained tokenizers in HuggingFace format |
| src/Tokenization/CodeTokenization/CodeTokenizer.cs | Language-aware tokenizer with identifier splitting and keyword recognition |
| src/Tokenization/CodeTokenization/CodeBertTokenizer.cs | CodeBERT-compatible tokenizer for code and natural language |
| tests/AiDotNet.Tests/Tokenization/VocabularyTests.cs | Tests for vocabulary operations |
| tests/AiDotNet.Tests/Tokenization/BpeTokenizerTests.cs | Tests for BPE tokenizer functionality |
| tests/AiDotNet.Tests/Tokenization/WordPieceTokenizerTests.cs | Tests for WordPiece tokenizer |
| tests/AiDotNet.Tests/Tokenization/CodeTokenizerTests.cs | Tests for code tokenization features |
| src/Tokenization/README.md | Comprehensive documentation with usage examples |
| TOKENIZATION_IMPLEMENTATION_SUMMARY.md | Implementation summary and architecture overview |
```csharp
// Truncate if necessary
if (options.Truncation && options.MaxLength.HasValue && allTokens.Count > options.MaxLength.Value)
{
    allTokens = allTokens.Take(options.MaxLength.Value - 1).ToList();
```

**Copilot AI** (Nov 8, 2025): Missing using directive for `System.Linq`. The `Take` extension method requires `using System.Linq;`, which is not present in the file's imports.
```csharp
var attentionMask = new List<int>(new int[tokenIds.Count]);
for (int i = 0; i < attentionMask.Count; i++) attentionMask[i] = 1;
```

**Copilot AI** (Nov 8, 2025): Inefficient initialization of the attention mask. Creating a list from a zero-filled array and then setting every value to 1 is wasteful. Use `Enumerable.Repeat(1, tokenIds.Count).ToList()` for a more concise and efficient initialization.
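The suggested fix can be sketched in isolation; the class and method names below are illustrative, not taken from the PR:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AttentionMaskDemo
{
    // One mask entry per token ID, all set to 1 (no padding positions yet).
    static List<int> BuildMask(IReadOnlyCollection<int> tokenIds)
        => Enumerable.Repeat(1, tokenIds.Count).ToList();

    static void Main()
    {
        var mask = BuildMask(new[] { 101, 2054, 102 });
        Console.WriteLine(string.Join(",", mask)); // prints "1,1,1"
    }
}
```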
```csharp
{
    _tokenToId = new Dictionary<string, int>(tokenToId);
    _idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
    _nextId = tokenToId.Values.Max() + 1;
```

**Copilot AI** (Nov 8, 2025): Calling `Max()` on an empty collection throws an `InvalidOperationException`, so this constructor fails if the `tokenToId` dictionary is empty. Add a check:

```diff
- _nextId = tokenToId.Values.Max() + 1;
+ _nextId = tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;
```
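The guarded expression can be exercised standalone; this sketch (illustrative names, not the PR's actual class) shows the empty-dictionary case that the unguarded `Max()` would turn into an exception:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NextIdDemo
{
    // Next free token ID: one past the largest assigned ID, or 0 when the
    // vocabulary is empty (where Values.Max() alone would throw).
    static int NextId(Dictionary<string, int> tokenToId)
        => tokenToId.Count > 0 ? tokenToId.Values.Max() + 1 : 0;

    static void Main()
    {
        Console.WriteLine(NextId(new Dictionary<string, int>())); // prints "0"
        Console.WriteLine(NextId(new Dictionary<string, int>
        {
            ["[UNK]"] = 0,
            ["hello"] = 5
        })); // prints "6"
    }
}
```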
```csharp
foreach (Match match in matches)
{
    if (!string.IsNullOrWhiteSpace(match.Value))
    {
        parts.Add(match.Value);
    }
```

**Copilot AI** (Nov 8, 2025): This `foreach` loop implicitly filters its target sequence; consider filtering the sequence explicitly using `.Where(...)`.

```diff
- foreach (Match match in matches)
- {
-     if (!string.IsNullOrWhiteSpace(match.Value))
-     {
-         parts.Add(match.Value);
-     }
+ foreach (Match match in matches.Cast<Match>().Where(m => !string.IsNullOrWhiteSpace(m.Value)))
+ {
+     parts.Add(match.Value);
```
```csharp
foreach (var line in mergeLines)
{
    if (string.IsNullOrWhiteSpace(line) || line.StartsWith("#"))
        continue;

    var parts = line.Split(' ');
    if (parts.Length >= 2)
    {
        merges[(parts[0], parts[1])] = order++;
    }
}
```

**Copilot AI** (Nov 8, 2025): This `foreach` loop implicitly filters its target sequence; consider filtering the sequence explicitly using `.Where(...)`.
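No suggested change accompanies this comment, but the explicit-filtering style could look roughly like the following self-contained sketch (illustrative names, not the PR's exact code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MergeParseDemo
{
    // Parse BPE merge lines, filtering blanks and "#" comments up front
    // in the query rather than inside the loop body.
    static Dictionary<(string, string), int> ParseMerges(IEnumerable<string> mergeLines)
    {
        var merges = new Dictionary<(string, string), int>();
        int order = 0;
        foreach (var parts in mergeLines
                     .Where(l => !string.IsNullOrWhiteSpace(l) && !l.StartsWith("#"))
                     .Select(l => l.Split(' '))
                     .Where(p => p.Length >= 2))
        {
            merges[(parts[0], parts[1])] = order++;
        }
        return merges;
    }

    static void Main()
    {
        var merges = ParseMerges(new[] { "#version: 0.2", "t h", "", "th e" });
        Console.WriteLine(merges.Count); // prints "2"
    }
}
```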
```csharp
_tokenToId = new Dictionary<string, int>(tokenToId);
_idToToken = tokenToId.ToDictionary(kvp => kvp.Value, kvp => kvp.Key);
_nextId = tokenToId.Values.Max() + 1;
_unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
```

**Copilot AI** (Nov 8, 2025): Inefficient use of `ContainsKey` and the indexer; this performs two lookups where one suffices.

```diff
- _unkTokenId = _tokenToId.ContainsKey(unkToken) ? _tokenToId[unkToken] : 0;
+ _tokenToId.TryGetValue(unkToken, out _unkTokenId);
```

(The suggestion preserves behavior: `TryGetValue` leaves the `out` variable at its default of `0` when the key is absent, matching the original fallback.)
```csharp
if (_tokenToId.ContainsKey(token))
    return _tokenToId[token];
```

**Copilot AI** (Nov 8, 2025): Inefficient use of `ContainsKey` and the indexer.

```diff
- if (_tokenToId.ContainsKey(token))
-     return _tokenToId[token];
+ if (_tokenToId.TryGetValue(token, out var id))
+     return id;
```
```csharp
    ITokenizer baseTokenizer,
    ProgrammingLanguage language = ProgrammingLanguage.Generic,
    bool splitIdentifiers = true)
    : base(baseTokenizer.Vocabulary, baseTokenizer.SpecialTokens)
```

**Copilot AI** (Nov 8, 2025): Variable `baseTokenizer` may be null at this access, as suggested by a later null check; the `base(...)` call dereferences it before any validation in the constructor body can run.
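One common way to address this is a throw expression that rejects null before the base-constructor call dereferences the argument. The types below are simplified stand-ins, not the PR's actual classes:

```csharp
using System;

public class Vocab { }

public class TokenizerBase
{
    public Vocab Vocabulary { get; } = new Vocab();
    public TokenizerBase() { }
    protected TokenizerBase(Vocab vocab) { }
}

public class CodeTokenizer : TokenizerBase
{
    public CodeTokenizer(TokenizerBase baseTokenizer)
        // Throw before the dereference if the argument is null.
        : base((baseTokenizer ?? throw new ArgumentNullException(nameof(baseTokenizer))).Vocabulary)
    {
    }
}

class Program
{
    static void Main()
    {
        try
        {
            var t = new CodeTokenizer(null);
        }
        catch (ArgumentNullException)
        {
            Console.WriteLine("rejected null"); // prints "rejected null"
        }
    }
}
```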
```csharp
catch
{
    // If JSON parsing fails, try as text file
    vocabDict = new Dictionary<string, int>();
    var lines = File.ReadAllLines(vocabPath);
    for (int i = 0; i < lines.Length; i++)
    {
        if (!string.IsNullOrWhiteSpace(lines[i]))
        {
            vocabDict[lines[i].Trim()] = i;
        }
    }
}
```

**Copilot AI** (Nov 8, 2025): Generic catch clause. Catching every exception here also swallows unrelated failures such as I/O errors; catch the specific JSON parsing exception instead.
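A sketch of the narrower catch, assuming the JSON parsing uses `System.Text.Json` (the PR's actual JSON library is not shown here, so the exception type is an assumption; the method is simplified to take the file content directly):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

class VocabLoadDemo
{
    // Try JSON first; on a JSON-specific failure only, fall back to
    // one-token-per-line parsing. Other exceptions still propagate.
    static Dictionary<string, int> LoadVocab(string content)
    {
        try
        {
            return JsonSerializer.Deserialize<Dictionary<string, int>>(content)
                   ?? new Dictionary<string, int>();
        }
        catch (JsonException)
        {
            var vocab = new Dictionary<string, int>();
            var lines = content.Split('\n');
            for (int i = 0; i < lines.Length; i++)
            {
                if (!string.IsNullOrWhiteSpace(lines[i]))
                {
                    vocab[lines[i].Trim()] = i;
                }
            }
            return vocab;
        }
    }

    static void Main()
    {
        Console.WriteLine(LoadVocab("{\"[UNK]\":0,\"hello\":1}").Count); // prints "2"
        Console.WriteLine(LoadVocab("[UNK]\nhello")["hello"]);           // prints "1"
    }
}
```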
```csharp
if (side == "left")
    return tokens.Skip(tokens.Count - maxLength).ToList();
else
    return tokens.Take(maxLength).ToList();
```

**Copilot AI** (Nov 8, 2025): Both branches of this `if` statement return; consider using `?:` to express the intent more directly.

```diff
- if (side == "left")
-     return tokens.Skip(tokens.Count - maxLength).ToList();
- else
-     return tokens.Take(maxLength).ToList();
+ return side == "left"
+     ? tokens.Skip(tokens.Count - maxLength).ToList()
+     : tokens.Take(maxLength).ToList();
```
User Story / Context
merge-dev2-to-master

Summary
Verification
Copilot Review Loop (Outcome-Based)
Record counts before/after your last push:
Files Modified
Notes