# Optimizations for significantly faster tokenizer loading #303
Status: Open

DePasqualeOrg wants to merge 33 commits into huggingface:main from DePasqualeOrg:tokenizer-optimizations
+2,221 −251
## Conversation
**DePasqualeOrg** (Contributor, Author) commented:

> #304 should be merged before this PR, since this PR depends on it.
> yyjson already handles BOM characters in strings correctly.
## Description
Currently, tokenizer loading is a major performance bottleneck in Swift, typically taking ~1400 ms compared to ~100 ms in Python.
This PR optimizes tokenizer loading for a 3.6x speedup, saving ~1020 ms. #302 and #304 in swift-transformers and #21 in swift-huggingface add further performance gains.
### Performance

Tested with the Qwen/Qwen3-0.6B-Base tokenizer (150k vocab, 150k merges) on an M3 MacBook Pro.
A benchmark is included in `LoadingBenchmarks.swift`, which can be removed before merging or kept to measure the impact of future changes. Run it with `swift test --filter LoadingBenchmarks`.

### Problems with the Current Implementation
- **Config wrapper overhead:** `Config.convertToBinaryDistinctKeys()` recursively wraps every JSON value. For 150k vocab entries plus 150k merges, that means 300k+ object allocations taking ~1 second, only for the wrappers to be unwrapped again immediately.
- **Expensive merge lookups:** `[BytePair: Int]` uses string hashing for merge-rank lookups, which is slow across 150k entries.
- **Sequential initialization:** expensive dictionary building happens one table at a time, leaving CPU cores idle.
### Solutions Implemented in this PR

#### 1. Config Bypass (~1000 ms saved)

Extract the vocab and merges directly from the raw JSON before `Config` conversion, and pass them to new fast-path initializers. A sketch of the idea follows.
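The extraction looks roughly like this. This is a minimal sketch, not the PR's exact code; the function name and the assumption that merges are classic `"a b"` strings are illustrative:

```swift
import Foundation

// Hypothetical fast path: pull `vocab` and `merges` out of the raw
// tokenizer.json before it is wrapped in `Config`, so the 300k+ entries
// never go through per-value wrapper allocation.
func extractVocabAndMerges(fromTokenizerJSON data: Data) throws -> (vocab: [String: Int], merges: [String])? {
    guard let root = try JSONSerialization.jsonObject(with: data) as? [String: Any],
          let model = root["model"] as? [String: Any],
          let vocab = model["vocab"] as? [String: Int],
          let merges = model["merges"] as? [String]
    else { return nil } // fall back to the regular Config-based path
    return (vocab, merges)
}
```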
#### 2. Integer-Packed Merge Keys (~180 ms saved)

Replace `[BytePair: Int]` (expensive string hashing) with `[UInt64: Int]` (fast integer hashing), as in the sketch below.
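A minimal sketch of the packing scheme (the helper name and toy data are illustrative):

```swift
// Pack two 32-bit token IDs into one UInt64 key: the high bits hold the
// first token, the low bits the second. Hashing one integer is far cheaper
// than hashing a pair of strings across ~150k merge-rank lookups.
@inline(__always)
func mergeKey(_ first: UInt32, _ second: UInt32) -> UInt64 {
    (UInt64(first) << 32) | UInt64(second)
}

// Toy data standing in for the real merge list.
let mergePairs: [(UInt32, UInt32)] = [(17, 42), (42, 7)]

var mergeRanks: [UInt64: Int] = [:]
for (rank, pair) in mergePairs.enumerated() {
    mergeRanks[mergeKey(pair.0, pair.1)] = rank
}
```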
#### 3. Parallel Dictionary Building (~106 ms saved)

Use `async let` to build the dictionaries concurrently, along these lines.
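A minimal illustration of the pattern; the table names are placeholders, not the PR's actual properties:

```swift
// Sketch of structured concurrency for table construction: each `async let`
// starts a child task immediately, and `await` joins both results.
func buildTables(vocab: [String: Int], merges: [String]) async -> (idsToTokens: [Int: String], mergeRanks: [String: Int]) {
    async let idsToTokens = Dictionary(uniqueKeysWithValues: vocab.map { ($0.value, $0.key) })
    async let mergeRanks = Dictionary(uniqueKeysWithValues: merges.enumerated().map { ($0.element, $0.offset) })
    return await (idsToTokens, mergeRanks)
}
```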
#### 4. Conditional `stringToId` for Unicode Edge Cases

Added an optional `stringToId` fallback dictionary for tokenizers with Unicode edge cases (e.g., Gemma's BOM-prefixed tokens). It is only built when needed; most tokenizers skip it entirely, saving ~50 ms.
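A sketch of the conditional build. The BOM-prefix check below is a guess at the shape of the detection, not the PR's exact logic:

```swift
// Build the stringToId fallback only when the vocab actually contains a
// token that needs it (here: a BOM-prefixed key, as in Gemma). Most
// tokenizers never pay this cost.
func makeStringToIdFallback(vocab: [String: Int]) -> [String: Int]? {
    let bom = "\u{FEFF}"
    guard vocab.keys.contains(where: { $0.hasPrefix(bom) }) else { return nil }
    return vocab
}
```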
### Backward Compatibility

Standard usage benefits from the faster loading automatically:
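```swift
import Tokenizers

// Standard loading path; callers get the fast path with no code changes.
let tokenizer = try await AutoTokenizer.from(pretrained: "Qwen/Qwen3-0.6B-Base")
let ids = tokenizer.encode(text: "Hello, world!")
```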
Direct use of `LanguageModelConfigurationFromHub` continues to work unchanged. The default behavior preserves `tokenizerData.model.vocab` and `tokenizerData.model.merges` for backward compatibility.

For callers who want the performance optimization, opt in with `stripVocabForPerformance: true` and use the new properties, as sketched below:
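A hypothetical sketch of the opt-in call. `stripVocabForPerformance` comes from this PR, but the property names in the comments are placeholders; see the diff for the real API surface:

```swift
// Opt in to stripping the raw vocab/merges out of the Config tree and
// reading them via the new fast-path properties instead.
let config = LanguageModelConfigurationFromHub(
    modelName: "Qwen/Qwen3-0.6B-Base",
    stripVocabForPerformance: true // parameter added by this PR
)
let tokenizerData = try await config.tokenizerData
// let vocab = tokenizerData.rawVocab    // placeholder property name
// let merges = tokenizerData.rawMerges  // placeholder property name
```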
### Custom Tokenizer Registration

Added `AutoTokenizer.register(_:for:)` for registering custom tokenizer classes; a usage sketch follows below. This mirrors Python transformers' `AutoTokenizer.register()`, which populates `REGISTERED_TOKENIZER_CLASSES` for lookup by class name. It makes it easy for downstream projects like mlx-swift-lm to use the fast path via `AutoTokenizer.from()` while still supporting custom tokenizer classes when needed.
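A usage sketch, assuming registration is keyed by the class name found in `tokenizer_config.json`; the exact parameter types of `register(_:for:)` are in the PR diff:

```swift
// `MyCustomTokenizer` is a hypothetical tokenizer class defined elsewhere,
// registered under the "tokenizer_class" name from tokenizer_config.json.
AutoTokenizer.register(MyCustomTokenizer.self, for: "MyCustomTokenizer")

// Subsequent loads resolve the registered class by name as usual.
let tokenizer = try await AutoTokenizer.from(pretrained: "my-org/my-model")
```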
### Testing

All existing tests pass.
### Alignment with Python

These optimizations align with patterns in the Python tokenizers library:

- a `Pair` type with tuple hashing
- `convert_to_native_format()`
### Future Work

A new major version could make breaking changes for better ergonomics:

- `Tokenizer.load(from:)` as the primary entry point
- a `TokenStorage` type

This would reduce `BPETokenizer` from four initializers to one factory method while maintaining performance.