@DePasqualeOrg DePasqualeOrg commented Dec 27, 2025

Currently, tokenizer loading is a major performance bottleneck in Swift, typically taking ~1400 ms compared to ~100 ms in Python.

This PR optimizes tokenizer loading for a 3.6x speedup, saving ~1020 ms. #302 and #304 in swift-transformers and #21 in swift-huggingface add further performance gains.

Performance

Metric      Before     This PR    Speedup
Load time   1411 ms    391 ms     3.6x (~1020 ms saved)

Tested with the Qwen/Qwen3-0.6B-Base tokenizer (150k vocab, 150k merges) on an M3 MacBook Pro.

A benchmark is included in LoadingBenchmarks.swift, which can be removed before merging or kept to measure the impact of future changes. Run it with swift test --filter LoadingBenchmarks.

Problems with the Current Implementation

  1. Config wrapper overhead: Config.convertToBinaryDistinctKeys() recursively wraps every JSON value. For 150k vocab + 150k merges, this means 300k+ object allocations taking ~1 second — only to be immediately unwrapped.

  2. Expensive merge lookups: [BytePair: Int] uses string hashing for merge rank lookups, which is slow for 150k entries.

  3. Sequential initialization: Expensive dictionary building happens sequentially, leaving CPU cores idle.

Solutions Implemented in this PR

1. Config Bypass (~1000ms saved)

Extract vocab/merges directly from raw JSON before Config conversion, passing them to new fast-path initializers.
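
As a rough sketch of the idea (the function name and the assumed tokenizer.json layout are illustrative, not the PR's exact code), the fast path reads the raw dictionary once and pulls out the two large collections as Foundation containers, so the 300k+ entries are never re-wrapped one by one:

import Foundation

// Hypothetical sketch: extract "model.vocab" and "model.merges" from the raw
// tokenizer.json dictionary before the rest of it is wrapped in Config.
func extractVocabAndMerges(fromTokenizerJSON data: Data) throws -> (vocab: NSDictionary, merges: [Any])? {
    guard
        let root = try JSONSerialization.jsonObject(with: data) as? NSDictionary,
        let model = root["model"] as? NSDictionary,
        let vocab = model["vocab"] as? NSDictionary,
        let merges = model["merges"] as? [Any]
    else { return nil }
    // Returning Foundation containers avoids per-entry wrapper allocations.
    return (vocab, merges)
}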

2. Integer-Packed Merge Keys (~180ms saved)

Replace [BytePair: Int] (expensive string hashing) with [UInt64: Int] (fast integer hashing):

// Pack two token IDs into one UInt64
let key = UInt64(tokenIdA) << 32 | UInt64(tokenIdB)
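
For illustration (the dictionary and function names are hypothetical), a merge-rank lookup then hashes a single UInt64 instead of a composite string key:

// Hypothetical sketch of the lookup side; assumes token IDs fit in 32 bits.
func mergeRank(_ a: Int, _ b: Int, in bpeRanks: [UInt64: Int]) -> Int? {
    bpeRanks[UInt64(UInt32(a)) << 32 | UInt64(UInt32(b))]
}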

3. Parallel Dictionary Building (~106ms saved)

Use async let to build dictionaries concurrently:

// Phase 1: Independent tasks
async let tokensToIdsTask = buildTokensToIds(...)
async let mergesTask = parseMerges(...)

// Phase 2: Dependent tasks (after Phase 1 completes)
async let bpeRanksTask = buildBpeRanks(...)
async let idsToTokensTask = buildIdsToTokens(...)
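
A fuller sketch of how the two phases might be awaited follows; the helper bodies, merge format handling, and return type are assumptions that mirror the names above, not the PR's exact code:

import Foundation

// Hypothetical sketch of the two-phase concurrent build.
func buildTokensToIds(_ vocab: NSDictionary) -> [String: Int] {
    var result = [String: Int](minimumCapacity: vocab.count)
    for (key, value) in vocab {
        if let token = key as? String, let id = (value as? NSNumber)?.intValue {
            result[token] = id
        }
    }
    return result
}

func parseMerges(_ merges: [Any]) -> [(String, String)] {
    // Assumes the older "a b" string format for merge entries.
    merges.compactMap { entry -> (String, String)? in
        guard let parts = (entry as? String)?.split(separator: " "), parts.count == 2 else { return nil }
        return (String(parts[0]), String(parts[1]))
    }
}

func buildBpeRanks(_ pairs: [(String, String)], tokensToIds: [String: Int]) -> [UInt64: Int] {
    var result = [UInt64: Int](minimumCapacity: pairs.count)
    for (rank, pair) in pairs.enumerated() {
        if let a = tokensToIds[pair.0], let b = tokensToIds[pair.1] {
            result[UInt64(UInt32(a)) << 32 | UInt64(UInt32(b))] = rank
        }
    }
    return result
}

func buildIdsToTokens(_ tokensToIds: [String: Int]) -> [Int: String] {
    Dictionary(uniqueKeysWithValues: tokensToIds.map { ($0.value, $0.key) })
}

func buildTables(vocab: NSDictionary, merges: [Any]) async
    -> (tokensToIds: [String: Int], idsToTokens: [Int: String], bpeRanks: [UInt64: Int]) {
    // Phase 1: independent child tasks run concurrently.
    async let tokensToIds = buildTokensToIds(vocab)
    async let mergePairs = parseMerges(merges)
    let (ids, pairs) = await (tokensToIds, mergePairs)

    // Phase 2: both depend on Phase 1's results, but not on each other.
    async let ranks = buildBpeRanks(pairs, tokensToIds: ids)
    async let reversed = buildIdsToTokens(ids)
    return await (ids, reversed, ranks)
}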

4. Conditional stringToId for Unicode Edge Cases

Added an optional stringToId fallback dictionary for tokenizers with Unicode edge cases (e.g., Gemma's BOM-prefixed tokens). Only built when needed — most tokenizers skip this entirely, saving ~50ms.
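
As a small sketch of the lookup side (the property and function names here are hypothetical, and the NSString-keyed primary table is an assumption based on the Unicode handling described in this PR), the fallback is consulted only when it was built and the primary lookup misses:

// Hypothetical sketch: try the NSString-keyed table first; fall back to the
// optional String-keyed table only if it exists (e.g. for BOM-prefixed tokens).
func tokenToId(_ token: NSString, tokensToIds: [NSString: Int], stringToId: [String: Int]?) -> Int? {
    if let id = tokensToIds[token] { return id }
    return stringToId?[token as String]
}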

Backward Compatibility

Standard usage benefits from the faster loading automatically:

let tokenizer = try await AutoTokenizer.from(pretrained: "model-name")
let tokenizer = try await AutoTokenizer.from(modelFolder: url)

Direct use of LanguageModelConfigurationFromHub continues to work unchanged. The default behavior preserves tokenizerData.model.vocab and tokenizerData.model.merges for backward compatibility.

For callers who want the performance optimization, opt in with stripVocabForPerformance: true and use the new properties:

let config = LanguageModelConfigurationFromHub(
    modelName: "model-name",
    stripVocabForPerformance: true
)
let vocab = try await config.tokenizerVocab   // NSDictionary for BPE
let merges = try await config.tokenizerMerges // [Any] for BPE

Custom Tokenizer Registration

Added AutoTokenizer.register(_:for:) for registering custom tokenizer classes:

AutoTokenizer.register(MyTokenizer.self, for: "MyCustomTokenizer")

This mirrors Python transformers' AutoTokenizer.register(), which populates REGISTERED_TOKENIZER_CLASSES for lookup by class name.

This makes it easy for downstream projects like mlx-swift-lm to use the fast path via AutoTokenizer.from() while still supporting custom tokenizer classes when needed.

Testing

All existing tests pass.

Alignment with Python

These optimizations align with patterns in the Hugging Face tokenizers library (Rust core with Python bindings):

  • Packed merge keys ≈ tokenizers Pair type with tuple hashing
  • Parallel building ≈ tokenizers Rayon-based parallelism
  • Raw JSON extraction ≈ tokenizers convert_to_native_format()
  • NSString for Unicode ≈ tokenizers byte-level preservation

Future Work

A new major version could make breaking changes for better ergonomics:

  • Async-first API: Tokenizer.load(from:) as primary entry point
  • Factory methods instead of protocol-mandated initializers
  • Hide Unicode complexity behind internal TokenStorage type

This would reduce BPETokenizer from 4 initializers to 1 factory method while maintaining performance.

@DePasqualeOrg DePasqualeOrg commented

This PR depends on #304, which should be merged first.
