@DePasqualeOrg DePasqualeOrg commented Dec 27, 2025

Currently, tokenizer loading is a major performance bottleneck in Swift, typically taking ~1400 ms compared to ~100 ms in Python.

This PR optimizes tokenizer loading for a 3.6x speedup, saving ~1020 ms. #302 and #304 in swift-transformers and #21 in swift-huggingface add further performance gains.

Performance

Metric      Before     This PR    Speedup
Load time   1411 ms    391 ms     3.6x (~1020 ms saved)

Tested with the Qwen/Qwen3-0.6B-Base tokenizer (150k vocab, 150k merges) on an M3 MacBook Pro.

A benchmark is included in LoadingBenchmarks.swift, which can be removed before merging or kept to measure the impact of future changes. Run it with swift test --filter LoadingBenchmarks.

Problems with the Current Implementation

  1. Config wrapper overhead: Config.convertToBinaryDistinctKeys() recursively wraps every JSON value. For 150k vocab + 150k merges, this means 300k+ object allocations taking ~1 second — only to be immediately unwrapped.

  2. Expensive merge lookups: [BytePair: Int] uses string hashing for merge rank lookups, which is slow for 150k entries.

  3. Sequential initialization: Expensive dictionary building happens sequentially, leaving CPU cores idle.

Solutions Implemented in this PR

1. Config Bypass (~1000ms saved)

Extract vocab/merges directly from raw JSON before Config conversion, passing them to new fast-path initializers.
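
As a rough sketch of the idea (the function name and the assumed tokenizer.json layout are illustrative, not the PR's exact code), the fast path reads the raw dictionary once and pulls out the two large collections as Foundation containers, so the 300k+ entries are never re-wrapped one by one:

import Foundation

// Hypothetical sketch: extract "model.vocab" and "model.merges" from the raw
// tokenizer.json dictionary before the rest of it is wrapped in Config.
func extractVocabAndMerges(fromTokenizerJSON data: Data) throws -> (vocab: NSDictionary, merges: [Any])? {
    guard
        let root = try JSONSerialization.jsonObject(with: data) as? NSDictionary,
        let model = root["model"] as? NSDictionary,
        let vocab = model["vocab"] as? NSDictionary,
        let merges = model["merges"] as? [Any]
    else { return nil }
    // Returning Foundation containers avoids per-entry wrapper allocations.
    return (vocab, merges)
}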

2. Integer-Packed Merge Keys (~180ms saved)

Replace [BytePair: Int] (expensive string hashing) with [UInt64: Int] (fast integer hashing):

// Pack two token IDs into one UInt64
let key = UInt64(tokenIdA) << 32 | UInt64(tokenIdB)
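
For illustration (the dictionary and function names are hypothetical), a merge-rank lookup then hashes a single UInt64 instead of a composite string key:

// Hypothetical sketch of the lookup side; assumes token IDs fit in 32 bits.
func mergeRank(_ a: Int, _ b: Int, in bpeRanks: [UInt64: Int]) -> Int? {
    bpeRanks[UInt64(UInt32(a)) << 32 | UInt64(UInt32(b))]
}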

3. Parallel Dictionary Building (~106ms saved)

Use async let to build dictionaries concurrently:

// Phase 1: Independent tasks
async let tokensToIdsTask = buildTokensToIds(...)
async let mergesTask = parseMerges(...)

// Phase 2: Dependent tasks (after Phase 1 completes)
async let bpeRanksTask = buildBpeRanks(...)
async let idsToTokensTask = buildIdsToTokens(...)
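
A fuller sketch of how the two phases might be awaited follows; the helper bodies, merge format handling, and return type are assumptions that mirror the names above, not the PR's exact code:

import Foundation

// Hypothetical sketch of the two-phase concurrent build.
func buildTokensToIds(_ vocab: NSDictionary) -> [String: Int] {
    var result = [String: Int](minimumCapacity: vocab.count)
    for (key, value) in vocab {
        if let token = key as? String, let id = (value as? NSNumber)?.intValue {
            result[token] = id
        }
    }
    return result
}

func parseMerges(_ merges: [Any]) -> [(String, String)] {
    // Assumes the older "a b" string format for merge entries.
    merges.compactMap { entry -> (String, String)? in
        guard let parts = (entry as? String)?.split(separator: " "), parts.count == 2 else { return nil }
        return (String(parts[0]), String(parts[1]))
    }
}

func buildBpeRanks(_ pairs: [(String, String)], tokensToIds: [String: Int]) -> [UInt64: Int] {
    var result = [UInt64: Int](minimumCapacity: pairs.count)
    for (rank, pair) in pairs.enumerated() {
        if let a = tokensToIds[pair.0], let b = tokensToIds[pair.1] {
            result[UInt64(UInt32(a)) << 32 | UInt64(UInt32(b))] = rank
        }
    }
    return result
}

func buildIdsToTokens(_ tokensToIds: [String: Int]) -> [Int: String] {
    Dictionary(uniqueKeysWithValues: tokensToIds.map { ($0.value, $0.key) })
}

func buildTables(vocab: NSDictionary, merges: [Any]) async
    -> (tokensToIds: [String: Int], idsToTokens: [Int: String], bpeRanks: [UInt64: Int]) {
    // Phase 1: independent child tasks run concurrently.
    async let tokensToIds = buildTokensToIds(vocab)
    async let mergePairs = parseMerges(merges)
    let (ids, pairs) = await (tokensToIds, mergePairs)

    // Phase 2: both depend on Phase 1's results, but not on each other.
    async let ranks = buildBpeRanks(pairs, tokensToIds: ids)
    async let reversed = buildIdsToTokens(ids)
    return await (ids, reversed, ranks)
}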

4. Conditional stringToId for Unicode Edge Cases

Added an optional stringToId fallback dictionary for tokenizers with Unicode edge cases (e.g., Gemma's BOM-prefixed tokens). Only built when needed — most tokenizers skip this entirely, saving ~50ms.
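
As a small sketch of the lookup side (the property and function names here are hypothetical, and the NSString-keyed primary table is an assumption based on the Unicode handling described in this PR), the fallback is consulted only when it was built and the primary lookup misses:

// Hypothetical sketch: try the NSString-keyed table first; fall back to the
// optional String-keyed table only if it exists (e.g. for BOM-prefixed tokens).
func tokenToId(_ token: NSString, tokensToIds: [NSString: Int], stringToId: [String: Int]?) -> Int? {
    if let id = tokensToIds[token] { return id }
    return stringToId?[token as String]
}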

Backward Compatibility

Standard usage benefits from the faster loading automatically:

let tokenizer = try await AutoTokenizer.from(pretrained: "model-name")
let tokenizer = try await AutoTokenizer.from(modelFolder: url)

Direct use of LanguageModelConfigurationFromHub continues to work unchanged. The default behavior preserves tokenizerData.model.vocab and tokenizerData.model.merges for backward compatibility.

For callers who want the performance optimization, opt in with stripVocabForPerformance: true and use the new properties:

let config = LanguageModelConfigurationFromHub(
    modelName: "model-name",
    stripVocabForPerformance: true
)
let vocab = try await config.tokenizerVocab   // NSDictionary for BPE
let merges = try await config.tokenizerMerges // [Any] for BPE

Custom Tokenizer Registration

Added AutoTokenizer.register(_:for:) for registering custom tokenizer classes:

AutoTokenizer.register(MyTokenizer.self, for: "MyCustomTokenizer")

This mirrors Python transformers' AutoTokenizer.register(), which populates REGISTERED_TOKENIZER_CLASSES for lookup by class name.

This makes it easy for downstream projects like mlx-swift-lm to use the fast path via AutoTokenizer.from() while still supporting custom tokenizer classes when needed.

Testing

All existing tests pass.

Alignment with Python

These optimizations align with patterns in the Hugging Face tokenizers library (Rust core with Python bindings):

  • Packed merge keys ≈ tokenizers Pair type with tuple hashing
  • Parallel building ≈ tokenizers Rayon-based parallelism
  • Raw JSON extraction ≈ tokenizers convert_to_native_format()
  • NSString for Unicode ≈ tokenizers byte-level preservation

Future Work

A new major version could make breaking changes for better ergonomics:

  • Async-first API: Tokenizer.load(from:) as primary entry point
  • Factory methods instead of protocol-mandated initializers
  • Hide Unicode complexity behind internal TokenStorage type

This would reduce BPETokenizer from 4 initializers to 1 factory method while maintaining performance.

@DePasqualeOrg DePasqualeOrg commented

This PR depends on #304, which should be merged first.
