Optimize model loading performance #34
Merged
davidkoski merged 4 commits into ml-explore:main on Jan 8, 2026
Conversation
Collaborator
Great idea! I will give this a look in the first week of Jan when I am back.
Contributor (Author)
This is now ready for review.
davidkoski approved these changes on Jan 8, 2026

davidkoski (Collaborator) left a comment:
Changes look great, thank you!
Parallel loading of tokenizer and weights
Tokenizer loading now runs concurrently with weight loading. For vision models, preprocessor_config.json is also loaded in parallel.
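A minimal sketch of what this can look like with Swift structured concurrency, assuming hypothetical `loadTokenizer(from:)` and `loadWeights(from:)` helpers and placeholder `Tokenizer`/`Weights` types (the real loaders in this repo are named differently):

```swift
import Foundation

// Placeholder types standing in for the real tokenizer and weight containers.
struct Tokenizer {}
struct Weights {}

// Hypothetical loaders; each does file I/O and parsing.
func loadTokenizer(from directory: URL) async throws -> Tokenizer {
    // ... read and parse tokenizer.json ...
    Tokenizer()
}

func loadWeights(from directory: URL) async throws -> Weights {
    // ... read the safetensors files ...
    Weights()
}

func loadModel(from directory: URL) async throws -> (Tokenizer, Weights) {
    // `async let` starts both child tasks immediately, so the tokenizer
    // parse overlaps with the (much larger) weight read instead of
    // running after it; we only suspend when awaiting the results.
    async let tokenizer = loadTokenizer(from: directory)
    async let weights = loadWeights(from: directory)
    return try await (tokenizer, weights)
}
```

For vision models, a third `async let` for the preprocessor config slots into the same pattern.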
Single config.json read
config.json was being read from disk twice during model loading. Now it's read once and the data is reused for both the base config and model-specific config decoding.
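A sketch of the single-read pattern, using stand-in `Decodable` types rather than the repo's actual configuration structs:

```swift
import Foundation

// Stand-in configuration types; the real structs decode many more fields.
struct BaseConfiguration: Decodable {
    let modelType: String
    enum CodingKeys: String, CodingKey { case modelType = "model_type" }
}

struct LlamaConfiguration: Decodable {
    let hiddenSize: Int
    enum CodingKeys: String, CodingKey { case hiddenSize = "hidden_size" }
}

func loadConfigurations(from directory: URL) throws -> (BaseConfiguration, LlamaConfiguration) {
    // One disk read...
    let data = try Data(contentsOf: directory.appendingPathComponent("config.json"))
    let decoder = JSONDecoder()
    // ...reused for both decodes, instead of reading the file again.
    let base = try decoder.decode(BaseConfiguration.self, from: data)
    let model = try decoder.decode(LlamaConfiguration.self, from: data)
    return (base, model)
}
```

Decoding the same `Data` twice is cheap compared with a second round trip to disk, and it keeps the two decoded views guaranteed to come from the same file contents.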
Benchmark results
I've added model loading benchmarks in a separate test target that won't run in CI.
To show the total improvement of all my optimizations, I ran the benchmark with #33 in this repo, including the pending optimizations #302, #303, and #304 in swift-transformers, and with the Hub client in offline mode to avoid a network call. After swift-transformers migrates to swift-huggingface for the Hub API, a model repo revision can be specified instead to avoid the network call.
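For reference, the shape of such a benchmark loop is roughly the following. This is an illustrative sketch reusing the hypothetical `loadModel(from:)` above, not the actual test target added in this PR:

```swift
import Foundation

// Illustrative timing harness: load the model several times and report
// wall-clock duration per run using a monotonic clock.
func benchmarkModelLoading(directory: URL, iterations: Int = 5) async throws {
    let clock = ContinuousClock()
    for run in 1...iterations {
        let start = clock.now
        _ = try await loadModel(from: directory)
        let elapsed = clock.now - start
        print("run \(run): \(elapsed)")
    }
}
```

Keeping the harness in a separate test target means the repeated multi-second loads don't slow down the regular CI test suite.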
I also measured the improvement from this PR alone, with swift-transformers in its current, unoptimized state.
The cumulative result of all my improvements is that model loading time goes from ~3900–4500 ms to ~300–360 ms.