convert : use reflinks for faster conversion #15727
Draft: compilade wants to merge 13 commits into compilade/convert-safetensors-parse from compilade/convert-reflinks
+384
−77
* convert : use direct copies when possible
Using os.copy_file_range where available, and falling back to shutil.copyfileobj otherwise.
* gguf : handle misaligned offset more cleanly
This matches the previous behavior for BF16 tensors.
* gguf-py : move reflinking functions to lazy
786b32d to e582f1a
76d2ab2 to 833d03c
(targets #15667 because file ranges are required)
This adds a --reflink option to convert_hf_to_gguf.py to allow using Copy-On-Write features on some filesystems (BTRFS, XFS, and likely ZFS).
For the models where it works, it makes conversion extremely fast, and saves quite a lot of disk space (because most of the resulting model shares extents with the source safetensors files).
With --verbose there is additional logging for when reflinking falls back to a copy because of misalignment.
Warning
This is experimental; the models produced with the --reflink option can have an alignment that is incompatible with what current mainline llama.cpp expects.
It might also produce broken models in some cases. Further testing is needed.
Results
Using --reflink with convert_hf_to_gguf.py will show the size after reflinking in the writing plan. This also works with --dry-run. Note that if the underlying filesystem or platform doesn't support reflinks, it will fall back to direct copies, but the reported size will still be the one expected with reflinking.
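The fallback behavior described above can be illustrated with a minimal sketch. This is not the PR's code: `copy_range` is a hypothetical helper, and it falls back to `os.pread`/`os.pwrite` (rather than `shutil.copyfileobj`, which the commit message mentions) only to keep the offsets explicit. It assumes a POSIX platform and raw file descriptors.

```python
import os


def copy_range(src_fd: int, dst_fd: int, src_offset: int, dst_offset: int, size: int) -> None:
    """Copy `size` bytes between two file descriptors at explicit offsets.

    Hypothetical helper: tries os.copy_file_range first (which lets the
    kernel share extents when the filesystem allows it), then finishes
    with a plain read/write loop if that is unavailable or fails.
    """
    copied = 0
    if hasattr(os, "copy_file_range"):  # Linux, Python >= 3.8
        try:
            while copied < size:
                n = os.copy_file_range(
                    src_fd, dst_fd, size - copied,
                    src_offset + copied, dst_offset + copied,
                )
                if n == 0:  # nothing more could be copied
                    break
                copied += n
        except OSError:
            pass  # e.g. unsupported filesystem; continue with the plain copy

    # Plain copy for whatever is left (possibly everything).
    while copied < size:
        chunk = os.pread(src_fd, min(size - copied, 1 << 20), src_offset + copied)
        if not chunk:
            raise EOFError("source file is shorter than the requested range")
        # pwrite may write fewer bytes than given; the next iteration re-reads
        # from the updated position, so nothing is skipped.
        copied += os.pwrite(dst_fd, chunk, dst_offset + copied)
```

Either path produces the same bytes in the destination; only the amount of shared storage differs, which is why the fallback is invisible in the output file itself.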
Some models are very easily reflinkable (e.g. OLMo-2-0325-32B-Instruct), while some are not (TODO: add more examples of poorly reflinkable models).
Generally, dense models which have no or very minimal tensor transformations in their modify_tensors part of the conversion should reflink really well.
For MoE models which require expert tensor stacking (e.g. Jamba, GLM-4.5-Air, etc.), part of the model is not reflinkable because of incompatible alignment of the stacked tensor parts. This is inevitable, and the best that can be done is to pick the most common alignment out of the stacked parts, and copy what can't be reflinked.
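One way to picture "pick the most common alignment" for a stacked tensor is the sketch below. It is an illustration under assumptions, not the PR's implementation: `pick_stack_alignment` is a hypothetical name, the 4096-byte block size is assumed, and votes are weighted by bytes rather than by part count. Each part is placed right after the previous one in the destination, so every part "wants" a particular in-block offset for the start of the stack.

```python
from collections import Counter


def pick_stack_alignment(parts, block_size=4096):
    """Pick the in-block offset for the start of a stacked tensor.

    `parts` is a non-empty list of (src_offset, size) tuples in stacking
    order. Returns (residue, matching_bytes): the start offset modulo
    block_size that suits the most bytes, and how many bytes of the stack
    come from parts agreeing with it (an upper bound on what can be shared).
    """
    votes = Counter()
    prefix = 0  # position of the current part inside the stacked tensor
    for src_offset, size in parts:
        # For this part to line up, the stack must start at an offset
        # congruent to (src_offset - prefix) modulo the block size.
        votes[(src_offset - prefix) % block_size] += size
        prefix += size
    return votes.most_common(1)[0]
```

For example, with `parts = [(4096, 8192), (20480, 8192), (100, 8192)]` the first two parts agree on residue 0 and the third wants residue 100, so the function returns `(0, 16384)` and the third part would be copied without sharing.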
The other case where something isn't reflinkable is when file-range tracking isn't implemented for an operation, like permutes (which are used for some of the Attention tensors of Mistral-7B-v0.1) and splitting (which is used for Bloom).
And of course the 1D F32 tensors (e.g. the norms) aren't reflinked because they have a different type than the source file.
Why this shouldn't be possible
My first iteration of this used plain os.copy_file_range and directly gave it the file offsets. This did not work (compsize showed no significant sharing), because apparently the COW filesystems require the blocks to be aligned (for BOTH the source and the destination), or else no reflink will be made.
See the EINVAL entry in https://manpages.debian.org/trixie/manpages-dev/FICLONERANGE.2const.en.html#EINVAL
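To make the block-alignment constraint concrete, here is a small sketch of how a byte range splits into an unaligned head, a block-aligned middle, and an unaligned tail. Only the middle is a candidate for sharing. The function name and the 4096-byte block size are assumptions for illustration; real filesystems may use a different block size.

```python
def split_for_reflink(offset: int, size: int, block_size: int = 4096):
    """Split a byte range into (head, middle, tail) lengths.

    head:   bytes before the first block boundary (must be copied)
    middle: whole blocks (candidate for extent sharing)
    tail:   bytes after the last whole block (must be copied)
    """
    head = min((-offset) % block_size, size)
    middle = (size - head) // block_size * block_size
    tail = size - head - middle
    return head, middle, tail
```

With `split_for_reflink(100, 10000)` the result is `(3996, 4096, 1908)`: less than half of the range consists of whole source blocks, and even those can only be shared if they also land on a block boundary in the destination.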
Why this is possible
It's possible to "cheat" so that the block alignment offset of the destination file matches the source file. Obviously, this means a model converted with --reflink will not have the same alignment as one converted without --reflink.
This is only possible because of general.alignment, which affects the alignment of the start of the data offsets. Otherwise aligning filesystem block offsets would be much more complicated, because the offsets would depend on the size of the metadata (which the offsets are part of). Technically, this is also possible the other way around, because the safetensors format allows arbitrary padding for the metadata (which can be made aligned), but it would require a custom writer. That is out of scope for this PR and only a fun fact.
I've also made BF16 not round-trip to F32 for easier file-range tracking, and made --outtype auto attempt to preserve the source types instead of guessing from the first tensor.
--outtype auto is the new default, because it's likely what most people expect when converting a model without specifying the type.
What works and what doesn't
File ranges are tracked for very simple operations, like type-views, reshapes, and stacking.
It's not yet implemented for tensor splits, but could be.
Fallback to a non-direct copy without reflinks is implemented and should be relatively robust.
For some models, not all the ranges of a stacked tensor can be copied with the same block alignment offset. In that case, the best one is used, but that means up to half of the stacked tensor is copied without reflink sharing.
Permutes, transposes, and similar are not supported (and probably won't be), and fall back to not tracking the file ranges.
Notably, GPT-OSS has transposed MoE tensors in its BF16 version, and so it doesn't really benefit from reflinks.
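As a rough illustration of the "same block alignment offset" idea referred to above, the sketch below finds the first destination offset at or after a given position whose in-block offset matches the source's. It is not taken from the PR: `matching_dst_offset` is a hypothetical name, and the 4096-byte block size is an assumption.

```python
def matching_dst_offset(min_dst_offset: int, src_offset: int, block_size: int = 4096) -> int:
    """Smallest dst >= min_dst_offset with dst % block_size == src_offset % block_size.

    The gap between min_dst_offset and the result would be filled with
    padding, which is why the resulting file layout differs from a
    conversion done without this trick.
    """
    return min_dst_offset + (src_offset - min_dst_offset) % block_size
```

For instance, `matching_dst_offset(5000, 100)` gives `8292`, and `8292 % 4096 == 100`, so block boundaries in the source range fall on block boundaries in the destination.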
TODO
compsize /path/to/model.gguf /path/to/model_dir
os.copy_file_range
TENSOR_ALIGNMENT to 8 causes problems in backends which use that constant (cpu, repack, amx, kleidiai)