Conversation

@compilade compilade (Collaborator) commented Sep 2, 2025

(targets #15667 because file ranges are required)

This adds a --reflink option to convert_hf_to_gguf.py to allow using copy-on-write (reflink) features on some filesystems (BTRFS, XFS, and likely ZFS).

For the models where it works, it makes conversion extremely fast and saves quite a lot of disk space (because most of the resulting model shares extents with the source safetensors files).

With --verbose there is additional logging for when reflinking falls back to a copy because of misalignment.
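
For context, a reflink copy asks the filesystem to share extents instead of duplicating bytes. A minimal sketch of the underlying primitive (not this PR's actual code), assuming Linux and Python ≥ 3.8 where os.copy_file_range is available:

```python
import os

def copy_range(src_fd: int, dst_fd: int, count: int, src_off: int, dst_off: int) -> None:
    """Copy count bytes; on CoW filesystems (BTRFS, XFS) aligned ranges
    can be reflinked (shared) instead of duplicated."""
    while count > 0:
        try:
            n = os.copy_file_range(src_fd, dst_fd, count, src_off, dst_off)
        except OSError:
            # unsupported filesystem or cross-device copy: plain read/write
            data = os.pread(src_fd, min(count, 1 << 20), src_off)
            n = os.pwrite(dst_fd, data, dst_off)
        if n == 0:
            break  # end of source file
        count, src_off, dst_off = count - n, src_off + n, dst_off + n
```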

Warning

This is experimental; models produced with the --reflink option can have incompatible alignment with what current mainline llama.cpp expects.
It might also produce broken models in some cases. Further testing is needed.

Results

Using --reflink with convert_hf_to_gguf.py will show the size after reflinking in the writing plan. This also works with --dry-run. Note that if the underlying filesystem or platform doesn't support reflinks, the conversion falls back to direct copies, but the size will still be shown as if reflinking were supported, even though it isn't.

| Model | Size | Unique size with `--reflink` | % of original size |
| --- | --- | --- | --- |
| [allenai/OLMo-2-0325-32B-Instruct](https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct) | 64.5 GB | 4.2 MB | 0.0065 % |
| [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) | 17.8 GB | 229.0 MB | 1.27 % |
| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 2.5 GB | 168.0 MB | 6.72 % |
| [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 14.5 GB | 1.3 GB | 8.97 % |
| [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) | 1.1 GB | 152.3 MB | 13.8 % |
| [TheDrummer/GLM-Steam-106B-A12B-v1](https://huggingface.co/TheDrummer/GLM-Steam-106B-A12B-v1) | 221.0 GB | 40.6 GB | 18.4 % |
| [ai21labs/AI21-Jamba-Mini-1.7](https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7) | 103.2 GB | 26.3 GB | 25.5 % |

Some models are very easily reflinkable (e.g. OLMo-2-0325-32B-Instruct), while others are not (TODO: add more examples of poorly reflinkable models).
Generally, dense models with no or very minimal tensor transformations in the modify_tensors part of the conversion should reflink really well.
For MoE models which require expert tensor stacking (e.g. Jamba, GLM-4.5-Air, etc.), part of the model is not reflinkable because of incompatible alignment of the stacked tensor parts. This is inevitable, and the best that can be done is to pick the most common alignment among the stacked parts and copy what can't be reflinked (see the sketch below).
The other case where something isn't reflinkable is when file-range tracking isn't implemented, as for permutes (used for some of the attention tensors of Mistral-7B-v0.1) and splits (used for Bloom).
And of course the 1D F32 tensors (e.g. the norms) aren't reflinked, because their type differs from the one in the source file.
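
To make the stacking problem concrete, here's a hypothetical illustration with made-up sizes: experts stored back-to-back in a safetensors file only share a block residue when the part size happens to be a multiple of the filesystem block size, so at best the parts agreeing on one residue can be reflinked.

```python
# Hypothetical numbers, for illustration only.
block_size = 4096                  # typical filesystem block size
part_size = 1_573_376              # made-up per-expert byte size (not a multiple of 4096)
base = 1024                        # made-up offset of the first expert's data
offsets = [base + i * part_size for i in range(3)]
print([off % block_size for off in offsets])  # [1024, 1536, 2048] -> three different residues
```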

Why this shouldn't be possible

My first iteration of this used plain os.copy_file_range and directly gave it the file offsets. This did not work (compsize showed no significant sharing), because CoW filesystems apparently require the blocks to be aligned (for BOTH the source and the destination), or else no reflink is made.

From https://manpages.debian.org/trixie/manpages-dev/FICLONERANGE.2const.en.html#EINVAL:

> Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size.
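
So a range can only be cloned when the source and destination offsets line up with the block size. A hedged sketch of the condition that matters here, with a hard-coded block size as an assumption:

```python
# Hedged sketch: misaligned ranges make FICLONERANGE (and the reflink path
# of copy_file_range on CoW filesystems) fail or degrade to real copies.
BLK = 4096  # assumption; the real value is the filesystem's block size

def same_block_residue(src_off: int, dst_off: int, blk: int = BLK) -> bool:
    # Only when source and destination offsets agree modulo the block size
    # can the aligned middle of the range be cloned; the unaligned head and
    # tail still need an ordinary copy.
    return src_off % blk == dst_off % blk
```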

Why this is possible

It's possible to "cheat" so that the block-alignment offset of the destination file matches the source file. Obviously, this means a model converted with --reflink will not have the same alignment as one converted without it.

This is only possible because of general.alignment, which affects where the data offsets start. Otherwise, aligning filesystem block offsets would be much more complicated, because the offsets would depend on the size of the metadata (which the offsets are part of). Technically, this would also be possible the other way around, because the safetensors format allows arbitrary padding of the metadata (which could be made aligned), but that would require a custom writer and is out of scope for this PR; it's only a fun fact.
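
A hedged sketch of the padding arithmetic with hypothetical offsets: choose the padding (via general.alignment) so the destination data offset gets the same residue modulo the filesystem block size as the source tensor's offset.

```python
def pad_to_match(dst_off: int, src_off: int, blk: int = 4096) -> int:
    """Bytes of padding before dst_off so both offsets share a block residue."""
    return (src_off - dst_off) % blk

# Hypothetical offsets: source tensor bytes start at 72_192 in the
# safetensors file; the destination data section would start at 10_000.
pad = pad_to_match(10_000, 72_192)
assert (10_000 + pad) % 4096 == 72_192 % 4096
```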

I've also made BF16 no longer round-trip through F32, for easier file-range tracking, and made --outtype auto attempt to preserve the source types instead of guessing from the first tensor.

--outtype auto is the new default, because it's likely what most people expect when converting a model without specifying the type.

What works and what doesn't

File ranges are tracked for very simple operations, like type-views, reshapes, and stacking.
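
A hedged sketch of what that tracking can look like (illustrative names, not the PR's actual classes): byte-preserving operations propagate source ranges, layout-changing operations drop them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRange:
    filename: str
    offset: int  # byte offset into the source safetensors file
    length: int  # byte length

def after_reshape(ranges: list[FileRange]) -> list[FileRange]:
    return ranges  # bytes unchanged: ranges survive (same for type-views)

def after_stack(parts: list[list[FileRange]]) -> list[FileRange]:
    return [r for part in parts for r in part]  # parts concatenated in order

def after_permute(ranges: list[FileRange]) -> list[FileRange]:
    return []  # bytes reordered: tracking is lost, fall back to plain copy
```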

It's not yet implemented for tensor splits, but could be.

Fallback to a non-direct copy without reflinks is implemented and should be relatively robust.

For some models, not all the ranges of a stacked tensor can be copied with the same block alignment offset. In that case, the best one is used, but that means up to half of the stacked tensor is copied without reflink sharing.
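
A hedged sketch of that selection (illustrative, not the PR's code): count the block residues of the stacked parts' source offsets and keep the most common one.

```python
from collections import Counter

def best_block_residue(part_offsets: list[int], blk: int = 4096) -> int:
    # Parts whose source offset has this residue can be reflinked once the
    # destination is padded to match; the others are copied normally.
    return Counter(off % blk for off in part_offsets).most_common(1)[0][0]
```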

Permutes, transposes, and similar operations are not supported (and probably won't be); they fall back to not tracking file ranges.

Notably, GPT-OSS has transposed MoE tensors in its BF16 version, and so it doesn't really benefit from reflinks.

TODO

  • Test for correctness
  • Test on BTRFS
    • Data is shared according to compsize /path/to/model.gguf /path/to/model_dir
  • Test on ZFS
    • Does it work? Most resources online aren't clear about whether recent ZFS on Linux works with os.copy_file_range.
  • Test on XFS
  • Test on overlayfs (in docker/podman?)
  • Maybe track tensor splits
  • Figure out whether changing TENSOR_ALIGNMENT to 8 causes problems in backends which use that constant (cpu, repack, amx, kleidiai)
  • Cleaner handling of non-continuity in GGUF loading (to fix the gguf tests)
  • Test sharded reflinked model
  • Test pre-quantized conversion for regressions, and also the types in the logs


@compilade added the demo (Demonstrate some concept or idea, not intended to be merged) and python (python script changes) labels on Sep 2, 2025
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) label on Sep 2, 2025
@compilade force-pushed the compilade/convert-safetensors-parse branch from 786b32d to e582f1a on September 9, 2025
@compilade force-pushed the compilade/convert-reflinks branch from 76d2ab2 to 833d03c on September 9, 2025