Conversation

@compilade compilade (Collaborator) commented Sep 2, 2025

(targets #15667 because file ranges are required)

This adds a --reflink option to convert_hf_to_gguf.py to allow using copy-on-write (reflink) features on some filesystems (BTRFS, XFS, and likely ZFS).

For the models where it works, it makes conversion extremely fast and saves quite a lot of disk space (because most of the resulting model shares extents with the source safetensors files).

With --verbose there is additional logging for when reflinking falls back to a copy because of misalignment.
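
For context, a reflink copy asks the filesystem to share extents instead of duplicating bytes. A minimal sketch of the underlying primitive (not this PR's actual code), assuming Linux and Python ≥ 3.8 where os.copy_file_range is available:

```python
import os

def copy_range(src_fd: int, dst_fd: int, count: int, src_off: int, dst_off: int) -> None:
    """Copy count bytes; on CoW filesystems (BTRFS, XFS) aligned ranges
    can be reflinked (shared) instead of duplicated."""
    while count > 0:
        try:
            n = os.copy_file_range(src_fd, dst_fd, count, src_off, dst_off)
        except OSError:
            # unsupported filesystem or cross-device copy: plain read/write
            data = os.pread(src_fd, min(count, 1 << 20), src_off)
            n = os.pwrite(dst_fd, data, dst_off)
        if n == 0:
            break  # end of source file
        count, src_off, dst_off = count - n, src_off + n, dst_off + n
```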

Warning

This is experimental; models produced with the --reflink option can have incompatible alignment with what current mainline llama.cpp expects.
It might also produce broken models in some cases. Further testing is needed.

Results

Using --reflink with convert_hf_to_gguf.py will show the size after reflinking in the writing plan. This also works with --dry-run. Note that if the underlying filesystem or platform doesn't support reflinks, the conversion falls back to direct copies, but the size will still be shown as if reflinking were supported, even though it isn't.

| Model | Size | Unique size with `--reflink` | % of original size |
| --- | --- | --- | --- |
| [allenai/OLMo-2-0325-32B-Instruct](https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct) | 64.5 GB | 4.2 MB | 0.0065 % |
| [nvidia/NVIDIA-Nemotron-Nano-9B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2) | 17.8 GB | 229.0 MB | 1.27 % |
| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 2.5 GB | 168.0 MB | 6.72 % |
| [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 14.5 GB | 1.3 GB | 8.97 % |
| [bigscience/bloom-560m](https://huggingface.co/bigscience/bloom-560m) | 1.1 GB | 152.3 MB | 13.8 % |
| [TheDrummer/GLM-Steam-106B-A12B-v1](https://huggingface.co/TheDrummer/GLM-Steam-106B-A12B-v1) | 221.0 GB | 40.6 GB | 18.4 % |
| [ai21labs/AI21-Jamba-Mini-1.7](https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.7) | 103.2 GB | 26.3 GB | 25.5 % |

Some models are very easily reflinkable (e.g. OLMo-2-0325-32B-Instruct), while others are not (TODO: add more examples of poorly reflinkable models).
Generally, dense models with no or very minimal tensor transformations in the modify_tensors part of the conversion should reflink really well.
For MoE models which require expert tensor stacking (e.g. Jamba, GLM-4.5-Air, etc.), part of the model is not reflinkable because of incompatible alignment of the stacked tensor parts. This is inevitable, and the best that can be done is to pick the most common alignment among the stacked parts and copy what can't be reflinked (see the sketch below).
The other case where something isn't reflinkable is when file-range tracking isn't implemented, as for permutes (used for some of the attention tensors of Mistral-7B-v0.1) and splits (used for Bloom).
And of course the 1D F32 tensors (e.g. the norms) aren't reflinked, because their type differs from the one in the source file.
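
To make the stacking problem concrete, here's a hypothetical illustration with made-up sizes: experts stored back-to-back in a safetensors file only share a block residue when the part size happens to be a multiple of the filesystem block size, so at best the parts agreeing on one residue can be reflinked.

```python
# Hypothetical numbers, for illustration only.
block_size = 4096                  # typical filesystem block size
part_size = 1_573_376              # made-up per-expert byte size (not a multiple of 4096)
base = 1024                        # made-up offset of the first expert's data
offsets = [base + i * part_size for i in range(3)]
print([off % block_size for off in offsets])  # [1024, 1536, 2048] -> three different residues
```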

Why this shouldn't be possible

My first iteration of this used plain os.copy_file_range and directly gave it the file offsets. This did not work (compsize showed no significant sharing), because CoW filesystems apparently require the blocks to be aligned (for BOTH the source and the destination), or else no reflink is made.

From https://manpages.debian.org/trixie/manpages-dev/FICLONERANGE.2const.en.html#EINVAL:

> Disk filesystems generally require the offset and length arguments to be aligned to the fundamental block size.
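
So a range can only be cloned when the source and destination offsets line up with the block size. A hedged sketch of the condition that matters here, with a hard-coded block size as an assumption:

```python
# Hedged sketch: misaligned ranges make FICLONERANGE (and the reflink path
# of copy_file_range on CoW filesystems) fail or degrade to real copies.
BLK = 4096  # assumption; the real value is the filesystem's block size

def same_block_residue(src_off: int, dst_off: int, blk: int = BLK) -> bool:
    # Only when source and destination offsets agree modulo the block size
    # can the aligned middle of the range be cloned; the unaligned head and
    # tail still need an ordinary copy.
    return src_off % blk == dst_off % blk
```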

Why this is possible

It's possible to "cheat" so that the block-alignment offset of the destination file matches the source file. Obviously, this means a model converted with --reflink will not have the same alignment as one converted without it.

This is only possible because of general.alignment, which affects where the data offsets start. Otherwise, aligning filesystem block offsets would be much more complicated, because the offsets would depend on the size of the metadata (which the offsets are part of). Technically, this would also be possible the other way around, because the safetensors format allows arbitrary padding of the metadata (which could be made aligned), but that would require a custom writer and is out of scope for this PR; it's only a fun fact.
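
A hedged sketch of the padding arithmetic with hypothetical offsets: choose the padding (via general.alignment) so the destination data offset gets the same residue modulo the filesystem block size as the source tensor's offset.

```python
def pad_to_match(dst_off: int, src_off: int, blk: int = 4096) -> int:
    """Bytes of padding before dst_off so both offsets share a block residue."""
    return (src_off - dst_off) % blk

# Hypothetical offsets: source tensor bytes start at 72_192 in the
# safetensors file; the destination data section would start at 10_000.
pad = pad_to_match(10_000, 72_192)
assert (10_000 + pad) % 4096 == 72_192 % 4096
```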

I've also made BF16 no longer round-trip through F32, for easier file-range tracking, and made --outtype auto attempt to preserve the source types instead of guessing from the first tensor.

--outtype auto is the new default, because it's likely what most people expect when converting a model without specifying the type.

What works and what doesn't

File ranges are tracked for very simple operations, like type-views, reshapes, and stacking.
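
A hedged sketch of what that tracking can look like (illustrative names, not the PR's actual classes): byte-preserving operations propagate source ranges, layout-changing operations drop them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRange:
    filename: str
    offset: int  # byte offset into the source safetensors file
    length: int  # byte length

def after_reshape(ranges: list[FileRange]) -> list[FileRange]:
    return ranges  # bytes unchanged: ranges survive (same for type-views)

def after_stack(parts: list[list[FileRange]]) -> list[FileRange]:
    return [r for part in parts for r in part]  # parts concatenated in order

def after_permute(ranges: list[FileRange]) -> list[FileRange]:
    return []  # bytes reordered: tracking is lost, fall back to plain copy
```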

It's not yet implemented for tensor splits, but could be.

Fallback to a non-direct copy without reflinks is implemented and should be relatively robust.

For some models, not all the ranges of a stacked tensor can be copied with the same block alignment offset. In that case, the best one is used, but that means up to half of the stacked tensor is copied without reflink sharing.
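
A hedged sketch of that selection (illustrative, not the PR's code): count the block residues of the stacked parts' source offsets and keep the most common one.

```python
from collections import Counter

def best_block_residue(part_offsets: list[int], blk: int = 4096) -> int:
    # Parts whose source offset has this residue can be reflinked once the
    # destination is padded to match; the others are copied normally.
    return Counter(off % blk for off in part_offsets).most_common(1)[0][0]
```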

Permutes, transposes, and similar operations are not supported (and probably won't be); they fall back to not tracking file ranges.

Notably, GPT-OSS has transposed MoE tensors in its BF16 version, and so it doesn't really benefit from reflinks.

TODO

  • Test for correctness
  • Test on BTRFS
    • Data is shared according to compsize /path/to/model.gguf /path/to/model_dir
  • Test on ZFS
    • Does it work? Most resources online aren't clear about whether recent ZFS on Linux works with os.copy_file_range.
  • Test on XFS
  • Test on overlayfs (in docker/podman?)
  • Maybe track tensor splits
  • Figure out whether changing TENSOR_ALIGNMENT to 8 causes problems in backends which use that constant (cpu, repack, amx, kleidiai)
  • Cleaner handling of non-continuity in GGUF loading (to fix the gguf tests)
  • Test sharded reflinked model
  • Test pre-quantized conversion for regressions, and also the types in the logs


@compilade added the demo (Demonstrate some concept or idea, not intended to be merged) and python (python script changes) labels on Sep 2, 2025
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) label on Sep 2, 2025
@compilade force-pushed the compilade/convert-safetensors-parse branch from 786b32d to e582f1a on September 9, 2025
@compilade force-pushed the compilade/convert-reflinks branch from 76d2ab2 to 833d03c on September 9, 2025