gguf-py: reduce peak RAM during convert by streaming dtype casts #15648
base: master
Conversation
… env knob; remove duplicate imports
Chunking

The max memory usage of the lazy writer should already be around (at most 3×) the size of the biggest tensor (because of BF16, and/or transformations which can't necessarily be streamed). The peak extra memory used by […]. When chunking […].

(* BF16 and Q8_0 don't use […])

The most common case where chunking […]. Avoiding the BF16 round-trip to F32 would save more memory than chunking […]. Note that […]:

llama.cpp/gguf-py/gguf/quants.py, lines 56 to 62 (at e8d99dd)

Note that […]. There could be a way to make something like what you're suggesting more general, though. I think the idea of minimizing what is materialized is interesting. Making it more general might require tracking file ranges (including when stacking MoE experts), and figuring out when they're invalidated, but it would allow avoiding the BF16 round-trip to F32 when possible (simpler solutions for that likely exist, though), and would notably allow using […].
I've tested this with https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 (a BF16 MoE model) in the […].

On `master`:

```
$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/downloaded_model_dir --outfile Qwen3-30B-A3B-Instruct-2507-F16-master.gguf
...
Maximum resident set size (kbytes): 9165048
...
```

With this PR:

```
$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/downloaded_model_dir --outfile Qwen3-30B-A3B-Instruct-2507-F16-chunked-astype.gguf
...
Maximum resident set size (kbytes): 9164960
...
```

I'm not noticing a significant change. Both are using a peak of around 9 GiB, within 88 kilobytes of each other.

@igloo58 How did you measure peak memory usage? Is there another way to reliably measure that than with GNU `time`? (On macOS, I think the equivalent is […].)

EDIT: here's a table with more results. Still no significant reduction in memory usage.

[…]

(note that small deltas are probably noise from fragmentation in the memory allocations)
Hey @compilade, thanks for the follow-up!

About your results: […]

How I measured (macOS): […]

I'll re-run on Linux with GNU `time -v` so we're apples-to-apples with your numbers.

Next steps: I'll take […]. If you'd like me to use a specific model to exercise the pure-cast path, I'm happy to match that exactly.
@igloo58 Nice to learn that you did use GNU `time` […].

https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked is a model in F16. Testing with GNU `time -v`:
```
$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-master.gguf --outtype f32
...
Maximum resident set size (kbytes): 5389496
...
```

This PR:

```
$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-chunked-astype.gguf --outtype f32
...
Maximum resident set size (kbytes): 5499200
...
```

There is no decrease in memory usage (there's an increase of 107 MiB, but maybe it's noise). I guess this is because the chunk size of 256 MiB is too big to make the potential reduction from chunking noticeable. Let's try with a chunk size of 16 MiB (by making […]):

```
$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-chunked16MiB-astype.gguf --outtype f32
...
Maximum resident set size (kbytes): 5382132
...
```

This is smaller than […]. I also checked that the code path for chunked […].

If you have another suitable model, I can also match it. I think the best would be one in F16 and with a huge vocab so that the token embeddings tensor is enormous. The rest of the model doesn't have to be big. TriLM_3.9B only has 50688 tokens in its vocab, so it might not be ideal for this test.
Hey @compilade, thanks for the detailed follow-ups and for steering toward a pure-cast benchmark. I ran a focused set on Linux and the streaming path shows a clear drop in peak RSS.

Environment: […]

Model (pure-cast case): […]

Command: […]

Results (Max RSS, kB): […]

Takeaways: […]

Happy to: […]
@igloo58 Good to see an example use-case with real effects!

Regarding the default chunk size, I think it depends on how much RAM the users of the conversion script are expected to have. Before this change (or maybe with #15667), it was […]. For an example of the overheads, with TriLM-3.9B, I think […].

My main concern with this change is that the way it's implemented (wrapping […]) […]. I don't know how to keep both chunked […] and […] at the same time. What would be needed (to make both work together) might be a way to have chunked lazy tensors? i.e. lazy tensors which can be materialized in chunks? Kind of a […].

This is kind of related to the file-range tracking idea at the end of #15648 (comment), because both problems need some kind of mapping between the input tensor(s) and the materialized tensor, but the chunked lazy tensors problem doesn't necessarily require that the data comes from a file. Both problems need to be able to fall back to the normal behavior when they can't handle a particular transformation (and I think the problematic transformations would be the same or very similar for both problems). They have different goals, though (file-range tracking wants the biggest chunks possible, while lazy chunking wants a limited chunk size, and maybe not all chunk sizes are possible, especially for quantized data).
…debug logs

- New helper: gguf/stream_cast.py with write_cast(fp, src_arr, dst_dtype, chunk_mb) that writes src.astype(dst) in fixed-size chunks to cap peak RSS.
- lazy.py:
  - tag LazyNumpyTensor.astype() results (_gguf_stream_cast, _gguf_stream_cast_dtype)
  - tofile() streams via write_cast when the node is a pure dtype cast; otherwise falls back.
  - env vars: GGUF_CAST_CHUNK_MB (default 64) and GGUF_STREAM_LOG (opt-in diagnostics).
- gguf_writer.py: call write_cast directly when the tensor is a tagged pure cast. This keeps the benefit even if future changes bypass tofile() / use multi-threaded writes.
- Alignment: preserve data_alignment by padding before/after writes.
- Repro notes: Ubuntu 24.04 / Python 3.12 / NumPy 2.1; bloom-560m FP16→F32 conversion shows peak RSS reductions when chunking (e.g., 256→64→32→16 MiB) with small runtime trade-offs at smaller chunks. macOS run logs confirm [gguf-stream] activation as well.
- Scope & limitations: only pure dtype casts; MoE stacking / complex transforms fall back.
- Future work (separate RFC/PR): “chunked lazy tensors” and file-range tracking compatible with multi-threaded writes.
@compilade Here's a quick update after your last note. Thanks again for the thoughtful review!

What I changed in this PR (now pushed): […]

```
[gguf-stream] streaming cast write: chunk=64 MiB; dst=float32; shape=(250880, 1024)
```

(That's the big token_embd.weight in bloom-560m.)

Results I'm seeing: […]

On your concerns: […]

Questions for you: […]

Thanks again! Open to any tweaks you'd like as well.
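For reference, a minimal sketch of what a chunked-cast helper along these lines could look like. Only the names `write_cast` and `GGUF_CAST_CHUNK_MB` come from the commit message above; the signature details, the default, and the flattening strategy are assumptions, not the actual implementation:

```python
import os
from typing import BinaryIO

import numpy as np

# Read the knob once at import time (64 MiB default taken from the commit message).
_CHUNK_MB = max(1, int(os.environ.get("GGUF_CAST_CHUNK_MB", "64") or "64"))


def write_cast(fp: BinaryIO, src_arr: np.ndarray, dst_dtype: np.dtype, chunk_mb: int = _CHUNK_MB) -> None:
    """Write src_arr.astype(dst_dtype) to fp without materializing the full converted array."""
    flat = src_arr.reshape(-1)  # 1-D view (or a copy if non-contiguous) so slicing is simple
    elems = max(1, (chunk_mb * 1024 * 1024) // np.dtype(dst_dtype).itemsize)
    for start in range(0, flat.size, elems):
        # Only one converted chunk exists at a time, so the extra memory is bounded by chunk_mb.
        chunk = flat[start:start + elems].astype(dst_dtype, copy=False)
        fp.write(chunk.data)  # write from the chunk's memoryview; no extra bytes() copy
```

With a 64 MiB chunk, the transient allocation during the write stays around one chunk of the destination dtype instead of the whole converted tensor.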
```python
except Exception:
    mb = 64
tgt_dtype = getattr(tensor, "_gguf_stream_cast_dtype", src_arr.dtype)
_stream_log(f"writer: streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")
```
Consider using `logger.debug` instead of reinventing conditional debug logging:

```diff
-_stream_log(f"writer: streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")
+logger.debug(f"streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")
```
```python
def _slog(msg: str) -> None:
    if os.environ.get("GGUF_STREAM_LOG"):
        print(f"[gguf-stream] {msg}", file=sys.stdout, flush=True)
```
The uses of `_slog` should also probably be replaced with `logger.debug`. Using an undocumented environment variable isn't very user-friendly. (unless the logs are not intended to be printed)
```python
fout = self.fout[file_id]

# pop the first tensor info
# TODO: cleaner way to get the first key
```
There's no reason to remove that comment. There's nothing which attempted to fix the stated TODO.
```python
# align to data_alignment before writing tensor data
self.write_padding(fout, fout.tell())

# --- writer-side streaming for pure dtype casts (survives when tofile() isn't used) ---
```
Maybe it would be simpler to keep using `tofile`. It would be more convenient at least. I think it's possible to make #12837 use `tofile`, and that would remove the need to find an alternative to overriding `tofile`.
```python
setattr(out, "_gguf_stream_cast", True)
setattr(out, "_gguf_stream_cast_dtype", tgt)
# NEW: record the *source* lazy tensor for writer-side streaming
setattr(out, "_gguf_stream_cast_src", self)
```
Alternatively, a single attr could be used containing e.g. a `tuple[LazyNumpyTensor, np.dtype]`, since they are expected to all exist when one of them does anyway.

Using three separate attrs seems excessive (especially since the existence one is redundant with the other ones existing).

This should also simplify (and remove the need for) most of the edge-case handling for missing values (e.g. the missing base array).
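To make the suggestion concrete, a rough sketch of the single-attribute variant (the attribute name is kept from the PR, but the stand-in class and helper functions are hypothetical, only to show the shape of the idea):

```python
import numpy as np


class LazyNode:
    """Stand-in for LazyNumpyTensor, only to illustrate the tagging."""


def tag_pure_cast(out: LazyNode, src: LazyNode, tgt) -> None:
    # One attribute bundles everything the writer needs; its presence alone
    # marks the node as a pure dtype cast, so no separate boolean flag is needed.
    out._gguf_stream_cast = (src, np.dtype(tgt))


def pure_cast_info(node: object):
    # Writer side: a single getattr with a None default replaces the
    # three separate existence checks.
    return getattr(node, "_gguf_stream_cast", None)


src, out = LazyNode(), LazyNode()
tag_pure_cast(out, src, np.float32)
info = pure_cast_info(out)
if info is not None:
    src_node, dst_dtype = info  # (src, dtype('float32'))
```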
```python
# Install patches
LazyNumpyTensor.astype = _gguf_streaming_astype
LazyNumpyTensor.tofile = _gguf_streaming_tofile
```
Would it be cleaner to directly modify the source code of `LazyNumpyTensor.astype` and `LazyNumpyTensor.tofile` instead of patching them?

Unless you want this to be disable-able, in which case a subclass (although not sure what name to use) could also be appropriate, and then it could be used in `LazyTorchTensor.numpy()`.

Are there cases where `astype` chunking shouldn't be used?

Assuming it's implemented correctly, I think the tag (used to detect whether to stream `astype` on chunks) will not be kept on transformations of the `LazyNumpyTensor` (because a new one is created to track the transformations), and so it should be safe in pretty much all cases.
```python
end = min(start + ce, n)
# copy=False avoids an extra tmp when possible
chunk = flat[start:end].astype(dst, copy=False)
fout.write(mv(chunk).tobytes())
```
Alternatively, directly use the `memoryview` of the `np.ndarray`, which is at `np.ndarray.data`:

```diff
-fout.write(mv(chunk).tobytes())
+fout.write(chunk.data)
```

I don't know if it would handle non-contiguous strides correctly, though; in this case the previous `.reshape(-1)` makes it contiguous anyway, so this should work.

There's also `tofile` :)

```diff
-fout.write(mv(chunk).tobytes())
+chunk.tofile(fout)
```

No idea of the performance difference of these approaches, but I think `.tobytes()` should be avoided because it returns "a copy of the raw contents of data memory".
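As a quick illustration of the difference (not from the PR; just standard NumPy behavior as I understand it): `.tobytes()` builds a separate bytes object before anything is written, writing the array's buffer directly avoids that intermediate copy, and `ndarray.tofile` wants a real file object rather than an arbitrary file-like one.

```python
import io

import numpy as np

arr = np.arange(1024, dtype=np.float32)

buf = io.BytesIO()
buf.write(arr.data)                     # writes straight from the array's memory (buffer protocol)
assert buf.getvalue() == arr.tobytes()  # same bytes; .tobytes() just made an extra copy first

with open("chunk.bin", "wb") as f:
    arr.tofile(f)                       # fine with a real file; io.BytesIO would not work here
```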
```python
# default chunk size: 64 MiB (can override via GGUF_CAST_CHUNK_MB)
try:
    mb = int(os.environ.get("GGUF_CAST_CHUNK_MB", "64") or "64")
except Exception:
    mb = 64
mb = max(1, mb)
```
Does it make sense to make this configurable at runtime? A default value should be fine here, I think?
Otherwise this is parsing environment variables at each written tensor. (very minor overhead, though)
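For what it's worth, a minimal sketch of the "parse once" alternative being suggested here (the constant name is made up; the env var is the one from this PR):

```python
import os

# Parsed a single time at module import, instead of once per written tensor.
try:
    DEFAULT_CAST_CHUNK_MB = max(1, int(os.environ.get("GGUF_CAST_CHUNK_MB", "64") or "64"))
except ValueError:
    DEFAULT_CAST_CHUNK_MB = 64
```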
@igloo58 This could be relatively easily adapted to direct copies (e.g. with […]). I've thought a bit about the usefulness and relatedness of lazy chunking and direct copying. It seems like there are separate use cases: […]

There's a limit to useful complexity, though, and maybe general lazy chunking is too complicated for what it brings. The simpler approach you're taking would be acceptable (tagging […]).

I think it's fine. But I'm not sure a completely different file is required, though. This feels like something which could be handled directly in lazy tensors instead of being patched-in (assuming the suggested simplifications are considered). Another thing which is a major point of the lazy tensors is that they are (almost) completely transparently handled, which means there is currently no lazy-tensor-specific code in […].

If it's simpler (and I suspect it would be), I think I can make #12837 use […].
Summary
Fixes #15623. When converting very large HF models to GGUF, we sometimes OOM even with `--use-temp-file`. The main culprit is a NumPy `astype(...)` on huge tensors, which materializes a full-sized temporary array before writing.

This PR teaches the GGUF lazy writer to stream pure dtype casts to disk in fixed-size chunks, capping peak RAM during the write.
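For intuition, a toy illustration of the difference (not the PR's code; the shape is borrowed from the bloom-560m token_embd.weight mentioned elsewhere in this thread, and the chunked loop is only a sketch):

```python
import os

import numpy as np

src = np.zeros((250880, 1024), dtype=np.float16)  # ~0.5 GiB of F16

# Plain cast: allocates the entire ~1 GiB F32 copy at once before any write.
full = src.astype(np.float32)

# Chunked cast-and-write: only one converted slice is alive at a time.
with open(os.devnull, "wb") as f:
    for block in np.array_split(src.reshape(-1), 16):
        f.write(block.astype(np.float32).data)
```

Each converted block here is roughly 60 MiB instead of the full ~1 GiB temporary.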
What’s changed
- `gguf-py/gguf/lazy.py`: detect lazy nodes that are only a dtype cast (`astype`) and, in `tofile()`, write them in chunks rather than materializing the whole array first.
- Env var `GGUF_CAST_CHUNK_MB` (default 256) to control the chunk size.

Behavioral notes

- Applies only to `astype`-only nodes; complex ops still use the existing path.

Motivation / background
Even with `--use-temp-file`, the pipeline could still OOM on 100B+ models because `astype` creates a large transient array prior to streaming. The new path avoids that by converting and writing in slices.

Local results (M1 Pro, macOS 15 / Python 3.10):

- maximum resident set size: ~2.51 GB; peak memory footprint: ~0.73 GB
- maximum resident set size: ~4.44 GB; peak memory footprint: ~2.62 GB

These numbers reflect bounded peaks during cast/write; without streaming, peaks can jump much higher on large layers.
Limitations / follow-ups

- […]

Testing

- Ran conversions (with `--use-temp-file`) and verified the GGUF loads and runs as expected.
- Tested `GGUF_CAST_CHUNK_MB` at 128/256/512 to confirm peak memory scales with chunk size.

Thanks!