Conversation

igloo58 commented Aug 28, 2025

Summary

Fixes #15623. When converting very large HF models to GGUF, we sometimes OOM even with --use-temp-file. The main culprit is a NumPy astype(...) on huge tensors, which materializes a full-sized temporary array before writing.

This PR teaches the GGUF lazy writer to stream pure dtype casts to disk in fixed-size chunks, capping peak RAM during the write.

What’s changed

  • In gguf-py/gguf/lazy.py, detect lazy nodes that are only a dtype cast (astype) and, in tofile(), write them in chunks rather than materializing the whole array first (see the sketch after this list).
  • New env knob: GGUF_CAST_CHUNK_MB (default 256) to control the chunk size.
    • Example (macOS/Linux):
      GGUF_CAST_CHUNK_MB=128 python -u convert_hf_to_gguf.py <in_dir> --outfile out.gguf --outtype f16 --use-temp-file
    • Example (Windows PowerShell):
      $env:GGUF_CAST_CHUNK_MB="128"
      python .\convert_hf_to_gguf.py <in_dir> --outfile out.gguf --outtype f16 --use-temp-file
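
To make the mechanism concrete, here is a minimal sketch of the chunked cast-and-write (illustrative only: the helper name and call site are simplifications, and in the PR the chunk size comes from GGUF_CAST_CHUNK_MB rather than a parameter):

```python
import numpy as np

def _write_cast_chunked(fout, src: np.ndarray, dst_dtype, chunk_mb: int = 256) -> None:
    # Illustrative sketch: stream src.astype(dst_dtype) into an open binary file in
    # fixed-size chunks so the fully converted array is never materialized at once.
    flat = src.reshape(-1)  # note: this copies if src is not contiguous
    step = max(1, (chunk_mb << 20) // np.dtype(dst_dtype).itemsize)
    for start in range(0, flat.size, step):
        chunk = flat[start:start + step].astype(dst_dtype, copy=False)
        fout.write(chunk.data)  # memoryview of the contiguous chunk; no extra copy

# e.g. cast a large F16 array to F32 on disk without a full-size F32 temporary:
# with open("tensor.bin", "wb") as f:
#     _write_cast_chunked(f, big_f16_array, np.float32, chunk_mb=128)
```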

Behavioral notes

  • No GGUF format changes; output is identical to the eager path.
  • Only affects NumPy astype-only nodes; complex ops still use the existing path.
  • If a tensor is already in the target dtype, no extra work is done.

Motivation / background

Even with --use-temp-file, the pipeline could still OOM on 100B+ models because astype creates a large transient array prior to streaming. The new path avoids that by converting and writing in slices.

Local results (M1 Pro, macOS 15 / Python 3.10):

  • TinyLlama-1.1B → F16: wrote 2.20 GB
    • maximum resident set size: ~2.51 GB
    • peak memory footprint: ~0.73 GB
  • Qwen2.5-7B-Instruct → F16: wrote 15.2 GB
    • maximum resident set size: ~4.44 GB
    • peak memory footprint: ~2.62 GB

These numbers reflect bounded peaks during cast/write; without streaming, peaks can jump much higher on large layers.

Limitations / follow-ups

  • Only streams pure dtype casts; future work could stream other large, simple transforms.
  • Defaults to 256 MB per chunk; reviewers can suggest a different default if preferred.

Testing

  • Converted multiple models end-to-end (--use-temp-file) and verified GGUF loads and runs as expected.
  • Manually varied GGUF_CAST_CHUNK_MB (128/256/512) to confirm peak memory scales with chunk size.

Thanks!

github-actions bot added the python (python script changes) label on Aug 28, 2025

compilade (Collaborator) commented Aug 29, 2025

Chunking astype will only reduce memory usage in very simple cases where the tensors are not transformed before their type is changed, which excludes the case of MoE models which require stacking the experts tensors (because that doesn't directly involve astype). The stacked tensors are still materialized before being type-casted and/or written.

The max memory usage of the lazy writer should already be around (at most 3×) the size of the biggest tensor (because of BF16, and/or transformations which can't necessarily be streamed).

The peak extra memory used by astype (when not chunking writes), as a fraction of the input tensor size, is likely around:

| From/To | F32 | F16 | BF16 | Q8_0 |
|---------|-----|-----|------|------|
| F32 | | 0.5× | 0.5× * | 0.27× * |
| F16 | | | 1× * | 0.53× * |
| BF16 | 2× * | 3× * | 3× * | 2.53× * |

When chunking astype, the peak extra memory usage would tend towards (in ideal conditions; still relative to the input tensor size):

| From/To | F32 | F16 | BF16 | Q8_0 |
|---------|-----|-----|------|------|
| F32 | | | 0.5× * | 0.27× * |
| F16 | | | 1× * | 0.53× * |
| BF16 | 2× * | 2× * | 3× * | 2.53× * |

(* BF16 and Q8_0 don't use astype, and BF16 round-trips to F32 before .numpy() is called on the Torch tensor. Conversion back to BF16 is handled by gguf.quantize().)

The most common case where chunking astype would be beneficial is BF16 -> F16.
The current behavior for that case is BF16 -> F32 -> F16, and chunking astype would only save the last -> F16 part (which is 1x the size of the input tensor).
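
To make the arithmetic concrete (relative to the BF16 input size): the F32 round-trip temporarily holds about 2× extra, and fully materializing the F16 output adds another ~1×, giving the ~3× peak in the first table; chunking only that final F32 -> F16 astype avoids holding the whole ~1× output at once, which is the 3× -> 2× change in the second table.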

Avoiding the BF16 round-trip to F32 would save more memory than chunking astype (although I'm not sure to what extent avoiding going through F32 is possible). I still think it's worthwhile to attempt to reduce what is materialized (similarly to what you're suggesting): ideally the conversion script would use no memory and be infinitely fast, and this would get it closer to that ideal memory-wise.

Note that copy=False is used for F16 and F32 in gguf.quantize():

def quantize(data: np.ndarray, qtype: GGMLQuantizationType) -> np.ndarray:
    if qtype == GGMLQuantizationType.F32:
        return data.astype(np.float32, copy=False)
    elif qtype == GGMLQuantizationType.F16:
        return data.astype(np.float16, copy=False)
    elif (q := _type_traits.get(qtype)) is not None:
        return q.quantize(data)

Even with --use-temp-file, the pipeline could still OOM on 100B+ models because astype creates a large transient array prior to streaming.

Note that --use-temp-file doesn't really use less memory than lazy conversion (unless lazy conversion leaks), and is a legacy option which was kept in case lazy conversion misbehaves (its main use-case is with --no-lazy).

There could be a way to make something like what you're suggesting more general, though. I think the idea of minimizing what is materialized is interesting. Making it more general might require tracking file ranges (including when stacking MoE experts), and figuring out when they're invalidated, but it would allow avoiding the BF16 round-trip to F32 when possible (simpler solutions for that likely exist, though), and would notably allow using os.copy_file_range which could potentially lead to near-instantaneous conversion on Copy-On-Write filesystems.

compilade (Collaborator) commented Aug 29, 2025

I've tested this with https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 (a BF16 MoE model) in the BF16 -> F16 case, measuring peak memory use with time -v (but not the built-in one from bash, this is GNU time)

On master:

$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/downloaded_model_dir --outfile Qwen3-30B-A3B-Instruct-2507-F16-master.gguf
...
        Maximum resident set size (kbytes): 9165048
...

With this PR:

$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/downloaded_model_dir --outfile Qwen3-30B-A3B-Instruct-2507-F16-chunked-astype.gguf
...
        Maximum resident set size (kbytes): 9164960
...

I'm not noticing a significant change. Both are using a peak of around 9GiB within 88 kilobytes of each other.

@igloo58 How did you measure peak memory usage? Is there another way to reliably measure that than with GNU time?

(on MacOS, I think the equivalent is /usr/bin/time -l, see https://stackoverflow.com/a/46874737)

EDIT: here's a table with more results. Still no significant reduction in memory usage.

| Model | Temp file | Target type | peak RSS on master (kbytes) | peak RSS on this PR (kbytes) | Delta (kbytes) |
|---|---|---|---|---|---|
| https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507 | No | F16 | 9 165 048 | 9 164 960 | -88 |
| https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 | No | F16 | 2 653 788 | 2 666 408 | +12 620 |
| https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Yes | F16 | 2 539 428 | 2 543 688 | +4 260 |
| https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 | No | BF16 | 2 656 412 | 2 656 732 | +320 |
| https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0 | No | Q8_0 | 2 653 976 | 2 654 532 | +556 |

(note that small deltas are probably noise from fragmentation in the memory allocations)

igloo58 (Author) commented Aug 29, 2025

Hey @compilade thanks for the follow-up!

About your results
The ~±0–12 MB deltas you’re seeing make sense if the streaming path didn’t trigger. This PR only kicks in for pure NumPy astype(...) nodes that reach the lazy writer. In the cases you tested:
• Qwen3-30B-A3B (BF16→F16, MoE): tensors are stacked / BF16 round-tripped before write, so there’s no pure astype to stream.
• TinyLlama F16→F16 / BF16 / Q8_0: either no cast, or conversions happen before the writer (or via paths that don’t use np.astype). In those paths my change is effectively a no-op, which matches your table.

How I measured (macOS)

/usr/bin/time -l python3 convert_hf_to_gguf.py <model_dir> --outfile <out.gguf> 2>&1 | tee logs/run.txt
# I read both “maximum resident set size (kbytes)” and “peak memory footprint (bytes)”.

I’ll re-run on Linux with GNU time -v so we’re apples-to-apples with your numbers.

Next steps I’ll take
1. Force a pure cast to isolate the path and compare with GNU time -v, e.g.:
• Model originally F16 → run with --outtype f32 (exercises np.astype(np.float32)).
• Model originally F32 → run with --outtype f16 (exercises np.astype(np.float16)).
I’ll post before/after logs for both.
2. Add a debug knob (e.g. GGUF_DEBUG_STREAM=1) that logs when the streaming-cast path is used (dtype→dtype and chunk size). That will let us verify quickly whether a given model actually hits it.
3. Follow-up work (separate PR): prototype avoiding the BF16→F32 materialization (either preserve BF16 or stream BF16→F16 directly). That should help the MoE/BF16 case you tested far more than chunking astype.

If you’d like me to use a specific model to exercise the pure-cast path, I’m happy to match that exactly.

compilade (Collaborator) commented Aug 29, 2025

@igloo58 Nice to learn that you did use time (even though it's the BSD one in MacOS).

GNU time doesn't have a "peak memory footprint" section, though, only "Maximum resident set size". I wonder if there's a way to get something similar in Linux.

Model originally F16 → run with --outtype f32 (exercises np.astype(np.float32)).

https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked is a model in F16.

Testing with GNU time, and converting that directly to F32 (which makes a 16 GiB file):

master:

$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-master.gguf --outtype f32
...
        Maximum resident set size (kbytes): 5389496
...

This PR:

$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-chunked-astype.gguf --outtype f32
...
        Maximum resident set size (kbytes): 5499200
...

| Model | Temp file | Target type | peak RSS on master (kbytes) | peak RSS on this PR (kbytes) | Delta (kbytes) |
|---|---|---|---|---|---|
| https://huggingface.co/SpectraSuite/TriLM_3.9B_Unpacked | No | F32 | 5 389 496 | 5 499 200 | +109 704 |

There is no decrease in memory usage (there's an increase of 107MiB, but maybe it's noise). I guess this is because the chunk size of 256 MiB is too big to make the potential reduction from chunking noticeable.

Let's try with a chunk size of 16 MiB (by making mb always 16):

$ $(which time) -v python3 convert_hf_to_gguf.py /srv/models/src/TriLM_3.9B_Unpacked/ --outfile /srv/models/tmp/TriLM_3.9B_Unpacked-F32-chunked16MiB-astype.gguf --outtype f32
...
        Maximum resident set size (kbytes): 5382132
...

This is smaller than master, but only by around 7 MiB, which could be considered negligible.

I also checked that the code path for chunked astype is reached, and it is (at least in the F16 -> F32 case).

If you’d like me to use a specific model to exercise the pure-cast path, I’m happy to match that exactly.

If you have another suitable model, I can also match it. I think the best would be one in F16 and with a huge vocab so that the token embeddings tensor is enormous. The rest of the model doesn't have to be big. TriLM_3.9B only has 50688 tokens in its vocab, so it might not be ideal for this test.

igloo58 (Author) commented Aug 29, 2025

Hey @compilade thanks for the detailed follow-ups and for steering toward a pure-cast benchmark. I ran a focused set on Linux and the streaming path shows a clear drop in peak RSS.


Environment

  • OS: Ubuntu 24.04 (x86_64)
  • Python: 3.12.2
  • NumPy / PyTorch: 2.0.2 / 2.4.0
  • GNU time: /usr/bin/time -v
  • Branch: fix/gguf-stream-cast
  • TOKENIZERS_PARALLELISM=false


Model (pure-cast case)

  • bigscience/bloom-560m (F16) → --outtype f32

  • Link: https://huggingface.co/bigscience/bloom-560m

    Rationale: no MoE stacking; large vocab (≈250k) so the token embeddings tensor is big, and the path hits np.astype(np.float32).


Command

/usr/bin/time -v python3 convert_hf_to_gguf.py ~/models/bloom-560m \
  --outfile ~/gguf_out/bloom560m_f32_<tag>.gguf \
  --outtype f32
# for PR runs set:
#   GGUF_CAST_CHUNK_MB=<N>  (tested 8–256)

Results (Max RSS, kB)

| run | chunk | Max RSS (kB) | elapsed |
|---|---|---|---|
| bloom560m_f32_master.txt | master | 2,639,144 | 0:08.41 |
| bloom560m_f32_pr_256mb.txt | 256 MB | 2,085,458 | 0:06.63 |
| bloom560m_f32_pr_128mb.txt | 128 MB | 1,825,692 | 0:05.19 |
| bloom560m_f32_pr_64mb.txt | 64 MB | 1,717,280 | 0:08.29 |
| bloom560m_f32_pr_32mb.txt | 32 MB | 1,662,432 | 0:07.24 |
| bloom560m_f32_pr_16mb.txt | 16 MB | 1,616,080 | 0:06.80 |
| bloom560m_f32_pr_8mb.txt | 8 MB | 1,612,380 | 0:04.66 |

Takeaways

  • Streaming astype reduces peak RSS by ~21% (256 MB) to ~39% (8–16 MB) vs. master on this pure-cast workload (see the rough size check after this list).
  • Diminishing returns below ~32 MB; 32–64 MB looks like a sensible default trade-off here.
  • Elapsed times vary inconsistently (e.g., 256 MB faster than master, but 64 MB slower); this is likely noise or I/O overhead from chunking, so treat smaller chunks as a runtime trade-off.
  • As we discussed, MoE/stacking and BF16 round-trip paths won't benefit; this PR is intentionally scoped to the pure-cast case.
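
As a back-of-the-envelope check on those numbers: bloom-560m's token embedding is (250880, 1024), i.e. roughly 0.5 GB in F16 and roughly 1 GB once cast to F32, so the ~1 GB gap between the master peak and the 8 MB-chunk peak is about what you'd expect from no longer materializing that single F32 output in one piece.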


Happy to:

  • Flip the default chunk to 64 MB (or 32 MB) if you prefer.

  • Attach raw logs or re-run with any alternative model you suggest.

  • Follow up with a separate PR to avoid the BF16→F32 materialization where possible (or preserve BF16), per your notes.

compilade (Collaborator) commented Aug 30, 2025

@igloo58 Good to see an example use-case with real effects! bloom-560m really does have a big vocab.

Regarding the default chunk size, I think it depends on how much RAM the users of the conversion script are expected to have. Before this change (or maybe with #15667), it was overhead + biggest converted tensor (assuming mmaped input tensor, otherwise also add that), and with this change it is closer to overhead + chunk size, at least in the cases where it's applicable (which means excluding MoE models which require stacking, because in that case the worst case is unchanged).

For an example of the overheads, with TriLM-3.9B, overhead seems to be around 200 MiB and comes from the size of torch, transformers, and also the 8.5 MiB vocab (according to a memray flamegraph).

I think 256 MiB is acceptable, but of course making it lower can be beneficial (unless it's so small the Python loop becomes a bottleneck). If you want to make it 64 MiB or 32 MiB, sure.


My main concern with this change is that the way it's implemented (wrapping astype and tofile) means it will break (i.e. be bypassed) with changes like #12837 in which tofile is not used anymore. Another concern is that it's only beneficial in very limited use-cases (although #15667 will likely make this PR have measurable impact in more cases).

I don't know how to keep both chunked astype as proposed here and multi-threaded writes (from #12837).

What would be needed (to make both work together) might be a way to have chunked lazy tensors? i.e. lazy tensors which can be materialized in chunks? Kind of a to_eager, but sliced? This sounds possible, but I'm not sure how (especially regarding interactions with permutes). I guess it would need to be able to fallback to a full materialization when chunking is not possible (or not obvious).
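
To make that a bit more concrete, here is one possible shape for such an interface (purely hypothetical; iter_eager_chunks, the tag it reads, and the materialize callback are illustrative, not existing gguf-py API):

```python
from typing import Callable, Iterator
import numpy as np

def iter_eager_chunks(lazy_tensor,
                      materialize: Callable[[object], np.ndarray],
                      max_bytes: int = 64 << 20) -> Iterator[np.ndarray]:
    # Hypothetical "sliced to_eager": yield eager chunks when the pending graph is a
    # pure dtype cast of a known source, otherwise fall back to materializing once.
    cast = getattr(lazy_tensor, "_pure_cast", None)  # e.g. (source lazy tensor, target dtype)
    if cast is None:
        yield materialize(lazy_tensor)  # fallback: existing full materialization
        return
    src_lazy, dst_dtype = cast
    flat = materialize(src_lazy).reshape(-1)  # source stays in its original dtype
    step = max(1, max_bytes // np.dtype(dst_dtype).itemsize)
    for start in range(0, flat.size, step):
        yield flat[start:start + step].astype(dst_dtype, copy=False)
```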

This is kind of related to the file-range tracking idea at the end of #15648 (comment), because both problems need some kind of mapping between the input tensor(s) and the materialized tensor, but the chunked lazy tensors problem doesn't necessarily require that the data comes from a file. Both problems need to be able to fallback to the normal behavior when they can't handle a particular transformation (and I think the problematic transformations would be the same or very similar for both problems). They have different goals though (file-range tracking wants the biggest chunks possible, while lazy chunking wants a limited chunk size (and maybe not all chunk sizes are possible, especially for quantized data)).

…debug logs

  • New helper: gguf/stream_cast.py with write_cast(fp, src_arr, dst_dtype, chunk_mb) that writes src.astype(dst) in fixed-size chunks to cap peak RSS.
  • lazy.py:
    • tag LazyNumpyTensor.astype() results (_gguf_stream_cast, _gguf_stream_cast_dtype)
    • tofile() streams via write_cast when the node is a pure dtype cast; otherwise falls back.
  • env vars: GGUF_CAST_CHUNK_MB (default 64) and GGUF_STREAM_LOG (opt-in diagnostics).
  • gguf_writer.py: call write_cast directly when the tensor is a tagged pure cast. This keeps the benefit even if future changes bypass tofile() / use multi-threaded writes.
  • Alignment: preserve data_alignment by padding before/after writes.
  • Repro notes: Ubuntu 24.04 / Python 3.12 / NumPy 2.1; bloom-560m FP16→F32 conversion shows peak RSS reductions when chunking (e.g., 256→64→32→16 MiB) with small runtime trade-offs at smaller chunks. macOS run logs confirm [gguf-stream] activation as well.
  • Scope & limitations: only pure dtype casts; MoE stacking / complex transforms fall back.
  • Future work (separate RFC/PR): “chunked lazy tensors” and file-range tracking compatible with multi-threaded writes.

igloo58 (Author) commented Sep 1, 2025

@compilade Here’s a quick update after your last note. Thanks again for the thoughtful review!

What I changed in this PR (now pushed):
• Dropped the default chunk size to 64 MiB (configurable via GGUF_CAST_CHUNK_MB; GGUF_STREAM_LOG enables debug prints).
• Kept the lazy.py tagging of pure dtype casts (astype sets _gguf_stream_cast / _gguf_stream_cast_dtype) and the tofile() fallback that streams when it’s a simple cast.
• New writer-side path: gguf_writer.py now detects those tagged pure-cast tensors and calls a shared helper gguf/stream_cast.py::write_cast(...) to stream the cast even when tofile() isn’t used. This should keep the benefit with #12837 / multi-threaded write scenarios.
• Preserved data_alignment by padding before/after streamed writes; non-pure casts fall back to the original behavior automatically.
• Added opt-in debug logs. Example from a run:

[gguf-stream] streaming cast write: chunk=64 MiB; dst=float32; shape=(250880, 1024)

(That’s the big token_embd.weight in bloom-560m.)

Results I’m seeing
• Peak RSS drops at smaller chunks, as expected; small runtime trade-offs are visible at the smallest chunks (Python loop / I/O overhead), e.g. 256 MiB was faster than master in one case while 64 MiB/32 MiB were a bit slower.
• Verified on macOS and Linux with GNU time (/usr/bin/time -v). Happy to paste full numbers if useful.

On your concerns
• “Will break if tofile() goes away?” — Addressed by the writer-side streaming path; we still keep the tofile() streaming as a fallback.
• “Only beneficial in limited cases” — Agreed; the implementation scopes strictly to pure dtype casts (the common FP16→FP32 expansion). MoE stacking / permutes / other transforms transparently fall back.
• “Chunked lazy tensors / file-range tracking” — I’m planning a separate RFC/PR to explore a sliced to_eager/lazy-chunk planner and how it could line up with the file-range idea from #15648, with graceful fallback when transforms make chunking non-obvious.

Questions for you:
1. API shape/naming ok? (_gguf_stream_cast tag + stream_cast.write_cast helper)
2. Anything else I should measure (TriLM-3.9B, different vocab sizes) to round out the benchmarks?

Thanks again! Open to any tweaks you'd like as well.

except Exception:
    mb = 64
tgt_dtype = getattr(tensor, "_gguf_stream_cast_dtype", src_arr.dtype)
_stream_log(f"writer: streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")

Consider using logger.debug instead of reinventing conditional debug logging.

Suggested change
_stream_log(f"writer: streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")
logger.debug(f"streaming cast (chunk={mb} MiB) dst={tgt_dtype} shape={getattr(tensor, 'shape', '?')}")

Comment on lines +240 to +242
def _slog(msg: str) -> None:
    if os.environ.get("GGUF_STREAM_LOG"):
        print(f"[gguf-stream] {msg}", file=sys.stdout, flush=True)

compilade (Collaborator) commented Sep 2, 2025

The uses of _slog should also probably be replaced with logger.debug. Using an undocumented environment variable isn't very user-friendly. (unless the logs are not intended to be printed)

fout = self.fout[file_id]

# pop the first tensor info
# TODO: cleaner way to get the first key

There's no reason to remove that comment. There's nothing which attempted to fix the stated TODO.

# align to data_alignment before writing tensor data
self.write_padding(fout, fout.tell())

# --- writer-side streaming for pure dtype casts (survives when tofile() isn't used) ---

Maybe it would be simpler to keep using tofile. It would be more convenient at least. I think it's possible to make #12837 use tofile and that would remove the need to find an alternative to overriding tofile.

Comment on lines +249 to +252
setattr(out, "_gguf_stream_cast", True)
setattr(out, "_gguf_stream_cast_dtype", tgt)
# NEW: record the *source* lazy tensor for writer-side streaming
setattr(out, "_gguf_stream_cast_src", self)

Alternatively, a single attr could be used containing e.g. a tuple[LazyNumpyTensor, np.dtype], since they are expected to all exist when one of them does anyway.

Using three separate attrs seems excessive (especially since the existence one is redundant with the other ones existing).

This should also simplify (and remove the need for) most of the edge-case handling for missing values (e.g. the missing base array).

Comment on lines +290 to +292
# Install patches
LazyNumpyTensor.astype = _gguf_streaming_astype
LazyNumpyTensor.tofile = _gguf_streaming_tofile

Would it be cleaner to directly modify the source code of LazyNumpyTensor.astype and LazyNumpyTensor.tofile instead of patching them?

Unless you want this to be disable-able, in which case a subclass (although not sure what name to use) could also be appropriate, and then it could be used in LazyTorchTensor.numpy().

Are there cases where astype chunking shouldn't be used?

Assuming it's implemented correctly, I think the tag (used to detect whether to stream astype on chunks) will not be kept on transformations of the LazyNumpyTensor (because a new one is created to track the transformations), and so it should be safe in pretty much all cases.

end = min(start + ce, n)
# copy=False avoids an extra tmp when possible
chunk = flat[start:end].astype(dst, copy=False)
fout.write(mv(chunk).tobytes())

Alternatively, directly use the memoryview of the np.ndarray, which is at np.ndarray.data.

Suggested change
fout.write(mv(chunk).tobytes())
fout.write(chunk.data)

I don't know if it would handle non-contiguous strides correctly, though; in this case the previous .reshape(-1) makes it contiguous anyway, so this should work.

There's also tofile :)

Suggested change
fout.write(mv(chunk).tobytes())
chunk.tofile(fout)

No idea of the performance difference of these approaches, but I think .tobytes() should be avoided because it returns "a copy of the raw contents of data memory".

Comment on lines +261 to +266
# default chunk size: 64 MiB (can override via GGUF_CAST_CHUNK_MB)
try:
    mb = int(os.environ.get("GGUF_CAST_CHUNK_MB", "64") or "64")
except Exception:
    mb = 64
mb = max(1, mb)

Does it make sense to make this configurable at runtime? A default value should be fine here, I think?

Otherwise this is parsing environment variables at each written tensor. (very minor overhead, though)
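
One way to keep the knob but avoid re-parsing it for every tensor would be to resolve it once at import time (a sketch; the constant name is made up):

```python
import os

# Resolve the GGUF_CAST_CHUNK_MB override once at module import and reuse the
# constant afterwards, instead of reading the environment per written tensor.
try:
    CAST_CHUNK_MB = max(1, int(os.environ.get("GGUF_CAST_CHUNK_MB", "64")))
except ValueError:
    CAST_CHUNK_MB = 64
```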

compilade (Collaborator) commented Sep 3, 2025

“Chunked lazy tensors / file-range tracking” — I’m planning a separate RFC/PR to explore a sliced to_eager/lazy-chunk planner and how it could line up with the file-range idea from #15648 (comment), with graceful fallback when transforms make chunking non-obvious.

@igloo58
I've implemented some form of file-range tracking in #15727; it handles tensor stacking, and so it works with MoE models.

This could be relatively easily adapted to direct copies (e.g. with shutil.copyfileobj and/or os.copy_file_range without necessarily aligning with filesystem blocks).
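
For the direct-copy case, a minimal sketch of what that could look like (the helper and its arguments are illustrative; os.copy_file_range is Linux-only, so a bounded read/write loop is the fallback, and the destination file is assumed to be flushed since copy_file_range works on raw file descriptors):

```python
import os

def copy_range(src_path: str, dst, length: int, src_offset: int,
               bufsize: int = 16 << 20) -> None:
    # Copy `length` bytes starting at `src_offset` of the source file into the
    # already-positioned destination file object `dst`.
    with open(src_path, "rb") as src:
        if hasattr(os, "copy_file_range"):
            remaining, off = length, src_offset
            while remaining > 0:
                # kernel-side copy; can be near-free on copy-on-write filesystems
                n = os.copy_file_range(src.fileno(), dst.fileno(), remaining, offset_src=off)
                if n == 0:
                    break
                off += n
                remaining -= n
        else:
            src.seek(src_offset)
            remaining = length
            while remaining > 0:
                buf = src.read(min(bufsize, remaining))
                if not buf:
                    break
                dst.write(buf)
                remaining -= len(buf)
```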

I've thought a bit about the usefulness and relatedness of lazy chunking and direct copying. It seems like there are separate use cases:

  1. Converting to the same type (e.g. BF16 -> BF16)
    • This is probably the most common use-case
    • Lazy chunking offers no technical improvement and only additional complexity compared to direct copying through file-range tracking
    • direct copying has very minimal memory requirements, since the tensors do not need to be materialized at all.
  2. Converting to a different type (e.g. BF16 -> F16)
    • Direct copying is impossible
    • Materialization has to happen (not necessarily all at once, but it all needs to be processed)
    • Lazy chunking can be beneficial here
      • Full tracking of chunks is complicated, although it's similar to file-range tracking, but with a source tensor instead of a source file, and with the big difference that data changes do not invalidate the ranges, and that reshapes put constraints on the chunk size.
      • Partial tracking with a base tensor (with matching data, but with a different type) would however be invalidated on other data changes, and would behave more closely to the file-range tracking, but might require handling cross-PyTorch-Numpy data.

There's a limit to useful complexity, though, and maybe general lazy chunking is too complicated for what it brings. The simpler approach you're taking would be acceptable (tagging astype), but I think cf20725 makes it a bit more complicated than it needs to be (particularly in gguf_writer.py).

API shape/naming ok? (_gguf_stream_cast tag + stream_cast.write_cast helper)

I think it's fine. But I'm not sure a completely different file is required, though. This feels like something which could be handled directly in lazy tensors instead of being patched-in. (assuming the suggested simplifications are considered)

Another thing which is a major point of the lazy tensors is that they are (almost) completely transparently handled; which means there is currently no lazy-tensor-specific code in gguf_writer, which helps reducing mental load when reading the code and/or making modifications (even the parallel writing case transparently uses the memoryview from np.ndarray.data which is (fully) materialized only when used).

“Will break if tofile() goes away?”

If it's simpler (and I suspect it would be), I think I can make #12837 use tofile. That would avoid unnecessarily leaking lazy tensor handling in gguf_writer.py, and will likely make it easier to handle custom writing more cleanly in the long term.
