
Add support for GGUF files + KV cache from GGUF metadata#25

Merged
alvarobartt merged 24 commits into alvarobartt:main from diegovelilla:support-gguf-kv-cache
Mar 5, 2026

Conversation

@diegovelilla (Contributor) commented Jan 30, 2026

Description

This PR builds on top of the work of @vm7608 in #8. It aims to add support for GGUF files by:

  • Adding a --gguf flag to separate .safetensors from .gguf estimations.
  • Adding support for the new .gguf dtypes.
  • Parsing .gguf metadata to estimate both model and KV cache sizes.

(I'm starting this as a draft pull request to show the progress and explain decisions along the way.)


  • I have read and followed the guidelines in CONTRIBUTING.md.
  • This has been discussed over an issue or discussion.

@diegovelilla (Contributor, author) commented Jan 30, 2026

I added this as a mapping since, following the match/case approach, I wouldn't be able to reuse any cases. All conversions have been taken either from the official HF docs or from the type declarations in the official ggml library.


It might be interesting to merge both dtype-to-bytes-per-weight functions, or at least standardize them, since right now one returns int and the other float.
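For context, such a dtype-to-bytes-per-weight mapping could look roughly like the sketch below. This is hypothetical illustration, not the PR's actual code; the per-block sizes come from ggml's public quantization type definitions, where each quantized type packs a fixed block of weights together with its scales, making the per-weight cost fractional.

```python
# Hypothetical dtype -> bytes-per-weight mapping for GGUF tensors.
# Quantized ggml types store weights in fixed-size blocks, so the cost is
# bytes_per_block / weights_per_block.
GGUF_DTYPE_BYTES_PER_WEIGHT: dict[str, float] = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 34 / 32,    # blocks of 32 weights: 32 x int8 + 1 x f16 scale
    "Q4_0": 18 / 32,    # blocks of 32 weights: 16 packed bytes + 1 x f16 scale
    "Q4_K": 144 / 256,  # "K-quant" super-blocks of 256 weights
    "Q6_K": 210 / 256,
}

def bytes_for(dtype: str, param_count: int) -> int:
    """Estimate the on-disk size of `param_count` weights stored as `dtype`."""
    return int(param_count * GGUF_DTYPE_BYTES_PER_WEIGHT[dtype])
```

The fractional values are why a GGUF `bytes_count` naturally ends up as a float, unlike the Safetensors path where every dtype has an integer byte width.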

@alvarobartt (Owner)

Cool @diegovelilla, do you think we could at least temporarily move the changes to gguf.py so that we have the GGUF stuff in a separate file?

Warning

Not to tackle in this PR, but sharing for visibility on my short-term plans.

In the meantime I might think of a potential refactor to make adding other formats easier: a repository structure with a dedicated file for the CLI, another for the httpx functions, one per file type (Safetensors and GGUF), and other utils (and potentially also a lib that can be imported as from hf_mem import estimate; estimate(model_id=...)).

@diegovelilla (Contributor, author)

In this last commit I added the following:

  • New dataclasses GGUFMetadata, GGUFComponentMetadata, and GGUFDtypeMetadata, since in these bytes_count is of type float. In the future these can easily be merged with the SafetensorsMetadata dataclasses to avoid redundancy and create a more general set of dataclasses.

  • A fetching function that dynamically fetches more metadata when the initial chunk is not enough. I haven't seen any "metadata_length" field, so this has to be done dynamically.

  • A parsing function that takes the raw_metadata from the fetch and returns a GGUFMetadata object, following the C implementation in the official ggml docs.
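A minimal sketch of what those dataclasses could look like (hypothetical; the field names mirror the JSON output shown later in this thread, not necessarily the PR's actual definitions):

```python
from dataclasses import dataclass, field

@dataclass
class GGUFDtypeMetadata:
    param_count: int = 0
    bytes_count: float = 0.0  # float: quantized GGUF dtypes have fractional bytes/weight

@dataclass
class GGUFComponentMetadata:
    dtypes: dict[str, GGUFDtypeMetadata] = field(default_factory=dict)
    param_count: int = 0
    bytes_count: float = 0.0

@dataclass
class GGUFMetadata:
    components: dict[str, GGUFComponentMetadata] = field(default_factory=dict)
    param_count: int = 0
    bytes_count: float = 0.0
```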


I am not quite sure how the "components" layer of the JSON output is supposed to work, since most of the time it seems to default to "Transformer". As of now this code also uses a single component named "Transformer".

Also, the _read_xxx functions could be condensed into one read function that also takes a str for the type and the number of bytes, like _read(raw_metadata, offset, "I", 4) for reading a uint32. I wasn't sure which would be better, so I went with separate functions; however, they can easily be reduced to that more general function.


As of now I have tested it with a couple of models and it works. The only things left are to add the KV-cache support, merge it with the code in cli.py, and build the printing function.

@diegovelilla (Contributor, author)

Looking good? @alvarobartt

Screenshot from 2026-02-03 00-32-09

Last up is adding the KV-Cache printing + optimizing the fetch of multiple files.

@alvarobartt (Owner)

That's great @diegovelilla, looks neat!

What do you think if we show the table as a simple per-file listing without the details, and then add a CLI arg like --gguf-file ... to select a particular file within a repository and get the details per dtype? I'm just thinking out loud, but the whole table might be a bit "too much"?

I'm not a super active GGUF user, so let me know otherwise.

Comment on lines +121 to +127
# TODO: `recursive=true` shouldn't really be required unless it's a Diffusers
# models... I don't think this adds extra latency anyway
# NOTE: `recursive=true` is also need for GGUF file directories where each
# sharded quantization is inside a different folder like: Q2_K/model_Q2_K-0001-of-0048.gguf
@alvarobartt (Owner)

IMO we can remove both comments, no longer required!

Suggested change
# TODO: `recursive=true` shouldn't really be required unless it's a Diffusers
# models... I don't think this adds extra latency anyway
# NOTE: `recursive=true` is also need for GGUF file directories where each
# sharded quantization is inside a different folder like: Q2_K/model_Q2_K-0001-of-0048.gguf

@diegovelilla (Contributor, author)

This is the new formatting.


For multiple GGUF files.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf
Screenshot from 2026-02-04 15-41-52

For multiple files with --experimental.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf --experimental
Screenshot from 2026-02-04 15-54-44

For a single GGUF file (reusing the .safetensors print function). Notice that in this case the --gguf flag is optional, since you are already passing --gguf-file.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf 
Screenshot from 2026-02-04 15-43-04

For a single GGUF file with --experimental.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf --experimental
Screenshot from 2026-02-04 15-43-19

Also works with --json-output.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf --experimental --json-output
[{"model_id": "deepseek-llm-7b-chat.Q2_K.gguf", "revision": "main", "components": {"Transformer": {"dtypes": {"Q2_K": {"param_count": 1426063360, "bytes_count": 467927040}, "Q3_K": {"param_count": 5064622080, "bytes_count": 2176204800}, "F32": {"param_count": 249856, "bytes_count": 999424}, "Q6_K": {"param_count": 419430400, "bytes_count": 344064000}}, "param_count": 6910365696, "bytes_count": 2989195264}}, "param_count": 6910365696, "bytes_count": 2989195264, "max_model_len": 4096, "cache_size": 2013265920, "batch_size": 1, "cache_dtype": "F16"}]
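For reference, the cache_size above is consistent with the usual KV-cache formula. A sketch follows; the DeepSeek-LLM-7B shape used here (30 layers, 32 KV heads of head dim 128) is an assumption about that model's architecture, not something read from the PR:

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int = 1,
    dtype_bytes: int = 2,  # F16 cache
) -> int:
    # Leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * max_model_len * batch_size * dtype_bytes

# Assumed DeepSeek-LLM-7B-chat shape: 30 layers, 32 KV heads x 128 head dim,
# 4096 context, F16 cache -> matches the cache_size in the JSON output above.
size = kv_cache_bytes(num_layers=30, num_kv_heads=32, head_dim=128, max_model_len=4096)
```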

If this is okay, the only thing left would be to make the calls asynchronous, since fetching so many files is a bit slow.

It could also be useful to add it to the README?

@diegovelilla diegovelilla force-pushed the support-gguf-kv-cache branch from d3436cb to a746e7f on February 5, 2026 14:13
@diegovelilla (Contributor, author)

I finally added the asynchronous fetching to make the process faster.

I also added a new section to the README file, but since all images are taken from what I assume is your terminal, I left a placeholder that needs to be replaced with the actual screenshot.

Lastly, since these GGUF files usually have very long names, it is not unusual to go over MAX_DATA_LEN, and sometimes I get tables with wrong formatting without using --ignore-table-width. This comes with the --ignore-table-width warning, which states that "the model is longer than 64 chars so the table will be expanded to fit each row". The result is something like this:

Screenshot from 2026-02-05 15-08-33

With the --ignore-table-width flag it gets printed correctly. If this is how it should work and it is not a bug, then you can dismiss it, but I was curious, since from the code it sounded as if the table was going to reshape itself, while it actually just uses min(max_length, MAX_DATA_LEN).

For the rest, I have rebased the branch to be in sync with main, so it is ready to be merged after adding the screenshot to the README.

@diegovelilla diegovelilla marked this pull request as ready for review February 6, 2026 19:08
@alvarobartt (Owner)

Hey @diegovelilla, thanks a lot for the effort! It does make sense. Feel free to completely remove the --ignore-table-width flag in favour of doing that by default. I'm not sure if we can do that within this PR, given that you changed some of the code in print.py, or whether it'd be better to open another smaller PR to main that completely removes it and calculates the width dynamically when out of bounds. Up to you really, but I agree that defaulting to dynamic width and removing the flag might be cleaner 🤗

@diegovelilla (Contributor, author)

Hey @alvarobartt, should we then merge this PR as it is and create a new issue for the --ignore-table-width flag removal, given that as of now the tool still warns about possible unexpected table prints? I can create the issue explaining the problem and the changes, but right now I'm a bit busy and I don't know when I can work on it.

@alvarobartt (Owner) left a review comment

Hey again @diegovelilla, apologies for the delay, this is great!

Q: Do you think the --gguf flag is required? How common are repositories with both Safetensors and GGUF files? Couldn't we just skip the --gguf flag in favour of checking which of those files is present in file_paths when listing the files in the repository? Then for GGUF files within a Safetensors repository, I'd just warn the user that if they'd like to run the estimate for those they should provide --gguf-file, showing them the possible GGUF files in there. Thoughts?

Thanks again in advance, this feature is going to be much appreciated by the community 🤗

@diegovelilla (Contributor, author) commented Feb 20, 2026

Hey @alvarobartt, now it should work without the --gguf flag. Filtering by library on the HF Hub, only 6,860 repositories contain both GGUF and Safetensors files, so it is a pretty rare thing.

Now GGUF logic only applies if:

  1. --gguf-file flag has been set to a GGUF filepath.
  2. No Safetensors files have been found, but there are GGUF files.

When parsing a repo that contains both, a warning is triggered, reminding the user that if they want to estimate any GGUF file, they have to set the --gguf-file flag to the desired filepath. A list of the GGUF filepaths is also included in said warning. The execution then continues with the Safetensors estimations.
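The dispatch rules above could be sketched roughly as follows (hypothetical names and structure, not the PR's actual code):

```python
import warnings

def should_use_gguf(
    gguf_file,            # value of --gguf-file, or None
    safetensors_paths,    # .safetensors files found in the repo
    gguf_paths,           # .gguf files found in the repo
) -> bool:
    """Return True when the GGUF estimation path should run."""
    if gguf_file is not None:
        return True  # rule 1: an explicit --gguf-file always wins
    if not safetensors_paths and gguf_paths:
        return True  # rule 2: GGUF-only repository
    if safetensors_paths and gguf_paths:
        # Mixed repository: fall back to Safetensors, but point at the GGUF files.
        warnings.warn(
            "Repository contains both Safetensors and GGUF files; pass "
            f"--gguf-file <path> to estimate one of: {gguf_paths}"
        )
    return False
```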

Edit: Branch rebased over current main (Feb 20th).

README.md Outdated

## GGUF Files

By enabling the `--gguf` flag, you can estimate memory requirements for *.gguf* files. All files will be listed with their corresponding memory estimations. For a more in-depth report like the one used for *.safetensors* files (with information regarding weight dtypes), the flag `--gguf-file` can be used to estimate a single GGUF model. For sharded files, the path to any of the individual shards will work.
@alvarobartt (Owner)

Given that you recently removed the --gguf flag, should we update this blob here?

@diegovelilla (Contributor, author)

Yes, it is also missing the screenshot since all have been taken from what I assume is your terminal.

vm7608 and others added 7 commits February 23, 2026 23:11
- Added functionality to check for GGUF files in the CLI and print a report using `print_report_for_gguf`.
- Updated error handling to include GGUF files in the search criteria.
- Introduced new helper functions in `print.py` for formatting and displaying GGUF file reports, including grouping sharded files and adjusting table widths.
- Updated the `_bytes_to_gb` function with a `use_decimal` argument to match the Hugging Face file sizes.
- Updated `_print_header`, `_print_centered`, `_print_divider`, `_format_name`, and `_print_row` functions to include an optional `name_len` parameter for improved flexibility in formatting.
- Removed redundant GGUF-specific print functions, consolidating functionality into existing print methods.
- Adjusted the `print_report_for_gguf` function to utilize the refactored print methods, enhancing code maintainability.
@diegovelilla (Contributor, author)

Hey @alvarobartt, I already changed the README, rebased over the last changes regarding version printing, and added it to the GGUF logic. It also shouldn't fail the pre-commit checks now (my bad). Just missing the screenshot in the README.md for the command:

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf

@alvarobartt (Owner)

> Hey @alvarobartt, already changed the README, rebased over the last changes regarding version printing and added it to the gguf logic. Also now it shouldn't fail on precommit checks (mb). Just missing the screenshot from the README.md for the command:
>
> hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf

Awesome @diegovelilla! Here you go (I've included the --experimental flag in your command above)

image

Also feel free to position the GGUF section in the README.md on top of the Anthropic Skills entry instead of below 🤗

@diegovelilla (Contributor, author)

Should be done @alvarobartt

@alvarobartt (Owner) left a review comment

Thanks a lot for the effort and the patience @diegovelilla 🤗

I'll merge as-is, and then likely push a couple more commits on top before releasing, but ideally trying to release mid next-week!

@alvarobartt alvarobartt merged commit cd826ce into alvarobartt:main Mar 5, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

  • Q: ideal quantized models (e.g. Q6, Q4, Ternary)
  • [FEATURE] Estimate VRAM for GGUF files
  • [FEATURE] Estimate VRAM for local safetensors files

3 participants