
Add support for GGUF files + KV cache from GGUF metadata#25

Merged
alvarobartt merged 24 commits into alvarobartt:main from diegovelilla:support-gguf-kv-cache
Mar 5, 2026

Conversation

@diegovelilla (Contributor) commented Jan 30, 2026

Description

This PR builds on top of the work of @vm7608 in #8. It aims to add support for GGUF files by:

  • Adding a --gguf flag to separate .safetensors from .gguf estimations.
  • Adding support for the new .gguf dtypes.
  • Parsing .gguf metadata to estimate both model and KV cache sizes.

(I'm starting this as a draft pull request to show the progress and explain decisions along the way.)


  • I have read and followed the guidelines in CONTRIBUTING.md.
  • This has been discussed over an issue or discussion.

@diegovelilla (Contributor, author) commented Jan 30, 2026

I added this as a mapping since, following the match/case approach, I wouldn't be able to reuse any cases. All conversions have been taken either from the official HF docs or from the type declarations in the official ggml library.


It might be interesting to merge both dtype-to-bytes-per-weight functions, or at least standardize them, since right now one returns int and the other float.
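For context, such a dtype-to-bytes-per-weight mapping could look roughly like the sketch below. This is hypothetical illustration, not the PR's actual code; the per-block sizes come from ggml's public quantization type definitions, where each quantized type packs a fixed block of weights together with its scales, making the per-weight cost fractional.

```python
# Hypothetical dtype -> bytes-per-weight mapping for GGUF tensors.
# Quantized ggml types store weights in fixed-size blocks, so the cost is
# bytes_per_block / weights_per_block.
GGUF_DTYPE_BYTES_PER_WEIGHT: dict[str, float] = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 34 / 32,    # blocks of 32 weights: 32 x int8 + 1 x f16 scale
    "Q4_0": 18 / 32,    # blocks of 32 weights: 16 packed bytes + 1 x f16 scale
    "Q4_K": 144 / 256,  # "K-quant" super-blocks of 256 weights
    "Q6_K": 210 / 256,
}

def bytes_for(dtype: str, param_count: int) -> int:
    """Estimate the on-disk size of `param_count` weights stored as `dtype`."""
    return int(param_count * GGUF_DTYPE_BYTES_PER_WEIGHT[dtype])
```

The fractional values are why a GGUF `bytes_count` naturally ends up as a float, unlike the Safetensors path where every dtype has an integer byte width.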

@alvarobartt (Owner)

Cool @diegovelilla, do you think we could at least temporarily move the changes to gguf.py so that we have the GGUF stuff in a separate file?

Warning

Not to tackle in this PR, but sharing for visibility on my short-term plans.

In the meantime I might think of a potential refactor to make adding other formats easier: a repository structure with a dedicated file for the CLI, another for the httpx functions, one per file type (Safetensors and GGUF), and other utils (and potentially also a lib that can be imported as from hf_mem import estimate; estimate(model_id=...)).

@diegovelilla (Contributor, author)

In this last commit I added the following:

  • New dataclasses GGUFMetadata, GGUFComponentMetadata, and GGUFDtypeMetadata, since in these bytes_count is of type float. In the future these can easily be merged with the SafetensorsMetadata dataclasses to avoid redundancy and create a more general set of dataclasses.

  • A fetching function that dynamically fetches more metadata when the initial chunk is not enough. I haven't seen any "metadata_length" field, so this has to be done dynamically.

  • A parsing function that takes the raw_metadata from the fetch and returns a GGUFMetadata object, following the C implementation in the official ggml docs.
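A minimal sketch of what those dataclasses could look like (hypothetical; the field names mirror the JSON output shown later in this thread, not necessarily the PR's actual definitions):

```python
from dataclasses import dataclass, field

@dataclass
class GGUFDtypeMetadata:
    param_count: int = 0
    bytes_count: float = 0.0  # float: quantized GGUF dtypes have fractional bytes/weight

@dataclass
class GGUFComponentMetadata:
    dtypes: dict[str, GGUFDtypeMetadata] = field(default_factory=dict)
    param_count: int = 0
    bytes_count: float = 0.0

@dataclass
class GGUFMetadata:
    components: dict[str, GGUFComponentMetadata] = field(default_factory=dict)
    param_count: int = 0
    bytes_count: float = 0.0
```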


I am not quite sure how the "components" layer of the JSON output is supposed to work, since most of the time it seems to default to "Transformer". As of now this code also uses a single component named "Transformer".

Also, the _read_xxx functions could be condensed into one read function that also takes a str for the type and the number of bytes, like _read(raw_metadata, offset, "I", 4) for reading a uint32. I wasn't sure which would be better, so I went with separate functions; however, they can easily be reduced to that more general function.


As of now I have tested it with a couple of models and it works. The only things left are to add the KV-cache support, merge it with the code in cli.py, and build the printing function.

@diegovelilla (Contributor, author)

Looking good? @alvarobartt

Screenshot from 2026-02-03 00-32-09

Last up is adding the KV-Cache printing + optimizing the fetch of multiple files.

@alvarobartt (Owner)

That's great @diegovelilla, looks neat!

What do you think if we show the table as a simple per-file listing without the details, and then add a CLI arg like --gguf-file ... to select a particular file within a repository and get the details per dtype? I'm just thinking out loud, but the whole table might be a bit "too much"?

I'm not a super active GGUF user, so let me know otherwise.

Comment on lines +121 to +127
# TODO: `recursive=true` shouldn't really be required unless it's a Diffusers
# models... I don't think this adds extra latency anyway
# NOTE: `recursive=true` is also need for GGUF file directories where each
# sharded quantization is inside a different folder like: Q2_K/model_Q2_K-0001-of-0048.gguf
@alvarobartt (Owner)

IMO we can remove both comments, no longer required!

Suggested change
# TODO: `recursive=true` shouldn't really be required unless it's a Diffusers
# models... I don't think this adds extra latency anyway
# NOTE: `recursive=true` is also need for GGUF file directories where each
# sharded quantization is inside a different folder like: Q2_K/model_Q2_K-0001-of-0048.gguf

@diegovelilla (Contributor, author)

This is the new formatting.


For multiple GGUF files.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf
Screenshot from 2026-02-04 15-41-52

For multiple files with --experimental.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf --experimental
Screenshot from 2026-02-04 15-54-44

For a single GGUF file (reusing the .safetensors print function). Notice that in this case the --gguf flag is optional, since you are already passing --gguf-file.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf 
Screenshot from 2026-02-04 15-43-04

For a single GGUF file with --experimental.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf --experimental
Screenshot from 2026-02-04 15-43-19

Also works with --json-output.

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf --experimental --json-output
[{"model_id": "deepseek-llm-7b-chat.Q2_K.gguf", "revision": "main", "components": {"Transformer": {"dtypes": {"Q2_K": {"param_count": 1426063360, "bytes_count": 467927040}, "Q3_K": {"param_count": 5064622080, "bytes_count": 2176204800}, "F32": {"param_count": 249856, "bytes_count": 999424}, "Q6_K": {"param_count": 419430400, "bytes_count": 344064000}}, "param_count": 6910365696, "bytes_count": 2989195264}}, "param_count": 6910365696, "bytes_count": 2989195264, "max_model_len": 4096, "cache_size": 2013265920, "batch_size": 1, "cache_dtype": "F16"}]
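For reference, the cache_size above is consistent with the usual KV-cache formula. A sketch follows; the DeepSeek-LLM-7B shape used here (30 layers, 32 KV heads of head dim 128) is an assumption about that model's architecture, not something read from the PR:

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int = 1,
    dtype_bytes: int = 2,  # F16 cache
) -> int:
    # Leading factor of 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * max_model_len * batch_size * dtype_bytes

# Assumed DeepSeek-LLM-7B-chat shape: 30 layers, 32 KV heads x 128 head dim,
# 4096 context, F16 cache -> matches the cache_size in the JSON output above.
size = kv_cache_bytes(num_layers=30, num_kv_heads=32, head_dim=128, max_model_len=4096)
```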

If this is okay, the only thing left would be to make the calls asynchronous, since fetching so many files is a bit slow.

It could also be useful to add it to the README?

@diegovelilla diegovelilla force-pushed the support-gguf-kv-cache branch from d3436cb to a746e7f on February 5, 2026 14:13
@diegovelilla (Contributor, author)

I finally added the asynchronous fetching to make the process faster.

I also added a new section to the README file, but since all images are taken from what I assume is your terminal, I left a placeholder that needs to be replaced with the actual screenshot.

Lastly, since these GGUF files usually have very long names, it is not unusual to go over MAX_DATA_LEN, and sometimes I get tables with wrong formatting without using --ignore-table-width. This comes with the --ignore-table-width warning, which states that "the model is longer than 64 chars so the table will be expanded to fit each row". The result is something like this:

Screenshot from 2026-02-05 15-08-33

With the --ignore-table-width flag it gets printed correctly. If this is how it should work and it is not a bug, then you can dismiss it, but I was curious, since from the code it sounded as if the table was going to reshape itself, while it actually just uses min(max_length, MAX_DATA_LEN).

For the rest, I have rebased the branch to be in sync with main, so it is ready to be merged after adding the screenshot to the README.

@diegovelilla diegovelilla marked this pull request as ready for review February 6, 2026 19:08
@alvarobartt (Owner)

Hey @diegovelilla, thanks a lot for the effort! It does make sense. Feel free to completely remove the --ignore-table-width flag in favour of doing that by default. I'm not sure if we can do that within this PR, given that you changed some of the code in print.py, or whether it'd be better to open another smaller PR to main that completely removes it and calculates the width dynamically when out of bounds. Up to you really, but I agree that defaulting to dynamic width and removing the flag might be cleaner 🤗

@diegovelilla (Contributor, author)

Hey @alvarobartt, should we then merge this PR as it is and create a new issue for the --ignore-table-width flag removal, given that as of now the tool still warns about possible unexpected table prints? I can create the issue explaining the problem and the changes, but right now I'm a bit busy and I don't know when I can work on it.

@alvarobartt (Owner) left a review comment

Hey again @diegovelilla, apologies for the delay, this is great!

Q: Do you think the --gguf flag is required? How common are repositories with both Safetensors and GGUF files? Couldn't we just skip the --gguf flag in favour of checking which of those files is present in file_paths when listing the files in the repository? Then for GGUF files within a Safetensors repository, I'd just warn the user that if they'd like to run the estimate for those they should provide --gguf-file, showing them the possible GGUF files in there. Thoughts?

Thanks again in advance, this feature is going to be much appreciated by the community 🤗

@diegovelilla (Contributor, author) commented Feb 20, 2026

Hey @alvarobartt, now it should work without the --gguf flag. Filtering by library on the HF Hub, only 6,860 repositories contain both GGUF and Safetensors files, so it is a pretty rare thing.

Now GGUF logic only applies if:

  1. --gguf-file flag has been set to a GGUF filepath.
  2. No Safetensors files have been found, but there are GGUF files.

When parsing a repo that contains both, a warning is triggered, reminding the user that if they want to estimate any GGUF file, they have to set the --gguf-file flag to the desired filepath. A list of the GGUF filepaths is also included in said warning. The execution then continues with the Safetensors estimations.
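The dispatch rules above could be sketched roughly as follows (hypothetical names and structure, not the PR's actual code):

```python
import warnings

def should_use_gguf(
    gguf_file,            # value of --gguf-file, or None
    safetensors_paths,    # .safetensors files found in the repo
    gguf_paths,           # .gguf files found in the repo
) -> bool:
    """Return True when the GGUF estimation path should run."""
    if gguf_file is not None:
        return True  # rule 1: an explicit --gguf-file always wins
    if not safetensors_paths and gguf_paths:
        return True  # rule 2: GGUF-only repository
    if safetensors_paths and gguf_paths:
        # Mixed repository: fall back to Safetensors, but point at the GGUF files.
        warnings.warn(
            "Repository contains both Safetensors and GGUF files; pass "
            f"--gguf-file <path> to estimate one of: {gguf_paths}"
        )
    return False
```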

Edit: Branch rebased over current main (Feb 20th).

README.md Outdated

## GGUF Files

By enabling the `--gguf` flag, you can estimate memory requirements for *.gguf* files. All files will be listed with their corresponding memory estimations. For a more in-depth report like the one used for *.safetensors* files (with information regarding weight dtypes), the flag `--gguf-file` can be used to estimate a single GGUF model. For sharded files, the path to any of the individual shards will work.
@alvarobartt (Owner)

Given that you recently removed the --gguf flag, should we update this blob here?

@diegovelilla (Contributor, author)

Yes, it is also missing the screenshot since all have been taken from what I assume is your terminal.

vm7608 and others added 7 commits February 23, 2026 23:11
- Added functionality to check for GGUF files in the CLI and print a report using `print_report_for_gguf`.
- Updated error handling to include GGUF files in the search criteria.
- Introduced new helper functions in `print.py` for formatting and displaying GGUF file reports, including grouping sharded files and adjusting table widths.
- Updated the `_bytes_to_gb` function with a `use_decimal` argument to match the Hugging Face file sizes.
- Updated `_print_header`, `_print_centered`, `_print_divider`, `_format_name`, and `_print_row` functions to include an optional `name_len` parameter for improved flexibility in formatting.
- Removed redundant GGUF-specific print functions, consolidating functionality into existing print methods.
- Adjusted the `print_report_for_gguf` function to utilize the refactored print methods, enhancing code maintainability.
@diegovelilla (Contributor, author)

Hey @alvarobartt, I already changed the README, rebased over the last changes regarding version printing, and added it to the GGUF logic. It also shouldn't fail the pre-commit checks now (my bad). Just missing the screenshot in the README.md for the command:

hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf

@alvarobartt (Owner)

> Hey @alvarobartt, already changed the README, rebased over the last changes regarding version printing and added it to the gguf logic. Also now it shouldn't fail on precommit checks (mb). Just missing the screenshot from the README.md for the command:
>
> hf-mem --model-id TheBloke/deepseek-llm-7B-chat-GGUF --gguf-file deepseek-llm-7b-chat.Q2_K.gguf

Awesome @diegovelilla! Here you go (I've included the --experimental flag in your command above)

image

Also feel free to position the GGUF section in the README.md on top of the Anthropic Skills entry instead of below 🤗

@diegovelilla (Contributor, author)

Should be done @alvarobartt

@alvarobartt (Owner) left a review comment

Thanks a lot for the effort and the patience @diegovelilla 🤗

I'll merge as-is, and then likely push a couple more commits on top before releasing, but ideally trying to release mid next-week!

@alvarobartt alvarobartt merged commit cd826ce into alvarobartt:main Mar 5, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

  • Q: ideal quantized models (e.g. Q6, Q4, Ternary)
  • [FEATURE] Estimate VRAM for GGUF files
  • [FEATURE] Estimate VRAM for local safetensors files

3 participants