- 
                Notifications
    You must be signed in to change notification settings 
- Fork 30.9k
[GGUF] Refactor and decouple gguf checkpoint loading logic #34385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| @SunMarc @MekkCyber for quantization - but let me know if someone else is handling the GGUF integration and I'll ping them instead in future! | 
| The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. | 
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
| Feel free to ping @Isotr0py and me for anything related to gguf @Rocketknight1  ! | 
| @Rocketknight1 I'm diving into the code of gguf integration to fix some issues so feel free to ping me as well for related issues and PRs | 
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
| BTW, since the GGUF tests are very slow, I used a script to do quick tests for GGUF tensors renaming (which is modified in this PR): import torch
from huggingface_hub import hf_hub_download
from transformers import AutoConfig, AutoModelForCausalLM
from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
model_id = "duyntnet/Qwen1.5-MoE-A2.7B-Chat-imatrix-GGUF"
gguf_model_id = "Qwen1.5-MoE-A2.7B-Chat-IQ1_M.gguf"
config = AutoConfig.from_pretrained(model_id, gguf_file=gguf_model_id)
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config)
    expected_keys = set(model.state_dict().keys())
    gguf_path = hf_hub_download(model_id, filename=gguf_model_id)
    exact_keys = set(load_gguf_checkpoint(gguf_path, True, model)["tensors"].keys())
assert (
    expected_keys == exact_keys
), f"Following parameters are not initialized from GGUF file: {expected_keys - exact_keys}" | 
Signed-off-by: Isotr0py <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ! Thanks for this nice cleanup. It will make the maintenance + adding support to new models way easier. Did you check that if the tests are passing locally or not ? We can also trigger the slow tests if you prefer.
Signed-off-by: Isotr0py <[email protected]>
| Looks awesome ! More readable than it used to be ! Thanks for the clean up 🔥 | 
| if named_children := hf_model.named_children(): | ||
| for name, child in named_children: | ||
| sub_map = get_gguf_hf_weights_map(child, model_type, num_layers, qual_name=f"{qual_name}{name}.") | ||
| # Ignore the keys that are already in the main map to avoid overwriting | ||
| sub_map = {k: v for k, v in sub_map.items() if k not in gguf_to_hf_name_map} | ||
| gguf_to_hf_name_map.update(sub_map) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the whole recursion depth or only one layer of depth ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think depth=1 might be not enough, because some vision models based on AutoModelForVision2Seq might use AutoModelForCausalLM as language backbone. If they use BloomForCausalLM as backbone, this will require depth=2. (Although we won't face this case currently)
Since there is no significant slowdown in model loading between whole depth and depth=1, I think we can keep whole depth for code simplicity :)
BTW, about the model loading performance, the root bottleneck is the GGUFReader itself, because its implementation uses numpy memmap for metadata extraction incorrectly, which cause a terrible slowdown. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay thanks for the clarifications @Isotr0py 🔥
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the GGUFReader if you think there is a problem in the way it's handled I think it would be nice to open an issue in the gguf repo 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I have opened a PR in llama.cpp to improve the performance of GGUFReader: ggml-org/llama.cpp#10159 😄
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
| I have run slow tests for all architectures we supported (except falcon-40b) locally, and most of them have passed. Only  (Or we should re-organize the GGUF tests, because it's too messy 😅) | 
| Yes the  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM sorry for the delay, Christmas and new year vacation hit me! 🤗
| where N signifies the block number of a layer, and BB signifies the | ||
| attention/mlp layer components. | ||
| See "Standardized tensor names" in | ||
| https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details. | |
| https://github.com/ggerganov/ggml/blob/master/docs/gguf.md for details. We rely on the `get_tensor_name_map` to get the correct mapping! | 
…ce#34385) * draft load_gguf refactor * update Signed-off-by: Isotr0py <[email protected]> * remove llama mapping Signed-off-by: Isotr0py <[email protected]> * remove qwen2 mapping Signed-off-by: Isotr0py <[email protected]> * remove unused function Signed-off-by: Isotr0py <[email protected]> * deprecate stablelm mapping Signed-off-by: Isotr0py <[email protected]> * deprecate phi3 mapping Signed-off-by: Isotr0py <[email protected]> * deprecate t5 mapping Signed-off-by: Isotr0py <[email protected]> * deprecate bloom mapping Signed-off-by: Isotr0py <[email protected]> * fix bloom Signed-off-by: Isotr0py <[email protected]> * deprecate starcoder2 mapping Signed-off-by: Isotr0py <[email protected]> * deprecate gpt2 mapping Signed-off-by: Isotr0py <[email protected]> * deprecate mistral mapping Signed-off-by: Isotr0py <[email protected]> * deprecate nemotron mapping Signed-off-by: Isotr0py <[email protected]> * deprecate mamba mapping Signed-off-by: Isotr0py <[email protected]> * deprecate mamba mapping Signed-off-by: Isotr0py <[email protected]> * code format Signed-off-by: Isotr0py <[email protected]> * code format Signed-off-by: Isotr0py <[email protected]> * fix mamba Signed-off-by: Isotr0py <[email protected]> * fix qwen2moe Signed-off-by: Isotr0py <[email protected]> * remove qwen2moe mapping Signed-off-by: Isotr0py <[email protected]> * clean up Signed-off-by: Isotr0py <[email protected]> * remove falcon 7b map Signed-off-by: Isotr0py <[email protected]> * remove all ggml tensors mapping Signed-off-by: Isotr0py <[email protected]> * add comments Signed-off-by: Isotr0py <[email protected]> * update messages Signed-off-by: Isotr0py <[email protected]> * fix tensors in parsed parameters Signed-off-by: Isotr0py <[email protected]> * add gguf check Signed-off-by: Isotr0py <[email protected]> --------- Signed-off-by: Isotr0py <[email protected]>
What does this PR do?
Since there are more and more GGUF architectures have supported, the gguf loading logic (especially the
load_gguf_checkpointfunction) become complicated and couple tensor rename and tensor "reshape"/"permute" reverse.Therefore, I think there is a need to refactor the gguf checkpoint loading logic for better maintenance.
This PR is going to refactor and cleanup the gguf model loading logic with following methods:
get_gguf_hf_weights_mapfunction that has been used in vLLM to infergguf_weights_map(orGGUF_TENSOR_MAPPINGwe used intransformerscurrently) from model config withggufpackage automatically.GGUF_TENSOR_MAPPINGcan be inferred with above function, we can try to remove most of existing hardcoded mapping. And we don't need to ask contributor to add new mapping inGGUF_TENSOR_MAPPINGanymore at most of cases.Separate the GGUF tensors rename and "reshape"/"permute" reverse logic.Since vLLM and transformers have a different model weights loading logic, it wiil cost some time for me to figure out a better method to finish these works. So I marked this PR as a draft.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.