Support GLM-4-0414 models based on piDack's mainline PR #333
Conversation
|
Okay, after some more testing it seems to be working with the CPU backend, but not with CUDA.
Q4_0 quantization success
custom="
# Token embedding and output tensors
token_embd\.weight=q4_0
output\.weight=q4_0
output_norm\.weight=q4_0
# TODO customize layers based on cosine similarity layer importance scores
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
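# For reference (illustrative, not captured from the run above): after the grep/sed
# filtering, $custom should expand to a single comma-separated rule string:
#   token_embd\.weight=q4_0,output\.weight=q4_0,output_norm\.weight=q4_0
# A quick sanity check before quantizing:
echo "custom-q rules: $custom"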
# wtf is: --ignore-imatrix-rules ?? doesn't exist?
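# Note: the two trailing positional arguments to llama-quantize below are the target
# quantization type (Q4_0) and the thread count (24).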
./build/bin/llama-quantize \
--token-embedding-type q4_0 \
--output-tensor-type q4_0 \
--custom-q "$custom" \
/mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-BF16-00001-of-00002.gguf \
/mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
Q4_0 \
24
.
.
.
[ 52/ 613] blk.5.attn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 53/ 613] blk.5.ffn_down.weight - [23040, 6144, 1, 1], type = bf16, converting to q4_0 .. size = 270.00 MiB -> 75.94 MiB
[ 54/ 613] blk.5.ffn_up.weight - [ 6144, 46080, 1, 1], type = bf16, converting to q4_0 .. size = 540.00 MiB -> 151.88 MiB
[ 55/ 613] blk.5.ffn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 56/ 613] blk.5.post_ffw_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 57/ 613] blk.5.post_attention_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 58/ 613] blk.5.attn_k.weight - [ 6144, 1024, 1, 1], type = bf16, converting to q4_0 .. size = 12.00 MiB -> 3.38 MiB
[ 59/ 613] blk.5.attn_output.weight - [ 6144, 6144, 1, 1], type = bf16, Using custom type q4_0 for tensor blk.5.attn_output.weight
converting to q4_0 .. size = 72.00 MiB -> 20.25 MiB
[ 60/ 613] blk.5.attn_q.weight - [ 6144, 6144, 1, 1], type = bf16, converting to q4_0 .. size = 72.00 MiB -> 20.25 MiB
[ 61/ 613] blk.5.attn_v.weight - [ 6144, 1024, 1, 1], type = bf16, converting to q4_0 .. size = 12.00 MiB -> 3.38 MiB
[ 62/ 613] blk.6.attn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 63/ 613] blk.6.ffn_down.weight - [23040, 6144, 1, 1], type = bf16, converting to q4_0 .. size = 270.00 MiB -> 75.94 MiB
[ 64/ 613] blk.6.ffn_up.weight - [ 6144, 46080, 1, 1], type = bf16, converting to q4_0 .. size = 540.00 MiB -> 151.88 MiB
[ 65/ 613] blk.6.ffn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 66/ 613] blk.6.post_ffw_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 67/ 613] blk.6.post_attention_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
[ 68/ 613] blk.6.attn_k.weight - [ 6144, 1024, 1, 1], type = bf16, converting to q4_0 .. size = 12.00 MiB -> 3.38 MiB
[ 69/ 613] blk.6.attn_output.weight - [ 6144, 6144, 1, 1], type = bf16, Using custom type q4_0 for tensor blk.6.attn_output.weight
converting to q4_0 .. size = 72.00 MiB -> 20.25 MiB
[ 70/ 613] blk.6.attn_q.weight - [ 6144, 6144, 1, 1], type = bf16, converting to q4_0 .. size = 72.00 MiB -> 20.25 MiB
[ 71/ 613] blk.6.attn_v.weight - [ 6144, 1024, 1, 1], type = bf16, converting to q4_0 .. size = 12.00 MiB -> 3.38 MiB
.
.
.
[ 613/ 613] output_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
llama_model_quantize_internal: model size = 63215.74 MB
llama_model_quantize_internal: quant size = 17783.55 MB
CUDA inference test fails
$ CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-cli \
--alias ubergarm/GLM-Z1-Rumination-32B-0414-Q4_0 \
--model /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
--ctx-size 8192 \
--parallel 1 \
--n-gpu-layers 62 \
--prompt "The meaning of life is" \
--threads 24
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.56 MiB
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 499.50 MiB
llm_load_tensors: CUDA0 buffer size = 17284.05 MiB
.................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1952.00 MiB
llama_new_context_with_model: KV self size = 1952.00 MiB, K (f16): 976.00 MiB, V (f16): 976.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 832.00 MiB
ggml_gallocr_reserve_n: reallocating CUDA_Host buffer from size 0.00 MiB to 28.01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 832.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 28.01 MiB
llama_new_context_with_model: graph nodes = 1835
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
The meaning of life is
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [ffn_inp-0] [6144 5 1 1]
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
llama_print_timings: load time = 1278.26 ms
llama_print_timings: sample time = 17.28 ms / 51 runs ( 0.34 ms per token, 2951.56 tokens per second)
llama_print_timings: prompt eval time = 44.63 ms / 5 tokens ( 8.93 ms per token, 112.04 tokens per second)
llama_print_timings: eval time = 1545.17 ms / 50 runs ( 30.90 ms per token, 32.36 tokens per second)
llama_print_timings: total time = 1630.87 ms / 55 tokens
CPU inference seems okay with quick test
NOTE: While it generates valid looking output, it behaves differently than running the same quant on mainline, e.g. no
$ ./build/bin/llama-cli \
--alias ubergarm/GLM-Z1-Rumination-32B-0414-Q4_0 \
--model /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
--ctx-size 8192 \
--parallel 1 \
--prompt "The meaning of life is" \
--threads 24
.
.
.
llm_load_print_meta: model size = 17.367 GiB (4.501 BPW)
llm_load_print_meta: repeating layers = 16.391 GiB (4.501 BPW, 31.279 B parameters)
llm_load_print_meta: general.name = GLM Z1 Rumination 32B 0414
llm_load_print_meta: BOS token = 151331 '[gMASK]'
llm_load_print_meta: EOS token = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token = 151329 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: CPU buffer size = 17783.55 MiB
.................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1952.00 MiB
llama_new_context_with_model: KV self size = 1952.00 MiB, K (f16): 976.00 MiB, V (f16): 976.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.58 MiB
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 832.01 MiB
llama_new_context_with_model: CPU compute buffer size = 832.01 MiB
llama_new_context_with_model: graph nodes = 1835
llama_new_context_with_model: graph splits = 1
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
The meaning of life is to find your gift. The
llama_print_timings: load time = 1421.56 ms
llama_print_timings: sample time = 2.23 ms / 6 runs ( 0.37 ms per token, 2696.63 tokens per second)
llama_print_timings: prompt eval time = 3502.11 ms / 5 tokens ( 700.42 ms per token, 1.43 tokens per second)
llama_print_timings: eval time = 5874.86 ms / 5 runs ( 1174.97 ms per token, 0.85 tokens per second)
llama_print_timings: total time = 9967.31 ms / 10 tokens
Not exactly sure, but there are a few possible issues, given I'm not too familiar with the code-base and mainline has diverged for some of this code:
Gonna take a break for now and maybe fuss with it some more later. |
|
Took a quick look and I think you're missing the |
Oh wow, thanks for taking a look! Right, I was being lazy and used the mainline branch to do the
It made me think to try the
Testing `Q4_0` quantized from this fork back on mainline llama.cpp branch PR#12957
$ git branch | grep '*'
* (HEAD detached at piDack/update_glm4z)
$ git rev-parse --short HEAD
5592c081
$ CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-cli \
--model /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
--ctx-size 8192 \
--parallel 1 \
--n-gpu-layers 62 \
--prompt "The meaning of life is" \
--threads 24
你是一个专业的深度研究助手,通过提供的工具与模拟浏览器交互,来帮助用户完成深度信息调研和报告撰写任务。今年是 2025 年。
<核心要求>
- 首先分解用户请求,得到包含多个子要求的列表
- 制定初始研究计划
- 进行多轮迭代搜索和页面浏览(at least 10 function calls):
* 根据已获得的信息调整研究计划和关键词
* 打开页面阅读,从发现的内容中识别新的关键概念/名词
* 从搜索结果中提取新的关键词继续搜索
* 访问并仔细阅读相关页面,识别新的关键概念/名词
<重要配置>
- 采用语言
* 搜索关键词:英语
* 思考:英语
<可调用的工具列表>
[{"name": "search", "description": "Execute a search query and return search results. Use this function when you need to find information about a specific topic.", "parameters": {"type": "object", "properties": {"query": {"type": "string", "description": "Search query string, use English words unless it is a proper name in Chinese"}}, "required": ["query"], "additionalProperties": false}}, {"name": "click", "description": "Click a link in the search results and navigate to the corresponding page. Use this function when you need to view detailed content of a specific search result.", "parameters": {"type": "object", "properties": {"link_id": {"type": "integer", "description": "The link ID to click (from the sequence number in search results)"}}, "required": ["link_id"], "additionalProperties": false}}, {"name": "open", "description": "Open a specific website. Get content from any website with its URL.", "parameters": {"type": "object", "properties": {"url": {"type": "string", "description": "The target website URL or domain"}}, "required": ["url"], "additionalProperties": false}}, {"name": "finish", "description": "Finish the task. Use this function when you have found the information you need.", "parameters": {"type": "object", "properties": {}, "additionalProperties": false}}]
The meaning of life is
<think>
So, the user's question is "The meaning of life is". Hmm, that's a pretty broad and deep question. It seems like it's expecting some kind of philosophical or existential answer. But I need to be careful because as an AI, I don't have personal beliefs or experiences; I just gather information from the web.
First, I should understand what the user is asking. Is it "What is the meaning of life?" or is it "What does 'the meaning of life' mean?" Wait, the way it's phrased, "The meaning of life is", it looks like the user is starting to state something, but it's incomplete. Maybe it's a typo or perhaps it's meant to be a question like "What is the meaning of life?".
I think it's safe to assume that the user is asking for the meaning of life. So, I need to find out what that is according to various sources on the web.
Alright, let's start with a web search. I can use the "search" action to look for information about the meaning of life.
So, my first function call would be:
</think>{"name": "search", "arguments": {"query": "meaning of life"}}
<observation>
【0†The Meaning of Life | The Official David Bowie Website†https://bowie.la/meaning-of-life/】
The Meaning Of Life is an album by David Bowie released in 1983.
【1†What is the meaning of life? - Quora†https://www.quora.com/What-is-the-meaning-of-life】
To answer this question accurately, I must first define what life is, or at least, what is not life. One way to define life is as a collection of molecules that are self-organized, and self-replicating. The question then becomes: what is the meaning of this self-organized, self-replicating, collection of molecules? In other words, what is the purpose of life? What is the function of life? Why does life exist? The answer to this question depends on whether life has a purpose. If life has a purpose, then life has meaning. If life does not have a purpose, then life has no meaning. So, does life have a purpose? This is where the debate begins. Some people believe that life has no purpose. They believe that life is a result of chance, and that there is no reason for life to exist.
.
.
.
</observation>
<think>
It's clear that the search engine isn't effectively filtering for scientific perspectives.
.
.
.
# seems to go on and on and on, looping |
|
Did you see ggml-org/llama.cpp#13021? |
|
I see, the PR that actually got merged was mainline |
|
@ubergarm seems like GLM is coming, are we ready? There is some work going on in the vLLM repo, I think. |
|
I believe ZzZzZzZzZzZz did a transformers PR already, but I haven't seen one on mainline lcpp yet, pretty sure. Getting hard to keep up haha... |
tl;dr:
I got stuck on this PR and figured I'd push it anyway, no pressure to look at it.
Status
This PR needs some more love. It is not working on the CUDA backend, but might be working on the CPU backend for a
THUDM/GLM-Z1-Rumination-32B-0414 bf16 GGUF converted using piDack's mainline branch.
Purpose
The goal of this PR is to incorporate changes made by piDack in mainline llama.cpp PR#12957 in order to support the recently updated THUDM/glm-4-0414 models.
Specifically, I was attempting to compute an imatrix for and quantize THUDM/GLM-Z1-Rumination-32B-0414, hoping to use the new cosine similarity layer importance scoring to design a lower-PPL quant.
Details
Download and convert using piDack's mainline branch (*NOTE*: I didn't include the python changes in this PR); a rough sketch of these two steps follows the list below.
1. Download Model
2. Quantize with mainline llama.cpp piDack branch
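For illustration only, roughly what these two steps look like (hedged: the exact commands, branch checkout, and paths from my run aren't reproduced here; huggingface-cli, convert_hf_to_gguf.py, and llama-quantize are the usual mainline tools, and the local directories are placeholders):
# 1. Download the safetensors model from Hugging Face
huggingface-cli download THUDM/GLM-Z1-Rumination-32B-0414 \
    --local-dir /mnt/raid/models/THUDM/GLM-Z1-Rumination-32B-0414
# 2. Convert to bf16 GGUF with piDack's mainline branch, then quantize
python convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-BF16.gguf \
    /mnt/raid/models/THUDM/GLM-Z1-Rumination-32B-0414
./build/bin/llama-quantize \
    /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-BF16.gguf \
    /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
    Q4_0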
CUDA fails: this PR with the `ik_llama.cpp` fork to calculate the imatrix on the bf16
CPU seems to work: this PR with the `ik_llama.cpp` fork to calculate the imatrix on the bf16 (a rough example invocation is sketched below)
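For reference, a minimal sketch of the kind of imatrix invocation being attempted (assumptions: the llama-imatrix binary built from this fork, a generic calibration.txt file, and a placeholder output name; add -ngl 62 for the CUDA attempt that fails, or leave it off for the CPU run):
./build/bin/llama-imatrix \
    --model /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-BF16-00001-of-00002.gguf \
    -f calibration.txt \
    -o GLM-Z1-Rumination-32B-0414.imatrix \
    --ctx-size 512 \
    --threads 24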
I'll skip ahead and try to quantize it without imatrix for now and see if it actually runs or not.