What happened?
llama-batched-bench fails with the following command whenever the batch size (parallel level) is greater than 2:
numactl -m 0 -C 0-127 ./llama-batched-bench -m /models/unsloth/Qwen3-235B-A22B-GGUF/Q4_K_M/00001.gguf -c 8192 -b 2048 -ub 512 -ngl 0 -npp 128 -ntg 128 -npl 1,2,4 --cache-type-k q8_0 --numa numactl --threads 64 --threads-batch 128 -fa -fmoe -amb 1 -ser 7,1 -mla 1 --no-mmap
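For what it's worth, here is a reduced command that I would expect to hit the same assert (untested assumption: the numactl pinning, -ngl and -mla flags should not matter, since this is a CPU-only build and MLA gets turned off for this architecture anyway, see the log below), keeping only the failing parallel level:

./llama-batched-bench -m /models/unsloth/Qwen3-235B-A22B-GGUF/Q4_K_M/00001.gguf -c 8192 -b 2048 -ub 512 -npp 128 -ntg 128 -npl 4 --cache-type-k q8_0 --threads 64 --threads-batch 128 -fa -fmoe -amb 1 -ser 7,1 --no-mmap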
Name and Version
build: e3fec17 (3667)
What operating system are you seeing the problem on?
Linux
Relevant log output
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1131 tensors from /models/unsloth/Qwen3-235B-A22B-GGUF/Q4_K_M/Qwen3-235B-A22B-Q4_K_M-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B
llama_model_loader: - kv 3: general.basename str = Qwen3-235B-A22B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 235B-A22B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 8: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 9: general.base_model.count u32 = 1
llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 235B A22B
llama_model_loader: - kv 11: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 13: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 14: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 15: qwen3moe.context_length u32 = 40960
llama_model_loader: - kv 16: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 17: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 18: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 19: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 20: qwen3moe.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 21: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 22: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 23: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 25: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 26: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 15
llama_model_loader: - kv 39: quantize.imatrix.file str = Qwen3-235B-A22B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 40: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B.txt
llama_model_loader: - kv 41: quantize.imatrix.entries_count i32 = 752
llama_model_loader: - kv 42: quantize.imatrix.chunks_count i32 = 32
llama_model_loader: - kv 43: split.no u16 = 0
llama_model_loader: - kv 44: split.tensors.count i32 = 1131
llama_model_loader: - kv 45: split.count u16 = 3
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q4_K: 567 tensors
llama_model_loader: - type q6_K: 93 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3moe
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 40960
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 94
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 16
llm_load_print_meta: n_embd_k_gqa = 512
llm_load_print_meta: n_embd_v_gqa = 512
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 12288
llm_load_print_meta: n_expert = 128
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 40960
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 235.094 B
llm_load_print_meta: model size = 132.386 GiB (4.837 BPW)
llm_load_print_meta: repeating layers = 131.584 GiB (4.833 BPW, 233.849 B parameters)
llm_load_print_meta: general.name = Qwen3-235B-A22B
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 1536
llm_load_tensors: ggml ctx size = 0.50 MiB
llm_load_tensors: CPU buffer size = 135562.96 MiB
....................................................................................................
=====================================================================
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
=====================================================================
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 1
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = 7, 1
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 1151.50 MiB
llama_new_context_with_model: KV self size = 1151.50 MiB, K (q8_0): 399.50 MiB, V (f16): 752.00 MiB
llama_new_context_with_model: CPU output buffer size = 2.32 MiB
llama_new_context_with_model: CPU compute buffer size = 304.75 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 942
Unable to find TSan function AnnotateHappensAfter.
Unable to find TSan function AnnotateHappensBefore.
Unable to find TSan function AnnotateIgnoreWritesBegin.
Unable to find TSan function AnnotateIgnoreWritesEnd.
Unable to find TSan function AnnotateNewMemory.
Unable to find TSan function __tsan_func_entry.
Unable to find TSan function __tsan_func_exit.
Warning: please export TSAN_OPTIONS='ignore_noninstrumented_modules=1' to avoid false positive reports from the OpenMP runtime!
main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 0, n_threads = 64, n_threads_batch = 128
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 128 | 128 | 1 | 256 | 1.778 | 71.99 | 5.578 | 22.95 | 7.357 | 34.80 |
| 128 | 128 | 2 | 512 | 2.265 | 113.01 | 7.968 | 32.13 | 10.233 | 50.03 |
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
[the two messages above are repeated and interleaved many times by the worker threads; duplicates trimmed]
libggml.so(+0x134d7)[0x725d77a3e4d7]
libggml.so(ggml_abort+0xd8)[0x725d77a3e468]
libggml.so(+0xcbf7da)[0x725d786ea7da]
libggml.so(+0x468f0a)[0x725d77e93f0a]
libggml.so(_Z19iqk_flash_attn_impliiiiiiiiiiiPKfPKvS2_S2_ffPfS3_S3_+0x405)[0x725d77d0a175]
libggml.so(iqk_flash_attn_noalibi+0x1419)[0x725d79cc7e29]
libggml.so(+0x3a347)[0x725d77a65347]
/usr/local/lib/libiomp5.so(__kmp_invoke_microtask+0x93)[0x725d7a145603]
/usr/local/lib/libiomp5.so(+0xca633)[0x725d7a0ca633]
/usr/local/lib/libiomp5.so(+0xc90ae)[0x725d7a0c90ae]
/usr/local/lib/libiomp5.so(+0x146c21)[0x725d7a146c21]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x725d7766aac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x725d776fc850]
Aborted (core dumped)
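For context on the failure itself: the backtrace points at iqk_flash_attn_impl, and GGML_ASSERT(fms.S[j] > 0) reads like a sanity check on the per-row softmax denominator of the online-softmax accumulation used by flash attention. My assumption, not verified against the actual iqk_mul_mat.cpp kernel, is that this sum can only end up at zero when every key position for that output row is masked out, which would fit the crash appearing only once more than two sequences are batched together. A generic sketch of that accumulation (illustration only, not the real iqk code):

// Generic online-softmax accumulation, loosely analogous to what a
// flash-attention kernel tracks per output row. This is an illustrative
// sketch, NOT the code at iqk_mul_mat.cpp:16600.
#include <cassert>
#include <cmath>
#include <vector>

float online_softmax_sum(const std::vector<float>& logits) {
    float M = -INFINITY; // running maximum of the logits seen so far
    float S = 0.0f;      // running sum of exp(logit - M)
    for (float x : logits) {
        if (x == -INFINITY) continue;  // masked position contributes nothing
        if (x > M) {
            S *= std::exp(M - x);      // rescale the previous sum to the new max
            M = x;
        }
        S += std::exp(x - M);
    }
    return S; // S == 0 only if every position was masked
}

int main() {
    // A fully masked row leaves the denominator at exactly 0, i.e. the
    // condition a check like GGML_ASSERT(S > 0) would trip on.
    std::vector<float> fully_masked(8, -INFINITY);
    assert(online_softmax_sum(fully_masked) == 0.0f);
    return 0;
}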