Bug: Uneven graph mode VRAM usage/split #1569

@frenzybiscuit

What happened?

When using graph mode with 4x RTX 3090s, I see the following:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     22671      C   ...ma.cpp/build/bin/llama-server            21420MiB |
|    1   N/A  N/A     22671      C   ...ma.cpp/build/bin/llama-server            21420MiB |
|    2   N/A  N/A     22671      C   ...ma.cpp/build/bin/llama-server            21424MiB |
|    3   N/A  N/A     22671      C   ...ma.cpp/build/bin/llama-server            23752MiB |
+-----------------------------------------------------------------------------------------+

I would like to increase my context size and/or ubatch, but I am going OOM because the VRAM split is not even.

Is there anything I can do about this? The VRAM usage is entirely from ik_llama.cpp. I do not run a GUI.

Using P2P drivers.
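
For reference, the extra usage on device 3 lines up with the buffers that only land on that device in the log below (the output layer and a larger compute buffer). A rough back-of-the-envelope check, using only numbers from the log:

```python
# Back-of-the-envelope check using the numbers reported further down in the log.
# Per-process CUDA context overhead is not reported there, so only the
# *difference* between device 3 and the others is expected to match closely.
per_device_model  = 13741.27   # MiB, "Estimated model buffer size per device"
per_device_kv     = 6400.00    # MiB, 25600 MiB KV cache split over 4 GPUs
compute_dev0_2    = 224.00     # MiB, compute buffer on devices 0-2
compute_dev3      = 533.00     # MiB, compute buffer on device 3
output_layer_dev3 = 2004.03    # MiB, "CUDA3 buffer size" (output layer)

base = per_device_model + per_device_kv + compute_dev0_2
extra_dev3 = (compute_dev3 - compute_dev0_2) + output_layer_dev3

print(f"per-device baseline (w/o CUDA context overhead): {base:.0f} MiB")  # ~20365 vs 21420 observed
print(f"expected extra on device 3 : {extra_dev3:.0f} MiB")                # ~2313 MiB
print(f"observed extra (nvidia-smi): {23752 - 21420} MiB")                 # 2332 MiB
```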

Name and Version

version: 4374 (b9a2ce4)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

INFO [                    main] build info | tid="126184120086528" timestamp=1775091865 build=4374 commit="b9a2ce46"
INFO [                    main] system info | tid="126184120086528" timestamp=1775091865 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
CUDA0: using device CUDA0 - 23721 MiB free
CUDA1: using device CUDA1 - 23721 MiB free
CUDA2: using device CUDA2 - 23721 MiB free
CUDA3: using device CUDA3 - 23721 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /nvme/llm/gguf/L3.1-70B-Animus-V14.0-Q6_K_attn8_hb16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_p f32              = 0.900000
llama_model_loader: - kv   3:                      general.sampling.temp f32              = 0.600000
llama_model_loader: - kv   4:                               general.name str              = L3.1 70B Animus V14.0
llama_model_loader: - kv   5:                            general.version str              = V14.0
llama_model_loader: - kv   6:                           general.finetune str              = Animus
llama_model_loader: - kv   7:                           general.basename str              = L3.1
llama_model_loader: - kv   8:                         general.size_label str              = 70B
llama_model_loader: - kv   9:                            general.license str              = llama3.1
llama_model_loader: - kv  10:                   general.base_model.count u32              = 1
llama_model_loader: - kv  11:                  general.base_model.0.name str              = Llama 3.1 70B Instruct
llama_model_loader: - kv  12:          general.base_model.0.organization str              = Meta Llama
llama_model_loader: - kv  13:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv  14:                               general.tags arr[str,6]       = ["finetune", "roleplay", "chat", "win...
llama_model_loader: - kv  15:                          llama.block_count u32              = 80
llama_model_loader: - kv  16:                       llama.context_length u32              = 131072
llama_model_loader: - kv  17:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv  18:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv  19:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv  20:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  21:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  22:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  23:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  24:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  25:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  37:               general.quantization_version u32              = 2
llama_model_loader: - kv  38:                          general.file_type u32              = 18
llama_model_loader: - type  f32:  162 tensors
llama_model_loader: - type q8_0:  160 tensors
llama_model_loader: - type q6_K:  400 tensors
llama_model_loader: - type bf16:    2 tensors
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_swa_pattern    = 1
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_n_group      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 70.554 B
llm_load_print_meta: model size       = 57.576 GiB (7.010 BPW) 
llm_load_print_meta: repeating layers = 53.662 GiB (6.734 BPW, 68.452 B parameters)
llm_load_print_meta: general.name     = L3.1 70B Animus V14.0
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
======================================= HAVE_FANCY_SIMD is defined
Oops: tensor with strange name rope_freqs.weight
------------------- Layer sizes:
Layer  0:    686.88,    320.00,   1006.88      320.00  MiB
Layer  1:    686.88,    320.00,   1006.88      320.00  MiB
Layer  2:    686.88,    320.00,   1006.88      320.00  MiB
Layer  3:    686.88,    320.00,   1006.88      320.00  MiB
Layer  4:    686.88,    320.00,   1006.88      320.00  MiB
Layer  5:    686.88,    320.00,   1006.88      320.00  MiB
Layer  6:    686.88,    320.00,   1006.88      320.00  MiB
Layer  7:    686.88,    320.00,   1006.88      320.00  MiB
Layer  8:    686.88,    320.00,   1006.88      320.00  MiB
Layer  9:    686.88,    320.00,   1006.88      320.00  MiB
Layer 10:    686.88,    320.00,   1006.88      320.00  MiB
Layer 11:    686.88,    320.00,   1006.88      320.00  MiB
Layer 12:    686.88,    320.00,   1006.88      320.00  MiB
Layer 13:    686.88,    320.00,   1006.88      320.00  MiB
Layer 14:    686.88,    320.00,   1006.88      320.00  MiB
Layer 15:    686.88,    320.00,   1006.88      320.00  MiB
Layer 16:    686.88,    320.00,   1006.88      320.00  MiB
Layer 17:    686.88,    320.00,   1006.88      320.00  MiB
Layer 18:    686.88,    320.00,   1006.88      320.00  MiB
Layer 19:    686.88,    320.00,   1006.88      320.00  MiB
Layer 20:    686.88,    320.00,   1006.88      320.00  MiB
Layer 21:    686.88,    320.00,   1006.88      320.00  MiB
Layer 22:    686.88,    320.00,   1006.88      320.00  MiB
Layer 23:    686.88,    320.00,   1006.88      320.00  MiB
Layer 24:    686.88,    320.00,   1006.88      320.00  MiB
Layer 25:    686.88,    320.00,   1006.88      320.00  MiB
Layer 26:    686.88,    320.00,   1006.88      320.00  MiB
Layer 27:    686.88,    320.00,   1006.88      320.00  MiB
Layer 28:    686.88,    320.00,   1006.88      320.00  MiB
Layer 29:    686.88,    320.00,   1006.88      320.00  MiB
Layer 30:    686.88,    320.00,   1006.88      320.00  MiB
Layer 31:    686.88,    320.00,   1006.88      320.00  MiB
Layer 32:    686.88,    320.00,   1006.88      320.00  MiB
Layer 33:    686.88,    320.00,   1006.88      320.00  MiB
Layer 34:    686.88,    320.00,   1006.88      320.00  MiB
Layer 35:    686.88,    320.00,   1006.88      320.00  MiB
Layer 36:    686.88,    320.00,   1006.88      320.00  MiB
Layer 37:    686.88,    320.00,   1006.88      320.00  MiB
Layer 38:    686.88,    320.00,   1006.88      320.00  MiB
Layer 39:    686.88,    320.00,   1006.88      320.00  MiB
Layer 40:    686.88,    320.00,   1006.88      320.00  MiB
Layer 41:    686.88,    320.00,   1006.88      320.00  MiB
Layer 42:    686.88,    320.00,   1006.88      320.00  MiB
Layer 43:    686.88,    320.00,   1006.88      320.00  MiB
Layer 44:    686.88,    320.00,   1006.88      320.00  MiB
Layer 45:    686.88,    320.00,   1006.88      320.00  MiB
Layer 46:    686.88,    320.00,   1006.88      320.00  MiB
Layer 47:    686.88,    320.00,   1006.88      320.00  MiB
Layer 48:    686.88,    320.00,   1006.88      320.00  MiB
Layer 49:    686.88,    320.00,   1006.88      320.00  MiB
Layer 50:    686.88,    320.00,   1006.88      320.00  MiB
Layer 51:    686.88,    320.00,   1006.88      320.00  MiB
Layer 52:    686.88,    320.00,   1006.88      320.00  MiB
Layer 53:    686.88,    320.00,   1006.88      320.00  MiB
Layer 54:    686.88,    320.00,   1006.88      320.00  MiB
Layer 55:    686.88,    320.00,   1006.88      320.00  MiB
Layer 56:    686.88,    320.00,   1006.88      320.00  MiB
Layer 57:    686.88,    320.00,   1006.88      320.00  MiB
Layer 58:    686.88,    320.00,   1006.88      320.00  MiB
Layer 59:    686.88,    320.00,   1006.88      320.00  MiB
Layer 60:    686.88,    320.00,   1006.88      320.00  MiB
Layer 61:    686.88,    320.00,   1006.88      320.00  MiB
Layer 62:    686.88,    320.00,   1006.88      320.00  MiB
Layer 63:    686.88,    320.00,   1006.88      320.00  MiB
Layer 64:    686.88,    320.00,   1006.88      320.00  MiB
Layer 65:    686.88,    320.00,   1006.88      320.00  MiB
Layer 66:    686.88,    320.00,   1006.88      320.00  MiB
Layer 67:    686.88,    320.00,   1006.88      320.00  MiB
Layer 68:    686.88,    320.00,   1006.88      320.00  MiB
Layer 69:    686.88,    320.00,   1006.88      320.00  MiB
Layer 70:    686.88,    320.00,   1006.88      320.00  MiB
Layer 71:    686.88,    320.00,   1006.88      320.00  MiB
Layer 72:    686.88,    320.00,   1006.88      320.00  MiB
Layer 73:    686.88,    320.00,   1006.88      320.00  MiB
Layer 74:    686.88,    320.00,   1006.88      320.00  MiB
Layer 75:    686.88,    320.00,   1006.88      320.00  MiB
Layer 76:    686.88,    320.00,   1006.88      320.00  MiB
Layer 77:    686.88,    320.00,   1006.88      320.00  MiB
Layer 78:    686.88,    320.00,   1006.88      320.00  MiB
Layer 79:    686.88,    320.00,   1006.88      320.00  MiB
Layer 80:   2004.00,    181.00,   2185.00 MiB (output layer)
--------------------------------------------------------------------------
Total   :  54950.00,  25781.00,  80731.00 MiB
Memory required for model tensors + cache: 82735 MiB
Memory available on all devices - compute: 89509 MiB
Setting default device in layer  0 to 0
Setting default device in layer  1 to 0
Setting default device in layer  2 to 0
Setting default device in layer  3 to 0
Setting default device in layer  4 to 0
Setting default device in layer  5 to 0
Setting default device in layer  6 to 0
Setting default device in layer  7 to 0
Setting default device in layer  8 to 0
Setting default device in layer  9 to 0
Setting default device in layer 10 to 0
Setting default device in layer 11 to 0
Setting default device in layer 12 to 0
Setting default device in layer 13 to 0
Setting default device in layer 14 to 0
Setting default device in layer 15 to 0
Setting default device in layer 16 to 0
Setting default device in layer 17 to 0
Setting default device in layer 18 to 0
Setting default device in layer 19 to 0
Setting default device in layer 20 to 0
Setting default device in layer 21 to 1
Setting default device in layer 22 to 1
Setting default device in layer 23 to 1
Setting default device in layer 24 to 1
Setting default device in layer 25 to 1
Setting default device in layer 26 to 1
Setting default device in layer 27 to 1
Setting default device in layer 28 to 1
Setting default device in layer 29 to 1
Setting default device in layer 30 to 1
Setting default device in layer 31 to 1
Setting default device in layer 32 to 1
Setting default device in layer 33 to 1
Setting default device in layer 34 to 1
Setting default device in layer 35 to 1
Setting default device in layer 36 to 1
Setting default device in layer 37 to 1
Setting default device in layer 38 to 1
Setting default device in layer 39 to 1
Setting default device in layer 40 to 1
Setting default device in layer 41 to 2
Setting default device in layer 42 to 2
Setting default device in layer 43 to 2
Setting default device in layer 44 to 2
Setting default device in layer 45 to 2
Setting default device in layer 46 to 2
Setting default device in layer 47 to 2
Setting default device in layer 48 to 2
Setting default device in layer 49 to 2
Setting default device in layer 50 to 2
Setting default device in layer 51 to 2
Setting default device in layer 52 to 2
Setting default device in layer 53 to 2
Setting default device in layer 54 to 2
Setting default device in layer 55 to 2
Setting default device in layer 56 to 2
Setting default device in layer 57 to 2
Setting default device in layer 58 to 2
Setting default device in layer 59 to 2
Setting default device in layer 60 to 2
Setting default device in layer 61 to 2
Setting default device in layer 62 to 3
Setting default device in layer 63 to 3
Setting default device in layer 64 to 3
Setting default device in layer 65 to 3
Setting default device in layer 66 to 3
Setting default device in layer 67 to 3
Setting default device in layer 68 to 3
Setting default device in layer 69 to 3
Setting default device in layer 70 to 3
Setting default device in layer 71 to 3
Setting default device in layer 72 to 3
Setting default device in layer 73 to 3
Setting default device in layer 74 to 3
Setting default device in layer 75 to 3
Setting default device in layer 76 to 3
Setting default device in layer 77 to 3
Setting default device in layer 78 to 3
Setting default device in layer 79 to 3
Setting default device in layer 80 to 3
llm_load_tensors: ggml ctx size =   13.53 MiB
================================ max_gpu = 4
Estimated model buffer size per device:
    Device 0:  13741.27 MiB
    Device 1:  13741.27 MiB
    Device 2:  13741.27 MiB
    Device 3:  13741.27 MiB
No tensors in buffer type CUDA0
No tensors in buffer type CUDA1
No tensors in buffer type CUDA2
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =  2004.00 MiB
llm_load_tensors: CUDA_Split buffer size = 54965.62 MiB
llm_load_tensors:      CUDA3 buffer size =  2004.03 MiB
...............................................................................................
llama_init_from_model: n_ctx         = 81920
llama_init_from_model: n_batch       = 1024
llama_init_from_model: n_ubatch      = 1024
llama_init_from_model: flash_attn    = 1
llama_init_from_model: attn_max_b    = 0
llama_init_from_model: fused_moe     = 1
llama_init_from_model: grouped er    = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad    = 1
llama_init_from_model: rope_cache    = 0
llama_init_from_model: graph_reuse   = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type   = f16
llama_init_from_model: sched_async   = 0
llama_init_from_model: ser           = -1, 0
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->2
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->3
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->2
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->3
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->3
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->0
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->1
 =========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->2
llama_kv_cache_init: CUDA_Split KV buffer size = 25600.23 MiB
llama_kv_cache_init: KV cache size per device:
llama_init_from_model: KV self size  = 25600.00 MiB, K (f16): 12800.00 MiB, V (f16): 12800.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_init_from_model:      CUDA0 compute buffer size =   224.00 MiB
llama_init_from_model:      CUDA1 compute buffer size =   224.00 MiB
llama_init_from_model:      CUDA2 compute buffer size =   224.00 MiB
llama_init_from_model:      CUDA3 compute buffer size =   533.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   192.01 MiB
llama_init_from_model: graph nodes  = 7688
llama_init_from_model: graph splits = 801
llama_init_from_model: enabling only_active_experts scheduling
=============================== NCCL main communicator initialized
=============================== NCCL pair communicators for 4 GPUs initialized
    Device 0:  6400 MiB
    Device 1:  6400 MiB
    Device 2:  6400 MiB
    Device 3:  6400 MiB
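
If there is no built-in way to compensate for the output layer in graph mode, would a manually biased split help? Below is a rough sketch of the ratios I would try, assuming --tensor-split / -ts is still accepted and actually applies to the CUDA_Split weight buffer in graph mode (I have not verified either):

```python
# Hypothetical rebalancing sketch: bias the weight split so device 3 gives up
# roughly as much of the CUDA_Split buffer as it gains from the output layer
# and its larger compute buffer. All numbers come from the log above; whether
# graph mode honors a manual --tensor-split this way is an assumption.
n_dev       = 4
split_total = 54965.62                    # MiB, CUDA_Split model buffer
extra_dev3  = 2004.03 + (533.0 - 224.0)   # MiB carried only by device 3

even   = split_total / n_dev                           # ~13741 MiB per device today
target = [even + extra_dev3 / 3] * 3 + [even - extra_dev3]
ratios = [round(t / split_total, 3) for t in target]
print(ratios)   # roughly [0.264, 0.264, 0.264, 0.208]
```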
