Bug: Uneven graph mode vram usage/split #1569
Description
What happened?
When using graph mode with 4x RTX 3090s, I see the following:
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 22671 C ...ma.cpp/build/bin/llama-server 21420MiB |
| 1 N/A N/A 22671 C ...ma.cpp/build/bin/llama-server 21420MiB |
| 2 N/A N/A 22671 C ...ma.cpp/build/bin/llama-server 21424MiB |
| 3 N/A N/A 22671 C ...ma.cpp/build/bin/llama-server 23752MiB |
+-----------------------------------------------------------------------------------------+
I would like to increase my context size and/or ubatch, but I go OOM because the VRAM split is not even.
Is there anything I can do about this? The VRAM usage is entirely from ik_llama.cpp; I do not run a GUI.
I am using the P2P drivers.
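For reference, the kind of launch I am using, reconstructed from the log below, looks roughly like this (a sketch, not my exact command; the --tensor-split weights on the last line are hypothetical, and I do not know whether they are even honored in graph split mode):

./build/bin/llama-server \
    -m /nvme/llm/gguf/L3.1-70B-Animus-V14.0-Q6_K_attn8_hb16.gguf \
    -sm graph -fa -ngl 81 -c 81920 -b 1024 -ub 1024 \
    -ts 21,21,21,18    # hypothetical weights: give device 3 fewer layers, since it also ends up with the output layer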
Name and Version
version: 4374 (b9a2ce4)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
INFO [ main] build info | tid="126184120086528" timestamp=1775091865 build=4374 commit="b9a2ce46"
INFO [ main] system info | tid="126184120086528" timestamp=1775091865 n_threads=16 n_threads_batch=-1 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24126 MiB
CUDA0: using device CUDA0 - 23721 MiB free
CUDA1: using device CUDA1 - 23721 MiB free
CUDA2: using device CUDA2 - 23721 MiB free
CUDA3: using device CUDA3 - 23721 MiB free
llama_model_loader: loaded meta data with 39 key-value pairs and 724 tensors from /nvme/llm/gguf/L3.1-70B-Animus-V14.0-Q6_K_attn8_hb16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.900000
llama_model_loader: - kv 3: general.sampling.temp f32 = 0.600000
llama_model_loader: - kv 4: general.name str = L3.1 70B Animus V14.0
llama_model_loader: - kv 5: general.version str = V14.0
llama_model_loader: - kv 6: general.finetune str = Animus
llama_model_loader: - kv 7: general.basename str = L3.1
llama_model_loader: - kv 8: general.size_label str = 70B
llama_model_loader: - kv 9: general.license str = llama3.1
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Llama 3.1 70B Instruct
llama_model_loader: - kv 12: general.base_model.0.organization str = Meta Llama
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_model_loader: - kv 14: general.tags arr[str,6] = ["finetune", "roleplay", "chat", "win...
llama_model_loader: - kv 15: llama.block_count u32 = 80
llama_model_loader: - kv 16: llama.context_length u32 = 131072
llama_model_loader: - kv 17: llama.embedding_length u32 = 8192
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 19: llama.attention.head_count u32 = 64
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: llama.attention.key_length u32 = 128
llama_model_loader: - kv 24: llama.attention.value_length u32 = 128
llama_model_loader: - kv 25: llama.vocab_size u32 = 128256
llama_model_loader: - kv 26: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 27: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 28: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 31: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - kv 38: general.file_type u32 = 18
llama_model_loader: - type f32: 162 tensors
llama_model_loader: - type q8_0: 160 tensors
llama_model_loader: - type q6_K: 400 tensors
llama_model_loader: - type bf16: 2 tensors
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_n_group = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q6_K
llm_load_print_meta: model params = 70.554 B
llm_load_print_meta: model size = 57.576 GiB (7.010 BPW)
llm_load_print_meta: repeating layers = 53.662 GiB (6.734 BPW, 68.452 B parameters)
llm_load_print_meta: general.name = L3.1 70B Animus V14.0
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
======================================= HAVE_FANCY_SIMD is defined
Oops: tensor with strange name rope_freqs.weight
------------------- Layer sizes:
Layer 0: 686.88, 320.00, 1006.88 320.00 MiB
Layer 1: 686.88, 320.00, 1006.88 320.00 MiB
Layer 2: 686.88, 320.00, 1006.88 320.00 MiB
Layer 3: 686.88, 320.00, 1006.88 320.00 MiB
Layer 4: 686.88, 320.00, 1006.88 320.00 MiB
Layer 5: 686.88, 320.00, 1006.88 320.00 MiB
Layer 6: 686.88, 320.00, 1006.88 320.00 MiB
Layer 7: 686.88, 320.00, 1006.88 320.00 MiB
Layer 8: 686.88, 320.00, 1006.88 320.00 MiB
Layer 9: 686.88, 320.00, 1006.88 320.00 MiB
Layer 10: 686.88, 320.00, 1006.88 320.00 MiB
Layer 11: 686.88, 320.00, 1006.88 320.00 MiB
Layer 12: 686.88, 320.00, 1006.88 320.00 MiB
Layer 13: 686.88, 320.00, 1006.88 320.00 MiB
Layer 14: 686.88, 320.00, 1006.88 320.00 MiB
Layer 15: 686.88, 320.00, 1006.88 320.00 MiB
Layer 16: 686.88, 320.00, 1006.88 320.00 MiB
Layer 17: 686.88, 320.00, 1006.88 320.00 MiB
Layer 18: 686.88, 320.00, 1006.88 320.00 MiB
Layer 19: 686.88, 320.00, 1006.88 320.00 MiB
Layer 20: 686.88, 320.00, 1006.88 320.00 MiB
Layer 21: 686.88, 320.00, 1006.88 320.00 MiB
Layer 22: 686.88, 320.00, 1006.88 320.00 MiB
Layer 23: 686.88, 320.00, 1006.88 320.00 MiB
Layer 24: 686.88, 320.00, 1006.88 320.00 MiB
Layer 25: 686.88, 320.00, 1006.88 320.00 MiB
Layer 26: 686.88, 320.00, 1006.88 320.00 MiB
Layer 27: 686.88, 320.00, 1006.88 320.00 MiB
Layer 28: 686.88, 320.00, 1006.88 320.00 MiB
Layer 29: 686.88, 320.00, 1006.88 320.00 MiB
Layer 30: 686.88, 320.00, 1006.88 320.00 MiB
Layer 31: 686.88, 320.00, 1006.88 320.00 MiB
Layer 32: 686.88, 320.00, 1006.88 320.00 MiB
Layer 33: 686.88, 320.00, 1006.88 320.00 MiB
Layer 34: 686.88, 320.00, 1006.88 320.00 MiB
Layer 35: 686.88, 320.00, 1006.88 320.00 MiB
Layer 36: 686.88, 320.00, 1006.88 320.00 MiB
Layer 37: 686.88, 320.00, 1006.88 320.00 MiB
Layer 38: 686.88, 320.00, 1006.88 320.00 MiB
Layer 39: 686.88, 320.00, 1006.88 320.00 MiB
Layer 40: 686.88, 320.00, 1006.88 320.00 MiB
Layer 41: 686.88, 320.00, 1006.88 320.00 MiB
Layer 42: 686.88, 320.00, 1006.88 320.00 MiB
Layer 43: 686.88, 320.00, 1006.88 320.00 MiB
Layer 44: 686.88, 320.00, 1006.88 320.00 MiB
Layer 45: 686.88, 320.00, 1006.88 320.00 MiB
Layer 46: 686.88, 320.00, 1006.88 320.00 MiB
Layer 47: 686.88, 320.00, 1006.88 320.00 MiB
Layer 48: 686.88, 320.00, 1006.88 320.00 MiB
Layer 49: 686.88, 320.00, 1006.88 320.00 MiB
Layer 50: 686.88, 320.00, 1006.88 320.00 MiB
Layer 51: 686.88, 320.00, 1006.88 320.00 MiB
Layer 52: 686.88, 320.00, 1006.88 320.00 MiB
Layer 53: 686.88, 320.00, 1006.88 320.00 MiB
Layer 54: 686.88, 320.00, 1006.88 320.00 MiB
Layer 55: 686.88, 320.00, 1006.88 320.00 MiB
Layer 56: 686.88, 320.00, 1006.88 320.00 MiB
Layer 57: 686.88, 320.00, 1006.88 320.00 MiB
Layer 58: 686.88, 320.00, 1006.88 320.00 MiB
Layer 59: 686.88, 320.00, 1006.88 320.00 MiB
Layer 60: 686.88, 320.00, 1006.88 320.00 MiB
Layer 61: 686.88, 320.00, 1006.88 320.00 MiB
Layer 62: 686.88, 320.00, 1006.88 320.00 MiB
Layer 63: 686.88, 320.00, 1006.88 320.00 MiB
Layer 64: 686.88, 320.00, 1006.88 320.00 MiB
Layer 65: 686.88, 320.00, 1006.88 320.00 MiB
Layer 66: 686.88, 320.00, 1006.88 320.00 MiB
Layer 67: 686.88, 320.00, 1006.88 320.00 MiB
Layer 68: 686.88, 320.00, 1006.88 320.00 MiB
Layer 69: 686.88, 320.00, 1006.88 320.00 MiB
Layer 70: 686.88, 320.00, 1006.88 320.00 MiB
Layer 71: 686.88, 320.00, 1006.88 320.00 MiB
Layer 72: 686.88, 320.00, 1006.88 320.00 MiB
Layer 73: 686.88, 320.00, 1006.88 320.00 MiB
Layer 74: 686.88, 320.00, 1006.88 320.00 MiB
Layer 75: 686.88, 320.00, 1006.88 320.00 MiB
Layer 76: 686.88, 320.00, 1006.88 320.00 MiB
Layer 77: 686.88, 320.00, 1006.88 320.00 MiB
Layer 78: 686.88, 320.00, 1006.88 320.00 MiB
Layer 79: 686.88, 320.00, 1006.88 320.00 MiB
Layer 80: 2004.00, 181.00, 2185.00 MiB (output layer)
--------------------------------------------------------------------------
Total : 54950.00, 25781.00, 80731.00 MiB
Memory required for model tensors + cache: 82735 MiB
Memory available on all devices - compute: 89509 MiB
Setting default device in layer 0 to 0
Setting default device in layer 1 to 0
Setting default device in layer 2 to 0
Setting default device in layer 3 to 0
Setting default device in layer 4 to 0
Setting default device in layer 5 to 0
Setting default device in layer 6 to 0
Setting default device in layer 7 to 0
Setting default device in layer 8 to 0
Setting default device in layer 9 to 0
Setting default device in layer 10 to 0
Setting default device in layer 11 to 0
Setting default device in layer 12 to 0
Setting default device in layer 13 to 0
Setting default device in layer 14 to 0
Setting default device in layer 15 to 0
Setting default device in layer 16 to 0
Setting default device in layer 17 to 0
Setting default device in layer 18 to 0
Setting default device in layer 19 to 0
Setting default device in layer 20 to 0
Setting default device in layer 21 to 1
Setting default device in layer 22 to 1
Setting default device in layer 23 to 1
Setting default device in layer 24 to 1
Setting default device in layer 25 to 1
Setting default device in layer 26 to 1
Setting default device in layer 27 to 1
Setting default device in layer 28 to 1
Setting default device in layer 29 to 1
Setting default device in layer 30 to 1
Setting default device in layer 31 to 1
Setting default device in layer 32 to 1
Setting default device in layer 33 to 1
Setting default device in layer 34 to 1
Setting default device in layer 35 to 1
Setting default device in layer 36 to 1
Setting default device in layer 37 to 1
Setting default device in layer 38 to 1
Setting default device in layer 39 to 1
Setting default device in layer 40 to 1
Setting default device in layer 41 to 2
Setting default device in layer 42 to 2
Setting default device in layer 43 to 2
Setting default device in layer 44 to 2
Setting default device in layer 45 to 2
Setting default device in layer 46 to 2
Setting default device in layer 47 to 2
Setting default device in layer 48 to 2
Setting default device in layer 49 to 2
Setting default device in layer 50 to 2
Setting default device in layer 51 to 2
Setting default device in layer 52 to 2
Setting default device in layer 53 to 2
Setting default device in layer 54 to 2
Setting default device in layer 55 to 2
Setting default device in layer 56 to 2
Setting default device in layer 57 to 2
Setting default device in layer 58 to 2
Setting default device in layer 59 to 2
Setting default device in layer 60 to 2
Setting default device in layer 61 to 2
Setting default device in layer 62 to 3
Setting default device in layer 63 to 3
Setting default device in layer 64 to 3
Setting default device in layer 65 to 3
Setting default device in layer 66 to 3
Setting default device in layer 67 to 3
Setting default device in layer 68 to 3
Setting default device in layer 69 to 3
Setting default device in layer 70 to 3
Setting default device in layer 71 to 3
Setting default device in layer 72 to 3
Setting default device in layer 73 to 3
Setting default device in layer 74 to 3
Setting default device in layer 75 to 3
Setting default device in layer 76 to 3
Setting default device in layer 77 to 3
Setting default device in layer 78 to 3
Setting default device in layer 79 to 3
Setting default device in layer 80 to 3
llm_load_tensors: ggml ctx size = 13.53 MiB
================================ max_gpu = 4
Estimated model buffer size per device:
Device 0: 13741.27 MiB
Device 1: 13741.27 MiB
Device 2: 13741.27 MiB
Device 3: 13741.27 MiB
No tensors in buffer type CUDA0
No tensors in buffer type CUDA1
No tensors in buffer type CUDA2
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CPU buffer size = 2004.00 MiB
llm_load_tensors: CUDA_Split buffer size = 54965.62 MiB
llm_load_tensors: CUDA3 buffer size = 2004.03 MiB
...............................................................................................
llama_init_from_model: n_ctx = 81920
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 1024
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: v_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 500000.0
llama_init_from_model: freq_scale = 1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->2
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 0->3
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->0
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->2
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 1->3
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->0
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 2->3
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->0
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->1
=========================== ggml_cuda_set_peer_access: Enabling Peer Access between Devices 3->2
llama_kv_cache_init: CUDA_Split KV buffer size = 25600.23 MiB
llama_kv_cache_init: KV cache size per device:
llama_init_from_model: KV self size = 25600.00 MiB, K (f16): 12800.00 MiB, V (f16): 12800.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.49 MiB
llama_init_from_model: CUDA0 compute buffer size = 224.00 MiB
llama_init_from_model: CUDA1 compute buffer size = 224.00 MiB
llama_init_from_model: CUDA2 compute buffer size = 224.00 MiB
llama_init_from_model: CUDA3 compute buffer size = 533.00 MiB
llama_init_from_model: CUDA_Host compute buffer size = 192.01 MiB
llama_init_from_model: graph nodes = 7688
llama_init_from_model: graph splits = 801
llama_init_from_model: enabling only_active_experts scheduling
=============================== NCCL main communicator initialized
=============================== NCCL pair communicators for 4 GPUs initialized
Device 0: 6400 MiB
Device 1: 6400 MiB
Device 2: 6400 MiB
Device 3: 6400 MiB
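For what it's worth, the gap reported by nvidia-smi (23752 - 21420 = 2332 MiB) roughly matches the extra buffers that land on device 3 in the log above: the output layer (CUDA3 buffer size = 2004.03 MiB) plus the larger compute buffer (533 MiB vs 224 MiB, i.e. +309 MiB), about 2313 MiB in total.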