Labels
bug (Something isn't working)
Description
Name and Version
version: 7524 (5ee4e43)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
llama-fit-params -hf unsloth/GLM-4.7-GGUF:IQ2_XXS -c 65536
Problem description & steps to reproduce
On a host with 4x RTX 3090 (24 GB), 1x RTX 5060 Ti (16 GB), and 64 GB of host RAM, the command above gets stuck. Below is the output without --verbose, up to the point where it stops logging (a condensed reproduction sketch follows the log):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 4: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
common_download_file_single_online: using cached file: /home/jasonl/.cache/llama.cpp/unsloth_GLM-4.7-GGUF_UD-IQ2_XXS_GLM-4.7-UD-IQ2_XXS-00001-of-00003.gguf
common_download_file_single_online: using cached file: /home/jasonl/.cache/llama.cpp/unsloth_GLM-4.7-GGUF_UD-IQ2_XXS_GLM-4.7-UD-IQ2_XXS-00002-of-00003.gguf
common_download_file_single_online: using cached file: /home/jasonl/.cache/llama.cpp/unsloth_GLM-4.7-GGUF_UD-IQ2_XXS_GLM-4.7-UD-IQ2_XXS-00003-of-00003.gguf
build: 7524 (5ee4e43f2) with GNU 13.3.0 for Linux x86_64
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090) : 24124 total, 27958 used, 4099 deficit
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090) : 24124 total, 30479 used, 6620 deficit
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090) : 24124 total, 29214 used, 5355 deficit
llama_params_fit_impl: - CUDA3 (NVIDIA GeForce RTX 3090) : 24115 total, 28974 used, 5131 deficit
llama_params_fit_impl: - CUDA4 (NVIDIA GeForce RTX 5060 Ti): 15848 total, 18177 used, 2468 deficit
llama_params_fit_impl: projected to use 134804 MiB of device memory vs. 112335 MiB of free device memory
llama_params_fit_impl: cannot fulfill margin of 1024 MiB on all devices, need to use 28796 MiB less in total
llama_params_fit_impl: context size set by user to 65536 -> no change
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 71665 MiB
llama_params_fit_impl: filling dense-only layers back-to-front:
llama_params_fit_impl: - CUDA4 (NVIDIA GeForce RTX 5060 Ti): 41 layers, 14446 MiB used, 1262 MiB free
llama_params_fit_impl: - CUDA3 (NVIDIA GeForce RTX 3090) : 53 layers, 18698 MiB used, 5143 MiB free
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090) : 0 layers, 0 MiB used, 23859 MiB free
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090) : 0 layers, 0 MiB used, 23859 MiB free
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090) : 0 layers, 1032 MiB used, 22826 MiB free
llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory:
llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090) : 18 layers ( 1 overflowing), 22833 MiB used, 1025 MiB free
llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090) : 15 layers ( 1 overflowing), 22430 MiB used, 1428 MiB free
llama_params_fit_impl: - CUDA2 (NVIDIA GeForce RTX 3090) : 15 layers ( 1 overflowing), 22657 MiB used, 1201 MiB free
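Condensed steps to reproduce (a sketch; only flags already shown in this report are used, everything else about the environment is as described above):

```sh
# 4x RTX 3090 (24 GB) + 1x RTX 5060 Ti (16 GB), 64 GB host RAM, build 7524 (5ee4e43)
llama-fit-params -hf unsloth/GLM-4.7-GGUF:IQ2_XXS -c 65536
# Observed: output stops after the "converting dense-only layers to full layers"
# lines above and the process never returns.
```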
First Bad Commit
No response
Relevant log output
When running with --verbose, the output log grows seemingly forever, or at least for as long as I was willing to let it run. Here's ~200 MB worth, gzipped:
[fit.txt.gz](https://github.com/user-attachments/files/24321672/fit.txt.gz)
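For reference, a sketch of one way to capture a verbose log like the attachment (the exact redirection used to produce fit.txt.gz is an assumption):

```sh
# Assumption: all output goes to stdout/stderr; interrupt with Ctrl-C once the log is large enough.
llama-fit-params -hf unsloth/GLM-4.7-GGUF:IQ2_XXS -c 65536 --verbose 2>&1 | gzip > fit.txt.gz
```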