UPSTREAM PR #17485: vulkan : add dynamic VRAM heuristic for low-VRAM GPUs #315
Mirrored from ggml-org/llama.cpp#17485
## Dynamic `n_gpu_layers` Heuristic for Low-VRAM GPUs

### Summary

This PR implements a dynamic `n_gpu_layers` calculation based on available VRAM to enable optimal GPU offloading on low-VRAM devices such as the AMD RX 6500 XT.

### Motivation
The primary motivation for this PR is to enable practical, efficient use of llama.cpp on low-VRAM GPUs such as the AMD RX 6500 XT, which is particularly compelling due to its low power consumption and affordability. Many users—including the author—cannot justify purchasing a higher-end GPU, yet still want meaningful acceleration from Vulkan offloading.
Instead of requiring users to manually tune `n_gpu_layers`, this PR automates the process to prevent OOM crashes while maximizing acceleration. The design also comports with the expectations outlined in the llama.cpp CONTRIBUTING.md guidelines.
### Changes

#### Core Implementation

- **Dynamic heuristic** (`common/common.cpp`): computes `n_gpu_layers` based on available VRAM when `n_gpu_layers = -1` (the default).
- **VRAM query API** (`ggml-vulkan.cpp`): `ggml_backend_vk_get_device_memory()` to query available VRAM.

#### Documentation & Testing

- Added `docs/windows_vulkan_low_vram.md`.

### Performance (llama-bench)
- **Hardware:** AMD RX 6500 XT (4GB VRAM)
- **Model:** Gemma 2B Q4_K_M (1.59 GiB)
#### Performance Summary

#### Multi-Model Results

**Key Insight:** The heuristic maximizes offloading for small models while preventing OOM on larger models.
### Testing

### Compliance

- `clang-format` run

### Maintainer
Requesting review from @0cc4m (Vulkan backend maintainer per CODEOWNERS).
Willing to maintain this long-term if accepted as a collaborator, and hoping to extend this approach to whisper and ggml for the same motivations.