@loci-dev
Mirrored from ggml-org/llama.cpp#17485

Dynamic n_gpu_layers Heuristic for Low-VRAM GPUs

Summary

This PR implements a dynamic n_gpu_layers calculation based on available VRAM to enable optimal GPU offloading on low-VRAM devices like the AMD RX 6500 XT.

Motivation

The primary motivation for this PR is to enable practical, efficient use of llama.cpp on low-VRAM GPUs such as the AMD RX 6500 XT, which is particularly compelling due to its low power consumption and affordability. Many users—including the author—cannot justify purchasing a higher-end GPU, yet still want meaningful acceleration from Vulkan offloading.

Instead of requiring users to manually tune n_gpu_layers, this PR automates the process to prevent OOM crashes while maximizing acceleration.

The design also comports with the expectations outlined in the llama.cpp CONTRIBUTING.md guidelines:

  • The feature is self-contained and maintains codebase minimalism.
  • It adds functionality without modifying core operators.
  • It uses clear naming conventions and avoids architectural complexity.
  • It provides documentation, benchmarks, and reasoning consistent with contributor requirements.

Changes

Core Implementation

Dynamic Heuristic (common/common.cpp):

  • Queries GGUF metadata for model size and layer count
  • Calculates optimal n_gpu_layers based on available VRAM
  • Reserves 800MB overhead for KV cache and compute buffers
  • Triggered when n_gpu_layers = -1 (default)
  • Generalizes across architectures (Gemma, Llama, Qwen, etc.)
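For clarity, here is a minimal sketch of the kind of calculation described above. The function name, rounding, and per-layer size estimate are illustrative assumptions, not the PR's verbatim code:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch only: estimate how many layers fit in free VRAM after
// reserving a fixed ~800 MB overhead for the KV cache and compute buffers.
static int32_t estimate_n_gpu_layers(size_t free_vram, size_t model_size_bytes, int32_t n_layers) {
    const size_t overhead = 800ull * 1024 * 1024;
    if (n_layers <= 0 || free_vram <= overhead) {
        return 0; // not enough headroom to offload anything safely
    }
    const size_t budget          = free_vram - overhead;
    const size_t bytes_per_layer = model_size_bytes / (size_t) n_layers; // rough per-layer weight size
    const int32_t fit            = (int32_t) (budget / bytes_per_layer);
    return fit > n_layers ? n_layers : fit; // small models: offload everything
}
```

In the actual change this calculation only runs when the user leaves n_gpu_layers at its -1 default, so explicitly passed -ngl values are respected.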

VRAM Query API (ggml-vulkan.cpp):

  • Added ggml_backend_vk_get_device_memory() to query available VRAM
  • Exposes device memory info to heuristic layer
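As a usage sketch (the signature follows the PR description and the existing ggml_backend_*_get_device_memory naming; treat the exact wiring as an assumption):

```cpp
#include <cstdio>
#include "ggml-vulkan.h"

int main() {
    size_t free_vram = 0, total_vram = 0;
    // Query free/total VRAM on Vulkan device 0; the heuristic in common/common.cpp
    // would consume free_vram when deciding how many layers to offload.
    ggml_backend_vk_get_device_memory(/*device=*/0, &free_vram, &total_vram);
    printf("VRAM: %zu MiB free / %zu MiB total\n", free_vram >> 20, total_vram >> 20);
    return 0;
}
```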

Documentation & Testing

  • Added docs/windows_vulkan_low_vram.md
  • Benchmark scripts for validation
  • Inline comments explaining heuristic logic

Performance (llama-bench)

Hardware: AMD RX 6500 XT (4GB VRAM)
Model: Gemma 2B Q4_K_M (1.59 GiB)

Performance Summary

| Metric | CPU-only | GPU Heuristic | Improvement |
|---|---|---|---|
| Prompt processing (pp512) | 497 t/s | 1231 t/s | +147% |
| Token generation (tg128) | 19.4 t/s | 60.4 t/s | +212% |
| Layers offloaded | 0/27 | 26/27 | Auto-optimized |

Multi-Model Results

| Model | Size | Layers Offloaded | Performance |
|---|---|---|---|
| Gemma 2B | 1.6 GB | 26/27 (96%) | 2.5–3.1× faster |
| Llama 3.2 3B | 1.9 GB | 28/29 (97%) | ~2× faster |
| Llama 2 7B | 3.9 GB | 21/33 (64%) | 1.6× faster |

Key Insight: The heuristic maximizes offloading for small models while preventing OOM on larger models.
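Back-of-envelope illustration (the free-VRAM figure is an assumption, since the driver and desktop hold part of the 4 GB): with roughly 3.3 GB free, subtracting the 800 MB reserve leaves about a 2.5 GB budget. Llama 2 7B at 3.9 GB across 33 layers is roughly 0.12 GB per layer, so about 21 layers fit, matching the 21/33 figure above, while Gemma 2B's ~60 MB per layer fits almost entirely.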

Testing

  • llama-bench: Verified a 2.5–3.1× speedup on Gemma 2B
  • Multi-model: Tested on Gemma 2B, Llama 2 7B, and Llama 2 13B
  • OOM prevention: Larger models degrade gracefully (no crashes)
  • Platform: Windows 11, AMD RX 6500 XT
  • Cross-platform: Linux/macOS testing pending; the code is platform-agnostic, so no issues are expected

Compliance

  • ✅ No third-party dependencies
  • ✅ Follows naming conventions (snake_case, longest prefix)
  • ✅ No ggml operators modified
  • ✅ Trailing whitespace cleaned
  • ✅ clang-format run

Maintainer

Requesting review from @0cc4m (Vulkan backend maintainer per CODEOWNERS).
Willing to maintain this long-term if accepted as a collaborator, and hoping to extend this method to whisper and ggml for the same reasons!

@loci-dev force-pushed the main branch 12 times, most recently from 92ef8cd to 7dd50b8 on November 26, 2025 at 16:10.