
Conversation

@giladgd giladgd (Contributor) commented Aug 24, 2025

Use the `VK_EXT_memory_budget` extension, when available, to read the memory consumption of a Vulkan device.

@giladgd giladgd requested a review from 0cc4m as a code owner August 24, 2025 17:58
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Aug 24, 2025
@jeffbolznv (Collaborator)

Is this solving any particular problem?

@giladgd giladgd (Contributor, author) commented Aug 24, 2025

Yes, I forgot to mention.
In downstream projects (like node-llama-cpp in my case) that use `ggml_backend_dev_memory` to check the memory usage of a backend device, the CUDA backend reports the actual memory usage just fine, but the Vulkan backend always reports the entire memory as free regardless of actual usage. This PR fixes that.
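
For illustration, a minimal sketch of that kind of query through the public ggml API (the enumeration loop is illustrative, not node-llama-cpp's actual code; it assumes ggml was built with the relevant backends linked in):

```cpp
#include "ggml-backend.h"
#include <cstdio>

int main() {
    // Enumerate all registered backend devices and query their memory state.
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free = 0, total = 0;
        ggml_backend_dev_memory(dev, &free, &total);
        // Before this PR, the Vulkan backend always reported free == total here.
        printf("%s: %zu bytes free / %zu bytes total\n",
               ggml_backend_dev_name(dev), free, total);
    }
    return 0;
}
```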

@slaren slaren (Member) commented Aug 24, 2025

Additionally, in llama.cpp, when using multiple GPUs, the free memory is used to determine the default layer split.


```cpp
for (const auto & ext : extensionprops) {
    if (std::string(ext.extensionName.data()) == VK_EXT_MEMORY_BUDGET_EXTENSION_NAME) {
        membudget_extension_supported = true;
```
Collaborator:

This list can include hundreds of extensions, I think you should precompute this when the instance is created.

Contributor Author (giladgd):

Good idea, I've moved the support detection to `ggml_vk_instance_init`.


```cpp
bool membudget_supported = false;
for (const auto & ext : extensionprops) {
    if (std::string(ext.extensionName.data()) == VK_EXT_MEMORY_BUDGET_EXTENSION_NAME) {
```
Collaborator:

I'd prefer strcmp, but it's not a huge deal.
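
For reference, a sketch of the `strcmp` form (not the exact committed code; it assumes `<cstring>` is included):

```cpp
for (const auto & ext : extensionprops) {
    // Compare the null-terminated extension name directly, avoiding a
    // temporary std::string allocation for each of the (potentially
    // hundreds of) reported extensions.
    if (strcmp(ext.extensionName.data(), VK_EXT_MEMORY_BUDGET_EXTENSION_NAME) == 0) {
        membudget_supported = true;
        break;
    }
}
```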

Contributor Author (giladgd):

Done

@0cc4m 0cc4m (Collaborator) commented Aug 31, 2025

This implementation does not work yet. The problem is that `heapUsage` only shows the current process's heap usage, which at the start of the process is basically 0; see the VK_EXT_memory_budget documentation. The correct way is to return the `memoryBudget` instead. I have fixed this and also combined the two `getMemoryProperties` calls.

```diff
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index b7f8b5a38..96e244c72 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -11497,25 +11497,24 @@ void ggml_backend_vk_get_device_memory(int device, size_t * free, size_t * total
     GGML_ASSERT(device < (int) vk_instance.device_supports_membudget.size());
 
     vk::PhysicalDevice vkdev = vk_instance.instance.enumeratePhysicalDevices()[vk_instance.device_indices[device]];
-    vk::PhysicalDeviceMemoryProperties memprops = vkdev.getMemoryProperties();
     bool membudget_supported = vk_instance.device_supports_membudget[device];
 
+    vk::PhysicalDeviceMemoryProperties2 memprops;
     vk::PhysicalDeviceMemoryBudgetPropertiesEXT budgetprops;
-    vk::PhysicalDeviceMemoryProperties2 memprops2 = {};
 
     if (membudget_supported) {
-        memprops2.pNext = &budgetprops;
-        vkdev.getMemoryProperties2(&memprops2);
+        memprops.pNext = &budgetprops;
     }
+    vkdev.getMemoryProperties2(&memprops);
 
-    for (uint32_t i = 0; i < memprops.memoryHeapCount; ++i) {
-        const vk::MemoryHeap & heap = memprops.memoryHeaps[i];
+    for (uint32_t i = 0; i < memprops.memoryProperties.memoryHeapCount; ++i) {
+        const vk::MemoryHeap & heap = memprops.memoryProperties.memoryHeaps[i];
 
         if (heap.flags & vk::MemoryHeapFlagBits::eDeviceLocal) {
             *total = heap.size;
 
             if (membudget_supported && i < budgetprops.heapUsage.size()) {
-                *free = *total - budgetprops.heapUsage[i];
+                *free = budgetprops.heapBudget[i];
             } else {
                 *free = heap.size;
             }
```

As a sidenote, for whatever reason Intel shows a pretty low budget, despite empty VRAM:

```
memoryHeaps[0]:
        size   = 16810770432 (0x3ea000000) (15.66 GiB)
        budget = 14891876352 (0x377a00000) (13.87 GiB)
```

while AMD looks as expected:

```
memoryHeaps[1]:
        size   = 17163091968 (0x3ff000000) (15.98 GiB)
        budget = 17152225280 (0x3fe5a3000) (15.97 GiB)
```

and so does Nvidia:

```
memoryHeaps[0]:
        size   = 25769803776 (0x600000000) (24.00 GiB)
        budget = 25281167360 (0x5e2e00000) (23.54 GiB)
```

@giladgd giladgd (Contributor, author) commented Aug 31, 2025

@0cc4m Good catch; I had only tested from within the current process, where this method seemed more precise and reported the same memory footprint across both the Vulkan and CUDA backends.

It appears that `budgetprops.heapBudget[i]` reports the available memory budget excluding the usage of the current process, so `budgetprops.heapBudget[i] - budgetprops.heapUsage[i]` seems to be what we want.

I noticed that `budgetprops.heapBudget[i]` includes some Vulkan overhead related to the current process: checking it before and after loading gpt-oss 20b mxfp4 showed a difference of -21.88 MB.
Maybe it's worth adding an additional method to `ggml_backend_device_i` that checks the memory usage of the current process in isolation, to make precise inspection easier. I can do that in another PR.
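
In code, the combined logic lands roughly here (a sketch building on the diff above; the fallback path for devices without the extension and the surrounding function are omitted):

```cpp
// Assumes membudget_supported was cached at instance init (see earlier in
// this thread) and vkdev is the selected vk::PhysicalDevice.
vk::PhysicalDeviceMemoryBudgetPropertiesEXT budgetprops;
vk::PhysicalDeviceMemoryProperties2 memprops;
memprops.pNext = &budgetprops;
vkdev.getMemoryProperties2(&memprops);

for (uint32_t i = 0; i < memprops.memoryProperties.memoryHeapCount; ++i) {
    const vk::MemoryHeap & heap = memprops.memoryProperties.memoryHeaps[i];
    if (heap.flags & vk::MemoryHeapFlagBits::eDeviceLocal) {
        *total = heap.size;
        // heapBudget excludes other processes' usage but includes this
        // process's own allocations, so subtracting heapUsage yields the
        // memory still available to this process.
        *free = budgetprops.heapBudget[i] - budgetprops.heapUsage[i];
        break;
    }
}
```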

Also, thanks for testing this with various GPUs!
I only have access to a machine with an Nvidia GPU (besides my own Mac), so I can only test on that.

@0cc4m 0cc4m (Collaborator) commented Sep 1, 2025

> It appears that `budgetprops.heapBudget[i]` reports the available memory budget excluding the usage of the current process, so `budgetprops.heapBudget[i] - budgetprops.heapUsage[i]` seems to be what we want.

Oh yeah, that's true. I only thought about the initial value for layer estimations, but of course you can keep using it later in the program. Thank you.

@0cc4m 0cc4m (Collaborator) left a comment

It's working as expected now, at least on AMD and Nvidia. On Intel the number doesn't change from the ~14/16 GB it shows, regardless of how loaded the GPU is. But that is a driver issue.

@0cc4m 0cc4m merged commit d4d8dbe into ggml-org:master Sep 1, 2025
45 of 48 checks passed
walidbr pushed a commit to walidbr/llama.cpp that referenced this pull request Sep 7, 2025

* vulkan: use memory budget extension to read memory usage

* fix: formatting and names

* formatting

* fix: detect and cache memory budget extension availability on init

* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available

* style: lints
