
Conversation

@System233

Users can manually specify the memory size of a device with the GGML_VK_DEVICE{idx}_MEMORY environment variable, so the workload can be allocated according to their specific needs.

For example, setting GGML_VK_DEVICE0_MEMORY=2000000000 configures 2000 MB of memory for the Vulkan0 device, and a corresponding 2000 MB share of the model computation workload is allocated to that device.

This is especially useful in environments with integrated graphics: users no longer need to reboot into the BIOS to configure VRAM, nor worry about memory reserved as VRAM sitting idle. Simply setting the GGML_VK_DEVICE{idx}_MEMORY environment variable to the desired amount is a significant benefit for future devices equipped with high-performance integrated graphics.

Finally, the original behavior is preserved when GGML_VK_DEVICE{idx}_MEMORY is not set, just like with GGML_VK_VISIBLE_DEVICES.
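For illustration, here is a minimal sketch of how such a per-device override could be read, assuming the value is given in bytes and that an unset or unparsable variable falls back to the driver-reported size; the function name and surrounding code are hypothetical and not the actual patch in this PR:

```cpp
// Hypothetical sketch: read a GGML_VK_DEVICE{idx}_MEMORY override for device `idx`.
// Returns the override in bytes, or `reported_bytes` (the size the driver reports)
// when the variable is unset or invalid, preserving the default behavior.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>

static uint64_t vk_device_memory_override(int idx, uint64_t reported_bytes) {
    std::string name = "GGML_VK_DEVICE" + std::to_string(idx) + "_MEMORY";
    const char * val = std::getenv(name.c_str());
    if (val == nullptr || *val == '\0') {
        return reported_bytes;          // variable not set: keep original behavior
    }
    char * end = nullptr;
    unsigned long long parsed = std::strtoull(val, &end, 10);
    if (end == val || parsed == 0) {
        return reported_bytes;          // unparsable or zero: ignore the override
    }
    return (uint64_t) parsed;           // value is interpreted as bytes
}

int main() {
    // Example: with GGML_VK_DEVICE0_MEMORY=2000000000 in the environment,
    // device 0 would report 2000 MB instead of the driver-reported 8 GiB.
    uint64_t mem = vk_device_memory_override(0, 8ull * 1024 * 1024 * 1024);
    std::printf("device 0 usable memory: %llu bytes\n", (unsigned long long) mem);
    return 0;
}
```

Falling back to the reported size when the variable is absent is what keeps the default behavior unchanged, as described above.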


wbruna commented Feb 16, 2025

Since my 3400G seems to behave the same for any GGML_VK_DEVICE0_MEMORY value above 0, I guess this only matters for splitting across devices. But if that's the case, perhaps it'd be better to extend LLAMA_ARG_TENSOR_SPLIT to accept absolute values?

users no longer need to reboot and enter the BIOS to configure VRAM

That's unfortunately not quite the same: I notice significant speed improvements with reserved VRAM instead of shared memory (see some of my tests on the Vulkan speed discussion: #10879 (comment)).


0cc4m commented Feb 17, 2025

I think tensor split is the only thing this affects (apart from some log lines), and not in a way different from the tensor-split argument.

@System233 Can you give more detail on what you are trying to solve here? Are you using multiple GPUs? In the single-GPU case this doesn't do anything, apart from changing the memory that the application reports.

@System233 (Author)

Apologies, it was my mistake. Previously, I was using the Vulkan backend for Llama in LMStudio, and I removed LMS's GPU device filtering, allowing it to call the 780M integrated GPU. However, it consistently only allocated 768MB of VRAM as reported by Llama. I didn’t want to assign too much dedicated VRAM to the integrated GPU in the BIOS, so I modified Llama's code, then recompiled and replaced LMS's Vulkan backend.

@0cc4m You were right. Using llama-cli with the -ts parameter makes it very convenient to configure the workload for each device. I’m really sorry about that.
@wbruna Thank you for providing the benchmark. This is the first time I’ve seen a comparison of the performance difference between dedicated and shared VRAM on integrated GPUs. I previously thought there wasn’t much of a difference, and I even believed allocating large amounts of VRAM to integrated graphics was unnecessary.

System233 closed this Feb 17, 2025

0cc4m commented Feb 17, 2025

Apologies, it was my mistake. Previously, I was using the Vulkan backend for Llama in LMStudio, and I removed LMS's GPU device filtering, allowing it to call the 780M integrated GPU. However, it consistently only allocated 768MB of VRAM as reported by Llama. I didn’t want to assign too much dedicated VRAM to the integrated GPU in the BIOS, so I modified Llama's code, then recompiled and replaced LMS's Vulkan backend.

I understand, so LMStudio uses the amount of available memory in some way. They were probably just thinking of dedicated GPUs, since integrated GPUs can use more than just the small portion of RAM that is dedicated to them.

If you can manually set the number of GPU layers in LMStudio, you should be able to set it much higher on your 780M than the VRAM size would indicate. You should be able to use at least up to half of your RAM without any BIOS changes.
