-
Notifications
You must be signed in to change notification settings - Fork 5.1k
vulkan: fix SIGSEGV on Android when using buffer_device_address #3590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@0cc4m Does this patch look OK? |
|
That would be a question for @jeffbolznv, who implemented buffer_device_address. |
|
Wait before merge, I think it breaks Android emulator: |
|
Please share a vulkaninfo log. I think we should only be calling this function if it's enabled in the 1.2 features, in which case the extension shouldn't be necessary. |
7622b03 to
1ee93ca
Compare
I updated my PR text with both vulkaninfo (arm64 and x86_64 emulator). |
|
I think what's happening is this device doesn't actually support Vulkan 1.2: so it's technically out of spec for what we intended to support, but if it's simple to get it working I don't mind. I'll leave a few suggestions in a review. |
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
| getenv("GGML_VK_ENABLE_MEMORY_PRIORITY")) { | ||
| device->memory_priority = true; | ||
| } else if (strcmp("VK_KHR_buffer_device_address", properties.extensionName) == 0) { | ||
| buffer_device_address_ext = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest renaming this to buffer_device_address_khr, to avoid confusion with VK_EXT_buffer_device_address.
|
|
||
| device_extensions.push_back("VK_KHR_16bit_storage"); | ||
| if (fp16_storage) { | ||
| device_extensions.push_back("VK_KHR_16bit_storage"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This extension was promoted to Vulkan 1.1. Is this change necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on x86_64 emulator, without this change:
Abort message: 'terminating due to uncaught exception of type vk::ExtensionNotPresentError: vk::PhysicalDevice::createDevice: ErrorExtensionNotPresent'
#00 abort+191 libc.so
#01 abort_message libcxxabi/src/abort_message.cpp:78
#02 demangling_terminate_handler() libcxxabi/src/cxa_default_handlers.cpp:72
#03 ggml_uncaught_exception() ggml/src/ggml.cpp:11
#04 std::__terminate() libcxxabi/src/cxa_handlers.cpp:59
#05 __cxxabiv1::failed_throw() libcxxabi/src/cxa_exception.cpp:152
#06 __cxa_throw libcxxabi/src/cxa_exception.cpp:294
#07 vk::throwResultException() vulkan/vulkan.hpp:6544
vk::resultCheck() vulkan/vulkan.hpp:6738
vk::PhysicalDevice::createDevice() vulkan/vulkan_funcs.hpp:412
ggml_vk_get_device() ggml-vulkan/ggml-vulkan.cpp:4935
#08 ggml_backend_vk_buffer_type ggml-vulkan/ggml-vulkan.cpp:12636
#09 ggml_backend_vk_device_get_buffer_type() ggml-vulkan/ggml-vulkan.cpp:14002
#10 ggml_backend_dev_buffer_type ggml-backend.cpp:517
#11 make_buft_list() whisper.cpp:1374
#12 whisper_model_load() whisper.cpp:1712
#13 whisper_init_with_params_no_state whisper.cpp:3723
#14 whisper_init_with_params whisper.cpp:3766
#15 process_load_model_command jni.c:523
worker_thread_func jni.c:752
#16 __pthread_start libc.so
#17 __start_thread libc.so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it ought to work if we don't push this extension string at all, but this change is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The capability is definitely required, the backend cannot run without fp16_storage. If the way we are currently initialising it with the extension does not work for you, please find another way (like not pushing the string at all). Making it conditional is just confusing.
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
| #endif | ||
|
|
||
| if (device->fp16) { | ||
| if (device->fp16 && fp16_compute) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant since device->fp16 = !force_disable_f16 && fp16_storage && fp16_compute;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
| // Required on some Android devices, if extension exists (it's core in Vulkan 1.2) | ||
| if (device->buffer_device_address && buffer_device_address_ext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest changing to comment to something like:
// Required for physical devices that only support Vulkan 1.1
and removing the device->buffer_device_address check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where we currently set device->buffer_device_address = vk12_features.bufferDeviceAddress;, I think you should change this to ... || buffer_device_address_khr. I'm surprised it comes back as supported, it might just be an uninitialized value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #3590 (comment)
It seems, we need to push the extension even if we don't use it ? Ok that's weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH no, that's not the same crash, we were crashing when using the function pointer that was NULL here.
so: vk12_features.bufferDeviceAddress is true, but fp was NULL. Without pushExt and dispatcher.init(device).
| device->device = device->physical_device.createDevice(device_create_info); | ||
|
|
||
| // optionally initialize the dispatcher with a vk::Device to get | ||
| // device-specific function pointers (needed on Android) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually needed? Do you know why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed, I double-checked by removing it and keeping other fixes: same crash.
Note that this is also needed on a WIP branch on VLC for android, I use volk and I had to call volkLoadDevice(device) (in addition to volkLoadInstance(instance), see https://github.com/zeux/volk?tab=readme-ov-file#optimizing-device-calls (but that does not say why it is needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned this won't work on systems with multiple vulkan devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's a problem...
vkGetBufferDeviceAddress is not loaded without explicit VK_KHR_buffer_device_address extension Also init disaptcher with device as (optionally) documented in https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers- ``` ********** Crash dump: ********** #00 0x0000000000000000 <unknown> ggml-org#1 0x000000000317c7b4 libggml-vulkan.so vk::Device::getBufferAddress<vk::DispatchLoaderDynamic>(vk::BufferDeviceAddressInfo const&, vk::DispatchLoaderDynamic const&) const vulkan/vulkan_funcs.hpp:6563:30 ggml_vk_create_buffer(std::shared_ptr<vk_device_struct>&, unsigned long, std::initializer_list<vk::Flags<vk::MemoryPropertyFlagBits>> const&) ggml/src/ggml-vulkan/ggml-vulkan.cpp:2469:40 ggml-org#2 0x000000000317b728 libggml-vulkan.so ggml_vk_create_buffer_device(std::shared_ptr<vk_device_struct>&, unsigned long) ggml/src/ggml-vulkan/ggml-vulkan.cpp:2497:19 ggml-org#3 0x000000000317b4b0 libggml-vulkan.so ggml_backend_vk_buffer_type_alloc_buffer(ggml_backend_buffer_type*, unsigned long) ggml/src/ggml-vulkan/ggml-vulkan.cpp:12591:22 ggml-org#4 0x000000000006f3a8 libggml-base.so alloc_tensor_range ggml/src/ggml-alloc.c:1135:36 ggml-org#5 0x000000000006e608 libggml-base.so ggml_backend_alloc_ctx_tensors_from_buft_impl ggml/src/ggml-alloc.c:1206:27 ggml-org#6 0x000000000006e70c libggml-base.so ggml_backend_alloc_ctx_tensors_from_buft ggml/src/ggml-alloc.c:1244:12 ggml-org#7 0x0000000000037d04 libwhisper.so whisper_model_load(whisper_model_loader*, whisper_context&) src/whisper.cpp:1852:37 ggml-org#8 0x000000000003653c libwhisper.so whisper_init_with_params_no_state src/whisper.cpp:3723:10 ggml-org#9 0x0000000000038768 libwhisper.so whisper_init_with_params src/whisper.cpp:3766:29 ```
Fix SIGABRT on Android emulator: ``` terminating due to uncaught exception of type vk::ExtensionNotPresentError: vk::PhysicalDevice::createDevice: ErrorExtensionNotPresent ```
1ee93ca to
2d7a110
Compare
| device->device = device->physical_device.createDevice(device_create_info); | ||
|
|
||
| // optionally initialize the dispatcher with a vk::Device to get | ||
| // device-specific function pointers (needed on Android) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned this won't work on systems with multiple vulkan devices.
|
|
||
| device_extensions.push_back("VK_KHR_16bit_storage"); | ||
| if (fp16_storage) { | ||
| device_extensions.push_back("VK_KHR_16bit_storage"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it ought to work if we don't push this extension string at all, but this change is fine.
| } | ||
|
|
||
| // Required for physical devices that only support Vulkan 1.1 | ||
| if (device->buffer_device_address && buffer_device_address_khr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check whether buffer_device_address is initialized based on stack garbage in VkPhysicalDeviceVulkan12Features? Another option might be to just zero-init that structure and then you don't need the getBufferDeviceAddress function pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same isssue if I do VkPhysicalDeviceVulkan12Features vk12_features {}; (and all other Features structs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this was not the case, then how did we get buffer_device_address set to true in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if I understand your comment.
Locally, I initialized all Features structures to 0. buffer_device_address was set to true, so this means this extension is present. This is confirmed by vulkaninfo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is why, without any of your changes, is device->buffer_device_address == true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, and I still think it's probably just an uninitialized structure causing us to think it's supported.
Initialized from where ? Bug on android side ?
Because I tried with a version that initialize all structs {}, and this extension is on vulkaninfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you initialize vk12_features with {}, on the Vulkan 1.1 implementation nothing should be filling out the members. I'm very surprised if it does, and maybe you could set a memory breakpoint to find out where that's happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes senses. I printed enumerateInstanceVersion it is 4206592 = VK_API_VERSION_1_3
ggml-vulkan is testing loader version and not the actual device ? But if we change that, this will disable most android devices. I can confirm ggml works well with Vulkan GPU.
I tested 3 different devices for info (all crashing without my patches).
I also confirm that only forcing buffer_device_address to false fix the crash (without needing extra dispatcher init).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question: Is 1.3 Loader aware of 1.2 features ? So that 1.2 features are loaded and can be used on 1.1 devices
If that's the case, I think we should keep that features for android devices.
I could store vkGetDeviceProcAddr(device, "vkGetBufferDeviceAddress"); to remove the extra VULKAN_HPP_DEFAULT_DISPATCHER.init if you agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggml-vulkan is testing loader version and not the actual device ?
I think it is, and that's probably not quite what we want. ggml-vulkan generally checks for any optional features it relies on (like bufferDeviceAddress - optional in Vulkan 1.2) but if something is mandatory in Vulkan 1.2 we may not check for it.
My question: Is 1.3 Loader aware of 1.2 features ? So that 1.2 features are loaded and can be used on 1.1 devices
I don't think the loader emulates features structures. I'd be very surprised if that's the case.
I'd still like to better understand how vk12_features.bufferDeviceAddress is getting set to true.
vkGetBufferDeviceAddress is not loaded without explicit VK_KHR_buffer_device_address extension
Also init disaptcher with device as (optionally) documented in https://github.com/KhronosGroup/Vulkan-Hpp?tab=readme-ov-file#extensions--per-device-function-pointers-
Android arm64:
Android x86_64 emulator:
Tested on Linux desktop and Android.