-
Notifications
You must be signed in to change notification settings - Fork 13.4k
ggml : remove assert for AArch64 GEMV and GEMM Q4 kernels #9217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi, it looks like the checks failed for server-windows, but after reviewing the logs, it seems unrelated to my changes. Could someone with more context on the server-windows take a look? I'm happy to rebase or make any necessary adjustments if needed. Thanks! |
ggml/src/ggml-aarch64.c
Outdated
// Print a given message only once | ||
static inline void print_message_once(const char* message) { | ||
static bool printed = false; | ||
if (!printed) { | ||
fprintf(stderr, "%s\n", message); | ||
printed = true; | ||
} | ||
} | ||
|
||
// Return The number of 32-bit lanes in the SVE vector if SVE is supported; otherwise, returns 0 if SVE is not supported. | ||
static int sve_lane_count(void) { | ||
static int lane_count = -1; | ||
if (lane_count == -1) { | ||
#if defined(__ARM_FEATURE_SVE) | ||
lane_count = ggml_sve_cnt_b; | ||
#else | ||
lane_count = 0; | ||
#endif | ||
} | ||
return lane_count; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these will generate data race warnings when the thread sanitizer is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov Thanks for reviewing the code and I'll update to make the function thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov I have pushed the commit that addresses your concern about the data race condition.
1c50977
to
a53cac3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fallback changes are good, but the prints are not OK for multiple reasons:
- Sometimes we might want to run a given quantization type, regardless if there is a more optimal option. Yet, there would still be a warning message.
- The low-level code should not deal with printing. This logic ideally has to be moved into the user code and let the application decide if and how to warn the user.
- There is still a data race on
warning_message
My suggestion is to reduce the PR just to the fallback related changes.
@ggerganov thanks for your review. I have pushed the commit based on your suggestions. The patch is a reduced PR that only removes the assert and adds the fallback. There will be a separate PR that addresses the warning messages in the user code, suggesting that a different model could be used for better performance, if it applies. |
GGML_ASSERT(!(ggml_cpu_has_sve() && (ggml_sve_cnt_b == QK8_0)) && | ||
"__ARM_FEATURE_SVE defined, use the Q4_0_8_8 quantization format for optimal performance"); | ||
#if ! ((defined(_MSC_VER)) && ! defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON) | ||
if (ggml_cpu_has_neon()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if these function calls are properly inlined. AFAIK with LTO enabled, they should be, but maybe it's better if instead of relying on the compilationlinker to do it for us, we can read the value once into a static variable and check that variable from then on.
The function call overhead is probably negligible, but still, since we are in a hot loop, it might make a differnce. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov Thanks for the review and sorry for the late response as I was on vacation. I'll look into your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One possible way to reduce the overhead of function calls such as ggml_cpu_has_neon() is to cache the results using static variables, as you suggested:
static bool neon_support_checked = false;
static bool neon_supported = false;
To prevent race conditions, the values of neon_support_checked and neon_supported need to be protected by a mutex or atomic operations:
static pthread_mutex_t neon_check_mutex = PTHREAD_MUTEX_INITIALIZER;
inline bool is_neon_supported() {
pthread_mutex_lock(&neon_check_mutex);
// Check only if not already checked
if (!neon_support_checked) {
neon_supported = ggml_cpu_has_neon();
neon_support_checked = true;
}
pthread_mutex_unlock(&neon_check_mutex); // Release the lock
return neon_supported;
}
By marking is_neon_supported() as inline, we may reduce the function call overhead. However, since the function still contains a mutex, the overall performance improvement could be limited due to the cost of acquiring and releasing the lock.
Another possible optimization might be to inline the ggml_cpu_has_neon() function itself, although that might be beyond the scope of this PR.
I'm interested to hear what you think, and I'm happy to consider any suggestions or improvements you may have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, mutex is an overkill. Let's merge it as it is. You can easily check if the function calls add any overhead by replacing with if (true)
. If they do, you can try to find a way to do thread-safe static init without synchronization primitives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve done some overhead tests for the function calls and didn’t observe any statistically significant performance differences. Given that, I’m fine with merging the code as is after I rebase it. Please let me know if you think I should do anything differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's merge after rebase.
* added fallback mechanism when the offline re-quantized model is not optimized for the underlying target.
71cf0e1
to
7276e2b
Compare
@ggerganov Thanks so much for taking the time to review and merge my PR. I really appreciate your feedback and guidance throughout the process. Looking forward to contributing more in the future. |
@chaxu01 Welcome and looking forward to more contributions in the future. We still don't have CI for this code due to lack of hardware, so the review process is more difficult and takes longer. |
Added a fallback mechanism for cases where the offline re-quantized model is not optimized for the underlying target for AArch64 GEMV and GEMM. In such situations, the mulmat will fallback to a suitable implementation, with the reference implementation as a last resort, instead of triggering an assert. This commit also aligns with an upcoming PR that dynamically detects CPU features.