CUDA: Changing the CUDA scheduling strategy to spin #16585
Conversation
Is this change still needed if you set the operating system's power settings to something like "prefer maximum performance"?
I can confirm that this patch improves generation performance on NVIDIA DGX Spark with
Force-pushed from a67a2c0 to a33e305
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
// CUBLAS_CHECK(cublasLoggerConfigure(1, 1, 0, nullptr));

// Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
// This fix is temporary, as the strategy will be the default in later drivers.
```
Suggested change (whitespace only):

```diff
- // This fix is temporary, as the strategy will be the default in later drivers.
+ // This fix is temporary, as the strategy will be the default in later drivers.
```
To my knowledge there are no mobile devices using compute capability 12.1, so it should be fine to set `cudaDeviceScheduleSpin` unconditionally.
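For context, `cudaSetDeviceFlags` selects how a host thread waits inside synchronization calls such as `cudaStreamSynchronize`. A minimal standalone sketch of the unconditional variant (not the ggml code; error handling simplified):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// cudaDeviceScheduleSpin:         busy-wait on the CPU for lowest sync latency.
// cudaDeviceScheduleYield:        yield the CPU thread while waiting.
// cudaDeviceScheduleBlockingSync: block the thread until the device signals completion.
int main() {
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleSpin);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Note that the flag should be set before the CUDA context for the device is initialized, which is why the PR places the call in device setup code.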
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
#endif // GGML_CUDA_FORCE_CUBLAS
GGML_LOG_INFO("%s: found %d " GGML_CUDA_NAME " devices:\n", __func__, info.device_count);

bool is_cc121 = false;
```
Suggested change:

```diff
- bool is_cc121 = false;
+ bool device_schedule_spin = false;
```
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, id));

is_cc121 |= prop.major == 12 && prop.minor == 1;
```
Suggested change:

```diff
- is_cc121 |= prop.major == 12 && prop.minor == 1;
+ // Depending on the CUDA drivers the DGX Spark can run with a device schedule that prefers low power use.
+ // However, as it is plugged into a wall it should prefer maximum performance.
+ // TODO: add a check for a future driver version where this is fixed to avoid thrashing for > 20 CUDA contexts.
+ device_schedule_spin = prop.major == 12 && prop.minor == 1;
```
ggml/src/ggml-cuda/ggml-cuda.cu (outdated)

```cpp
// Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
// This fix is temporary, as the strategy will be the default in later drivers.
if (is_cc121) {
    CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
}
```
Suggested change:

```diff
- // Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
- // This fix is temporary, as the strategy will be the default in later drivers.
- if (is_cc121) {
-     CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
- }
+ if (device_schedule_spin) {
+     CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
+ }
```
In that case, I would say to do away with the boolean and just call
I specified the comment a bit more and removed the boolean check. I wanted to avoid multiple calls, but as sm121 is an iGPU, only one device should match the condition. Thank you for the feedback.
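Reconstructed from the suggestions above, the resulting device-enumeration logic would look roughly like this (a sketch, not the exact committed code; `CUDA_CHECK`, `info`, and the surrounding loop come from ggml-cuda.cu):

```cuda
for (int id = 0; id < info.device_count; id++) {
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, id));
    // sm121 (e.g. DGX Spark) is an iGPU, so at most one device matches
    // and cudaSetDeviceFlags is called at most once.
    if (prop.major == 12 && prop.minor == 1) {
        CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
    }
}
```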
Co-authored-by: Johannes Gäßler <[email protected]>
* CUDA set scheduling strategy to spinning for cc121
* Using prop.major and prop.minor, include HIP and MUSA
* Exclude HIP and MUSA
* Remove trailing whitespace

  Co-authored-by: Johannes Gäßler <[email protected]>
* Remove empty line

  Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* origin/master:
  * Add server-driven parameter defaults and syncing (ggml-org#16515)
  * metal: optimise `GGML_OP_SUM` (ggml-org#16559)
  * server : fix img token logs (ggml-org#16595)
  * llama-quant: add support for mmproj (ggml-org#16592)
  * CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
  * server : fix mtmd checkpoints (ggml-org#16591)
  * metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  * vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
  * CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
  * vulkan: Support FA with K/V in F32 (ggml-org#16543)
  * vulkan: Improve build time for MSVC (ggml-org#16545)
  * CUDA: enable FA for FP32 KV cache (ggml-org#16546)
  * CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  * CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  * cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  * server : dynamic token limit for prompt cache (ggml-org#16560)
PR #16308 sets the device property `integrated = false` by default to disable host buffers. While this change is needed, the additional memory copies introduce multiple `cudaStreamSynchronize` calls. With the default scheduling strategy, each synchronization incurs a small latency between kernel termination and the release of the CPU thread on sm121, leading to a ~15% performance regression on gpt-oss-20b-mxfp4. This can be fixed by setting the scheduling strategy to `cudaDeviceScheduleSpin`: without the per-synchronization latency, performance is roughly equal to handling the device as integrated.

This code change checks whether the device has compute capability 12.1 and then sets the CUDA flag `cudaDeviceScheduleSpin`.
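The synchronization latency described above can be observed with a small micro-benchmark (a hypothetical setup for illustration, not part of the PR): launch an empty kernel repeatedly and time `cudaStreamSynchronize` with and without the spin flag.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <chrono>

__global__ void empty_kernel() {}

int main() {
    // Toggle this line to compare against the default schedule (cudaDeviceScheduleAuto).
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up to exclude one-time context and module setup costs.
    empty_kernel<<<1, 1, 0, stream>>>();
    cudaStreamSynchronize(stream);

    const int iters = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        empty_kernel<<<1, 1, 0, stream>>>();
        cudaStreamSynchronize(stream);
    }
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg launch+sync latency: %.2f us\n", us);

    cudaStreamDestroy(stream);
    return 0;
}
```

On a device affected by the regression, the spin schedule should report a noticeably lower per-iteration latency than the default; absolute numbers depend on the driver and hardware.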