
Conversation

JTischbein (Contributor)

PR #16308 sets the device property integrated = false by default to disable host buffers. While this change is needed, the additional memory copies introduce multiple cudaStreamSynchronize calls. With the default scheduling strategy, each synchronization incurs a small latency between kernel termination and the freeing of the CPU thread on sm121, leading to a ~15% performance regression on gpt-oss-20b-mxfp4. This can be fixed by setting the default scheduling strategy to cudaDeviceScheduleSpin. Without the per-synchronization latency, performance is roughly equal to handling the device as integrated.

This code change checks whether the device has compute capability 12.1 and then sets the CUDA flag cudaDeviceScheduleSpin.
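
To make the latency claim concrete, here is a minimal micro-benchmark sketch of the host-side wake-up cost (assumptions: a single device, the default stream, an illustrative `noop` kernel and iteration count; the measured time also includes kernel-launch overhead):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    // Set the flag before the first launch so it applies from context creation.
    // Comment this line out to measure the default (cudaDeviceScheduleAuto).
    cudaSetDeviceFlags(cudaDeviceScheduleSpin);

    noop<<<1, 1>>>();        // warm-up launch, triggers context creation
    cudaDeviceSynchronize();

    const int iters = 1000;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {
        noop<<<1, 1>>>();
        cudaStreamSynchronize(0); // the host-side wait whose wake-up latency adds up
    }
    const auto t1 = std::chrono::steady_clock::now();

    printf("avg launch+sync round-trip: %.2f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count() / iters);
    return 0;
}

With spin scheduling the host thread busy-waits instead of being descheduled, which is what removes the per-synchronization latency described above.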

@JTischbein JTischbein requested a review from slaren as a code owner October 15, 2025 06:32
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 15, 2025
@JohannesGaessler JohannesGaessler (Collaborator) commented Oct 15, 2025

From cuda_runtime_api.h:

 * - ::cudaDeviceScheduleAuto: The default value if the \p flags parameter is
 * zero, uses a heuristic based on the number of active CUDA contexts in the
 * process \p C and the number of logical processors in the system \p P. If
 * \p C \> \p P, then CUDA will yield to other OS threads when waiting for the
 * device, otherwise CUDA will not yield while waiting for results and
 * actively spin on the processor. Additionally, on Tegra devices,
 * ::cudaDeviceScheduleAuto uses a heuristic based on the power profile of
 * the platform and may choose ::cudaDeviceScheduleBlockingSync for low-powered
 * devices.
 * - ::cudaDeviceScheduleSpin: Instruct CUDA to actively spin when waiting for
 * results from the device. This can decrease latency when waiting for the
 * device, but may lower the performance of CPU threads if they are performing
 * work in parallel with the CUDA thread.
 * - ::cudaDeviceScheduleBlockingSync: Instruct CUDA to block the CPU thread 
 * on a synchronization primitive when waiting for the device to finish work.
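
For reference, the active scheduling mode can be read back at runtime; a minimal sketch using the standard runtime call `cudaGetDeviceFlags` and the `cudaDeviceScheduleMask` constant (output labels are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    unsigned int flags = 0;
    if (cudaGetDeviceFlags(&flags) != cudaSuccess) {
        return 1;
    }
    switch (flags & cudaDeviceScheduleMask) {
        case cudaDeviceScheduleSpin:         printf("spin\n");          break;
        case cudaDeviceScheduleYield:        printf("yield\n");         break;
        case cudaDeviceScheduleBlockingSync: printf("blocking sync\n"); break;
        default:                             printf("auto\n");          break;
    }
    return 0;
}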

Is this change still needed if you set the operating system's power settings to something like "prefer maximum performance"?

@ggerganov ggerganov (Member)

I can confirm that this patch improves generation performance on NVIDIA DGX Spark with gpt-oss:

Model                    Test     t/s master   t/s pr/16585   Speedup
gpt-oss 120B MXFP4 MoE   pp2048      1640.04        1694.05      1.03
gpt-oss 120B MXFP4 MoE   tg32          38.31          45.78      1.19
gpt-oss 20B MXFP4 MoE    pp2048      3610.01        3614.70      1.00
gpt-oss 20B MXFP4 MoE    tg32          54.72          79.95      1.46

@JTischbein JTischbein force-pushed the cuda_device_scheduling_spin branch from a67a2c0 to a33e305 Compare October 15, 2025 08:40
@ggerganov ggerganov (Member) left a comment

// CUBLAS_CHECK(cublasLoggerConfigure(1, 1, 0, nullptr));

// Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
// This fix is temporary, as the strategy will be the default in later drivers.

Suggested change (removes trailing whitespace):
- // This fix is temporary, as the strategy will be the default in later drivers. 
+ // This fix is temporary, as the strategy will be the default in later drivers.

@JohannesGaessler JohannesGaessler (Collaborator) left a comment

To my knowledge there are no mobile devices using compute capability 12.1, so it should be fine to set cudaDeviceScheduleSpin unconditionally.

#endif // GGML_CUDA_FORCE_CUBLAS
GGML_LOG_INFO("%s: found %d " GGML_CUDA_NAME " devices:\n", __func__, info.device_count);

bool is_cc121 = false;

Suggested change:
- bool is_cc121 = false;
+ bool device_schedule_spin = false;

cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, id));

is_cc121 |= prop.major == 12 && prop.minor == 1;

Suggested change:
- is_cc121 |= prop.major == 12 && prop.minor == 1;
+ // Depending on the CUDA drivers the DGX Spark can run with a device schedule that prefers low power use.
+ // However, as it is plugged into a wall it should prefer maximum performance.
+ // TODO: add a check for a future driver version where this is fixed to avoid thrashing for > 20 CUDA contexts.
+ device_schedule_spin = prop.major == 12 && prop.minor == 1;

Comment on lines 300 to 304
// Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
// This fix is temporary, as the strategy will be the default in later drivers.
if (is_cc121) {
CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
}

Suggested change:
- // Setting device scheduling strategy for iGPUs to "spinning" to avoid delays in cuda synchronize calls.
- // This fix is temporary, as the strategy will be the default in later drivers.
- if (is_cc121) {
-     CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
- }
+ if (device_schedule_spin) {
+     CUDA_CHECK(cudaSetDeviceFlags(cudaDeviceScheduleSpin));
+ }

@JohannesGaessler JohannesGaessler (Collaborator)

HIP CI

In that case, I would say to just do away with the boolean and call cudaSetDeviceFlags where you are currently setting it.

@JTischbein JTischbein (Contributor, Author)

I made the comment a bit more specific and removed the boolean check. I wanted to avoid multiple calls, but since sm121 is an iGPU, only one device should match the condition. Thank you for the feedback.
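
Putting the thread together, the merged change plausibly reduces to the following shape (a standalone approximation, not the exact merged diff: `CUDA_CHECK` is replaced by plain error checks, the surrounding init loop is assumed, and per the commit history the real code is excluded for HIP and MUSA builds):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) {
        return 1;
    }
    for (int id = 0; id < device_count; ++id) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, id) != cudaSuccess) {
            continue;
        }
        // sm121 (e.g. the DGX Spark iGPU) can prefer low power with the default
        // schedule; spinning avoids the per-synchronization wake-up latency.
        if (prop.major == 12 && prop.minor == 1) {
            cudaSetDevice(id); // flags apply to the current device
            cudaSetDeviceFlags(cudaDeviceScheduleSpin);
        }
    }
    return 0;
}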

JTischbein and others added 2 commits October 15, 2025 12:49
Co-authored-by: Johannes Gäßler <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
@ggerganov ggerganov merged commit 5acd455 into ggml-org:master Oct 15, 2025
63 of 64 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* CUDA set scheduling strategy to spinning for cc121

* Using prop.major and prop.minor, include HIP and MUSA

* Exclude HIP and MUSA

* Remove trailing whitespace

Co-authored-by: Johannes Gäßler <[email protected]>

* Remove empty line

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 15, 2025
* origin/master:
Add server-driven parameter defaults and syncing (ggml-org#16515)
metal: optimise `GGML_OP_SUM` (ggml-org#16559)
server : fix img token logs (ggml-org#16595)
llama-quant: add support for mmproj (ggml-org#16592)
CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
server : fix mtmd checkpoints (ggml-org#16591)
metal : avoid using Metal's gpuAddress property (ggml-org#16576)
vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
vulkan: Support FA with K/V in F32 (ggml-org#16543)
vulkan: Improve build time for MSVC (ggml-org#16545)
CUDA: enable FA for FP32 KV cache (ggml-org#16546)
CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
server : dynamic token limit for prompt cache (ggml-org#16560)