0ax1 commented Jan 27, 2026

The previous copy-to-host operation was synchronous but did not wait for earlier operations on the stream to complete. cuMemcpyDtoH_v2 runs on the default stream and therefore raced with the stream associated with the buffer handle.
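
The race described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the code changed in this PR and not the CUDA API: `Stream`, `fill`, `copy_to_host_racy`, and `copy_to_host_ordered` are hypothetical names, and the "stream" is just a queue of deferred operations. On real hardware the outcome depends on timing; the model makes the worst case deterministic by never running queued work until the stream is synchronized (the role `cuStreamSynchronize` plays in the driver API).

```python
from collections import deque


class Stream:
    """Toy stand-in for a GPU stream: work is queued and runs later, in order."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, op):
        # Asynchronous launch: record the operation, do not run it yet.
        self._pending.append(op)

    def synchronize(self):
        # Run everything queued so far, in submission order.
        while self._pending:
            self._pending.popleft()()


def fill(buf, value):
    # Stand-in for a kernel that writes to device memory.
    for i in range(len(buf)):
        buf[i] = value


def copy_to_host_racy(device_buf, stream):
    # Blocking copy that ignores the buffer's stream: it reads whatever
    # is in "device memory" right now.
    return list(device_buf)


def copy_to_host_ordered(device_buf, stream):
    # Wait for the buffer's stream first, then copy.
    stream.synchronize()
    return list(device_buf)


def run(copy_fn):
    device_buf = [0] * 4
    stream = Stream()
    stream.enqueue(lambda: fill(device_buf, 7))  # queued, not yet executed
    return copy_fn(device_buf, stream)


stale = run(copy_to_host_racy)
fresh = run(copy_to_host_ordered)
print(stale)  # [0, 0, 0, 0]
print(fresh)  # [7, 7, 7, 7]
```

The only difference between the two copy functions is the wait on the buffer's stream before reading. That ordering is what a blocking copy does not provide on its own when the producing work was queued on a different stream.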

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
0ax1 requested a review from joseph-isaacs January 27, 2026 11:28
0ax1 added the changelog/fix (A bug fix) label Jan 27, 2026
codspeed-hq bot commented Jan 27, 2026

Merging this PR will degrade performance by 21.57%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 11 improved benchmarks
❌ 7 regressed benchmarks
✅ 1247 untouched benchmarks
⏩ 1219 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | u8_FoR[10M] | 7.3 µs | 6.6 µs | +11.78% |
| Simulation | canonical_into_non_nullable[(10000, 1, 0.0)] | 36.1 µs | 24.8 µs | +45.44% |
| Simulation | canonical_into_non_nullable[(10000, 1, 0.01)] | 41.1 µs | 31.3 µs | +31.33% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.01)] | 306.1 µs | 221.3 µs | +38.31% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.1)] | 471.6 µs | 380.8 µs | +23.85% |
| Simulation | canonical_into_non_nullable[(10000, 1, 0.1)] | 57.1 µs | 47.1 µs | +21.12% |
| Simulation | canonical_into_non_nullable[(10000, 10, 0.0)] | 278.8 µs | 194.3 µs | +43.5% |
| Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.3 ms | 4.9 ms | -12.42% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.0)] | 32.2 µs | 41.1 µs | -21.57% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.1)] | 54.6 µs | 63.8 µs | -14.54% |
| Simulation | into_canonical_non_nullable[(10000, 1, 0.01)] | 38.4 µs | 47.3 µs | -18.75% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.01)] | 310.6 µs | 229.3 µs | +35.47% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.0)] | 283.2 µs | 201.6 µs | +40.49% |
| Simulation | into_canonical_nullable[(10000, 10, 0.0)] | 456.5 µs | 538.5 µs | -15.23% |
| Simulation | into_canonical_nullable[(10000, 10, 0.1)] | 717.4 µs | 629.3 µs | +14.01% |
| Simulation | into_canonical_nullable[(10000, 100, 0.1)] | 6.1 ms | 6.9 ms | -12.03% |
| Simulation | into_canonical_non_nullable[(10000, 10, 0.1)] | 472.8 µs | 385.1 µs | +22.78% |
| Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 4.3 ms | 5.1 ms | -14.47% |

Comparing ad/fix-sync-copy (63ba235) with develop (78bec43)

Footnotes

  1. 1219 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

0ax1 enabled auto-merge (squash) January 27, 2026 11:44
0ax1 merged commit ab95dc9 into develop Jan 27, 2026
44 of 46 checks passed
0ax1 deleted the ad/fix-sync-copy branch January 27, 2026 11:55