
Conversation

@christiangnrd (Member) commented on Jul 25, 2025

Reusing the same shared event reduces the number of allocations.

Hopefully this slightly improves performance, but more importantly it is in line with how I believe queue synchronization will have to work with Metal 4.

Edit: performance is slightly worse.
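For context, the approach keeps one `MTLSharedEvent` per command queue and synchronizes by signaling and waiting on a monotonically increasing value. A minimal sketch, assuming Metal.jl's `MTL` bindings and the `queue_event` helper from this PR; the function name and counter bookkeeping here are simplified illustrations, not the exact implementation:

```julia
# Sketch: synchronize a queue by reusing its per-queue MTLSharedEvent.
# `queue_event` lazily creates one shared event per queue (task-local),
# so no new event object is allocated on each synchronization.
function synchronize_queue(queue::MTLCommandQueue)   # hypothetical name
    ev = queue_event(queue)              # reused MTLSharedEvent for this queue
    val = ev.signaledValue + 1           # next value in the monotonic sequence
    cmdbuf = MTLCommandBuffer(queue)
    MTL.encode_signal!(cmdbuf, ev, val)  # GPU signals `val` once the queue drains
    commit!(cmdbuf)
    MTL.waitUntilSignaledValue(ev, val)  # CPU blocks until that value is signaled
    return
end
```

Because the queue executes command buffers in first-in-first-out order, waiting on the most recently signaled value is equivalent to waiting for everything committed before it.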

@christiangnrd changed the title to Synchronize using MTLSharedEvents on Jul 25, 2025
@github-actions (Contributor) commented:

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/lib/mtl/events.jl b/lib/mtl/events.jl
index 374f4285..8d2345ed 100644
--- a/lib/mtl/events.jl
+++ b/lib/mtl/events.jl
@@ -29,9 +29,11 @@ function MTLSharedEvent(dev::MTLDevice)
     return obj
 end
 
-function waitUntilSignaledValue(ev::MTLSharedEvent, value, timeoutMS=typemax(UInt64))
-    @objc [ev::id{MTLSharedEvent} waitUntilSignaledValue:value::UInt64
-                        timeoutMS:timeoutMS::UInt64]::Bool
+function waitUntilSignaledValue(ev::MTLSharedEvent, value, timeoutMS = typemax(UInt64))
+    return @objc [
+        ev::id{MTLSharedEvent} waitUntilSignaledValue:value::UInt64
+        timeoutMS:timeoutMS::UInt64
+    ]::Bool
 end
 
 ## shared event handle
diff --git a/src/state.jl b/src/state.jl
index 0782db68..918731c4 100644
--- a/src/state.jl
+++ b/src/state.jl
@@ -61,7 +61,7 @@ end
 Return the `MTLSharedEvent` used to synchronize a queue
 """
 function queue_event(queue::MTLCommandQueue)
-    get!(task_local_storage(), (:MTLSharedEvent, queue)) do
+    return get!(task_local_storage(), (:MTLSharedEvent, queue)) do
         MTLSharedEvent(queue.device)
     end::MTLSharedEvent
 end
@@ -82,7 +82,7 @@ First-In-First-Out manner, this synchronizes the GPU.
     cmdbuf = MTLCommandBuffer(queue)
     MTL.encode_signal!(cmdbuf, ev, val)
     commit!(cmdbuf)
-    MTL.waitUntilSignaledValue(ev,val)
+    MTL.waitUntilSignaledValue(ev, val)
     return
 end
 

@github-actions bot left a comment:

Metal Benchmarks

Benchmark suite Current: 4d18869 Previous: 422c43f Ratio
latency/precompile 9811076333 ns 9811445750 ns 1.00
latency/ttfp 3976837999.5 ns 3973553937 ns 1.00
latency/import 1277901146 ns 1273914917 ns 1.00
integration/metaldevrt 820625 ns 919250 ns 0.89
integration/byval/slices=1 1530333 ns 1620625 ns 0.94
integration/byval/slices=3 8915625 ns 9296666.5 ns 0.96
integration/byval/reference 1523333 ns 1601437.5 ns 0.95
integration/byval/slices=2 2546958 ns 2670791 ns 0.95
kernel/indexing 593166.5 ns 668834 ns 0.89
kernel/indexing_checked 578084 ns 663708 ns 0.87
kernel/launch 8000 ns 9042 ns 0.88
array/construct 6000 ns 6125 ns 0.98
array/broadcast 566750 ns 635500 ns 0.89
array/random/randn/Float32 807937 ns 815104 ns 0.99
array/random/randn!/Float32 608417 ns 635375 ns 0.96
array/random/rand!/Int64 543917 ns 572750 ns 0.95
array/random/rand!/Float32 576000 ns 597958 ns 0.96
array/random/rand/Int64 777688 ns 733583 ns 1.06
array/random/rand/Float32 621020.5 ns 620604 ns 1.00
array/accumulate/Int64/1d 1285687.5 ns 1328542 ns 0.97
array/accumulate/Int64/dims=1 1813167 ns 1915812.5 ns 0.95
array/accumulate/Int64/dims=2 2151708 ns 2252917 ns 0.96
array/accumulate/Int64/dims=1L 11634083.5 ns 11629709 ns 1.00
array/accumulate/Int64/dims=2L 9686812 ns 9677895.5 ns 1.00
array/accumulate/Float32/1d 1121125 ns 1230791 ns 0.91
array/accumulate/Float32/dims=1 1550417 ns 1648666.5 ns 0.94
array/accumulate/Float32/dims=2 1865291 ns 1970042 ns 0.95
array/accumulate/Float32/dims=1L 9850937 ns 9975938 ns 0.99
array/accumulate/Float32/dims=2L 7285729 ns 7387104.5 ns 0.99
array/reductions/reduce/Int64/1d 1432625 ns 1582500 ns 0.91
array/reductions/reduce/Int64/dims=1 1045959 ns 1150791.5 ns 0.91
array/reductions/reduce/Int64/dims=2 1170750 ns 1302979 ns 0.90
array/reductions/reduce/Int64/dims=1L 2074667 ns 2153834 ns 0.96
array/reductions/reduce/Int64/dims=2L 3416958 ns 3552729 ns 0.96
array/reductions/reduce/Float32/1d 988125 ns 1047958 ns 0.94
array/reductions/reduce/Float32/dims=1 797437.5 ns 875334 ns 0.91
array/reductions/reduce/Float32/dims=2 745167 ns 801083 ns 0.93
array/reductions/reduce/Float32/dims=1L 1733083 ns 1800646 ns 0.96
array/reductions/reduce/Float32/dims=2L 1754125 ns 1880520.5 ns 0.93
array/reductions/mapreduce/Int64/1d 1445437.5 ns 1555709 ns 0.93
array/reductions/mapreduce/Int64/dims=1 1057979.5 ns 1139791 ns 0.93
array/reductions/mapreduce/Int64/dims=2 1179167 ns 1276833.5 ns 0.92
array/reductions/mapreduce/Int64/dims=1L 2092312 ns 2162875 ns 0.97
array/reductions/mapreduce/Int64/dims=2L 3438542 ns 3576375 ns 0.96
array/reductions/mapreduce/Float32/1d 1003208 ns 1077521 ns 0.93
array/reductions/mapreduce/Float32/dims=1 783083 ns 869417 ns 0.90
array/reductions/mapreduce/Float32/dims=2 747333.5 ns 796458 ns 0.94
array/reductions/mapreduce/Float32/dims=1L 1731000 ns 1791958 ns 0.97
array/reductions/mapreduce/Float32/dims=2L 1767250 ns 1892687.5 ns 0.93
array/private/copyto!/gpu_to_gpu 629667 ns 654917 ns 0.96
array/private/copyto!/cpu_to_gpu 798334 ns 822333 ns 0.97
array/private/copyto!/gpu_to_cpu 777667 ns 815083 ns 0.95
array/private/iteration/findall/int 1641458 ns 1662667 ns 0.99
array/private/iteration/findall/bool 1443125 ns 1504500 ns 0.96
array/private/iteration/findfirst/int 1817375 ns 1894188 ns 0.96
array/private/iteration/findfirst/bool 1792250 ns 1818166 ns 0.99
array/private/iteration/scalar 3896583.5 ns 4215375 ns 0.92
array/private/iteration/logical 2543354 ns 2643833 ns 0.96
array/private/iteration/findmin/1d 1844125 ns 1959625 ns 0.94
array/private/iteration/findmin/2d 1411020.5 ns 1525062.5 ns 0.93
array/private/copy 550250 ns 563000 ns 0.98
array/shared/copyto!/gpu_to_gpu 85792 ns 83167 ns 1.03
array/shared/copyto!/cpu_to_gpu 86791 ns 83500 ns 1.04
array/shared/copyto!/gpu_to_cpu 82958 ns 82792 ns 1.00
array/shared/iteration/findall/int 1647291 ns 1691958 ns 0.97
array/shared/iteration/findall/bool 1467375 ns 1524959 ns 0.96
array/shared/iteration/findfirst/int 1368521 ns 1420354.5 ns 0.96
array/shared/iteration/findfirst/bool 1362166.5 ns 1403958 ns 0.97
array/shared/iteration/scalar 209375 ns 158834 ns 1.32
array/shared/iteration/logical 2354042 ns 2536750 ns 0.93
array/shared/iteration/findmin/1d 1400750 ns 1418937.5 ns 0.99
array/shared/iteration/findmin/2d 1423666 ns 1545646 ns 0.92
array/shared/copy 251708 ns 246937.5 ns 1.02
array/permutedims/4d 2434292 ns 2531354 ns 0.96
array/permutedims/2d 1169312.5 ns 1248000 ns 0.94
array/permutedims/3d 1723729.5 ns 1810958 ns 0.95
metal/synchronization/stream 19000 ns 14875 ns 1.28
metal/synchronization/context 20292 ns 15584 ns 1.30

This comment was automatically generated by workflow using github-action-benchmark.

@christiangnrd christiangnrd marked this pull request as draft July 26, 2025 20:00
@maleadt (Member) left a comment:

LGTM.

Is there a Metal3->4 migration document anywhere indicating we should switch from MTLCommandBuffer::waitUntilCompleted to a shared event?

@christiangnrd christiangnrd mentioned this pull request Jul 29, 2025
@christiangnrd (Member, Author) replied:

> Is there a Metal3->4 migration document anywhere indicating we should switch from MTLCommandBuffer::waitUntilCompleted to a shared event?

Nope. I couldn't find anything in Apple's documentation that mentions full command-queue synchronization. It seems they're focused on more granular synchronization with the new API.

The benchmarks seem to indicate that this is slower for Metal 3, so maybe we leave it as-is for Metal 3 and only transition to event-based synchronization for Metal 4 command queues. What made me believe this is how they expect us to do it with Metal 4 is that MTL4CommandQueue has a new signalEvent method, which allows signaling without creating a command buffer.
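That queue-level signaling could make the event-based approach cheaper under Metal 4, since the signal-only command buffer disappears. A hypothetical sketch, assuming a `signalEvent:value:` selector on MTL4CommandQueue; the `@objc` call shape and the helper names are assumptions and have not been tested against the Metal 4 API:

```julia
# Hypothetical Metal 4 variant: the queue signals the event directly,
# so no throwaway MTLCommandBuffer is needed just to enqueue the signal.
function synchronize_mtl4(queue)  # `queue` assumed to wrap an MTL4CommandQueue
    ev = queue_event(queue)            # same reused per-queue MTLSharedEvent
    val = ev.signaledValue + 1
    @objc [queue::id{MTL4CommandQueue} signalEvent:ev::id{MTLSharedEvent}
                                       value:val::UInt64]::Nothing
    MTL.waitUntilSignaledValue(ev, val)
    return
end
```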

I'm closing this PR to avoid polluting the PR list, and I linked to this PR in #612 for when I (or someone else) get around to finishing Metal 4 support.

We can always reopen if someone thinks we should use MTLSharedEvents for Metal 3 synchronization. Locally, using a crude @time Metal.synchronize() in the REPL, shared events gave me similar performance with more consistent timings than waitUntilCompleted, but these benchmarks show a regression, so I'll trust the benchmarks for now.

@maleadt (Member) commented on Jul 30, 2025

> these benchmarks show a regression so I'll trust the benchmarks for now

Synchronization itself seems slower, but all other operations seem slightly faster. Or is this just benchmark noise? It's awfully consistent. And I could imagine the microbenchmark being slower while real applications benefit.

@christiangnrd christiangnrd reopened this Aug 1, 2025
@christiangnrd (Member, Author) replied:

> all other operations seem slightly faster

Good point. I'll rerun, and if the results are similar I'll merge.

@christiangnrd christiangnrd marked this pull request as ready for review August 1, 2025 07:31
codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 80.96%. Comparing base (422c43f) to head (4d18869).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/state.jl 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
- Coverage   80.97%   80.96%   -0.01%     
==========================================
  Files          61       61              
  Lines        2712     2722      +10     
==========================================
+ Hits         2196     2204       +8     
- Misses        516      518       +2     

@christiangnrd christiangnrd merged commit 1942968 into main Aug 1, 2025
7 checks passed
@christiangnrd christiangnrd deleted the events branch August 1, 2025 12:38
christiangnrd added a commit that referenced this pull request Aug 1, 2025
christiangnrd added a commit that referenced this pull request Aug 2, 2025
