You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Apple's Arm CPUs configured with a single NUMA node and Arm's native support for "weak memory model" make it the perfect ground for studying the cost of various concurrency synchronization primitives.
356
+
For that, we can launch the benchmark with a tiny input, such as just 1 scalar per core, and measure the overall latency of dispatching all threads, blocking, and afterwards aggregating partial results.
357
+
Assuming, some of the work scheduling happens at a cache-line granularity, instead of 1 scalar per core, we take 1 cache-line per core.
The timing methods between C++ and Rust implementation of the benchmark differ, but the relative timings are as expected, just like on [Graviton 4](#aws-graviton4-c8gmetal-24xl).
373
+
What's different, is the core design and their count!
374
+
With only 12 cores, Fork Union takes the lead over Taskflow and OpenMP.
375
+
In Rust, it's 10x faster than [Rayon](https://github.com/rayon-rs/rayon), but slower than [Async-Executor](https://github.com/smol-rs/async-executor).
0 commit comments