@tzcnt tzcnt commented Jan 2, 2026

Numbers are from my 13600k machine, using 14 threads.

| bench | v3.10.0 | v4.0.0 | v4.0.0 (with TaskGroup) |
| --- | --- | --- | --- |
| fib(39) | 786850 us | 1123443 us | 549697 us |
| skynet(8) | 553122 us | 719660 us | 557148 us |
| nqueens(14) | 681552 us | 753478 us | 371835 us |
| matmul(2048) | 91774 us | 92967 us | 90775 us |

@tsung-wei-huang nice improvements with the new TaskGroup. Something interesting I noticed: when I upgraded to v4.0.0 without changing the code, it ran a fair bit slower. However, when using the new TaskGroup, it's much faster. I haven't had a chance to test on any other machines yet to determine whether that is a general issue or just something with this machine.

Feel free to review the updated implementations to ensure that I've got the correct usage for best performance.


tsung-wei-huang commented Jan 2, 2026

Hi @tzcnt, thank you for the pull request! I will add it to our benchmarks too 👍 Yes, tf::TaskGroup is a better choice for recursive task parallelism. I am not sure exactly what happened to the runtime-based implementation, but I suspect it's due to the change of notifier.

I am curious whether you will see any difference when you compile with -DTF_ENABLE_ATOMIC_NOTIFIER=1 (the default is off, since some of our users observed strange behavior due to a known gcc libstdc++ bug). Let me know! For now, you can stick with the tf::TaskGroup-based implementation.


tzcnt commented Jan 9, 2026

I needed to edit this line https://github.com/taskflow/taskflow/blob/ce3a65c24aba10dbd608877d02b42b5091cc5a02/taskflow/core/worker.hpp#L37 to build with TF_ENABLE_ATOMIC_NOTIFIER (it's missing a semicolon). Unfortunately it didn't seem to make a difference in performance either way. For all tests, v4.0 was slower than v3.10 when using the old implementation of the benchmark. When updating the benchmark to use TaskGroup, it then became faster.

I use std::atomic::wait() / std::atomic::notify() as my notifier in TooManyCooks and haven't observed any issues with it using libstdc++. For these benchmarks specifically, the threads should be awake and able to find work for nearly the entire benchmark run, so it has minimal impact.


tzcnt commented Jan 9, 2026

I've updated the full benchmark results with v4.0 and the new implementations. It's faster or unchanged on every machine + benchmark combo, with one exception: on my 64-core EPYC, fib(39) became slower on v4.0 regardless of the implementation. Note that this benchmark has very high run-to-run variance regardless of implementation or version, but the variance seems to have become much worse now.

- v3.10: 250-520 ms; the mode is 250 ms, with slow runs only occasional
- v4.0: 680-1580 ms; the mode is ~800 ms
- v4.0 w/ TaskGroup: 315-380 ms; the mode is ~350 ms

This particular CPU has multiple internal latency domains with a high penalty for out-of-domain access, so I suspect the high number of atomic operations in fib is the cause.
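The contention hypothesis above can be made concrete with a small sketch (hypothetical microbenchmark, not part of the PR): many threads doing `fetch_add` on one shared atomic force its cache line to migrate between cores, and on a multi-chiplet part like the EPYC 7742 that traffic crosses latency domains, whereas sharding the counter per thread keeps each line local.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int kThreads = 4;       // illustrative; real contention grows with cores
constexpr int kIters   = 100000;

// All threads hammer one shared atomic: the cache line ping-pongs.
size_t contended() {
  std::atomic<size_t> counter{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& th : ts) th.join();
  return counter.load();
}

// Each thread owns a cache-line-aligned shard: no cross-core traffic
// until the final sum.
size_t sharded() {
  struct alignas(64) Shard { std::atomic<size_t> v{0}; };
  std::vector<Shard> shards(kThreads);
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&shards, t] {
      for (int i = 0; i < kIters; ++i)
        shards[t].v.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& th : ts) th.join();
  size_t total = 0;
  for (auto& s : shards) total += s.v.load();
  return total;
}
```

Both functions produce the same count; only the cache-line traffic differs, which is the kind of cost that would dominate a fine-grained benchmark like fib on a many-CCX machine.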


tzcnt commented Jan 9, 2026

Edit: I just tried running fib(39) on v3.10, on the EPYC 7742, with a modification that replaces the second call to rt.silent_async with an inline call, as per your TaskGroup documentation, resulting in this:

  size_t x, y;
  // Spawn only the first recursive call; run the second inline.
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  y = fib(n - 2, rt);
  rt.corun_all();  // wait for the spawned task to finish

On v3.10 that reduces the runtime from ~250ms down to ~145ms, which is a huge win. However, when I run the same code on v4.0.0 the runtime blows up to 800-1200ms, whereas the TaskGroup-based implementation runs at 350ms.
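The "spawn one half, compute the other inline" shape above can be sketched outside Taskflow using plain `std::async` (a stand-in for `rt.silent_async`, used here purely for illustration; the real code passes `tf::Runtime&` down the recursion). Forking only the first recursive call roughly halves the number of spawned tasks, which is where the speedup comes from.

```cpp
#include <cstddef>
#include <future>

// Recursive fib with the tail-optimized fork/join shape discussed above.
size_t fib(size_t n) {
  if (n < 2) return n;
  // Fork the first recursive call onto another thread...
  auto x = std::async(std::launch::async, fib, n - 1);
  // ...and compute the second inline on the current thread.
  size_t y = fib(n - 2);
  return x.get() + y;  // join, analogous to rt.corun_all()
}
```

Note `std::async` spawns a thread per call, so this is only sensible for small `n`; a work-stealing runtime like Taskflow avoids that overhead, which is the whole point of the benchmark.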


tzcnt commented Jan 9, 2026

Update: I tried again with -DTF_ENABLE_ATOMIC_NOTIFIER on v4.0.0, and this reduced the runtime of the TaskGroup + inline-call fib down to ~200ms, although it did not resolve the regression in the runtime + inline-call implementation.

I'll re-run all the benchmarks with -DTF_ENABLE_ATOMIC_NOTIFIER enabled on this machine then...

@tsung-wei-huang

Thank you for the great feedback! Yes, the tail optimization makes a big difference:

  size_t x, y;
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  y = fib(n - 2, rt);
  rt.corun_all();


tzcnt commented Jan 9, 2026

I've completed the update and re-bench. -DTF_ENABLE_ATOMIC_NOTIFIER is now the default bench configuration at https://fleetcode.com/runtime-benchmarks/ .

You can view a comparison of 3.10 vs 4.0, with and without TF_ENABLE_ATOMIC_NOTIFIER, here (sorry about the structure - my infra isn't set up to show multiple variants of the same runtime on the same graph at the moment, so you'll have to tab through the "machines" instead): https://fleetcode.com/runtime-benchmarks/bench_tf/

@tzcnt tzcnt merged commit 8d1934c into main Jan 9, 2026