@tzcnt tzcnt commented Jan 2, 2026

Numbers are from my 13600k machine, using 14 threads.

| bench | v3.10.0 | v4.0.0 | v4.0.0 (with TaskGroup) |
| --- | --- | --- | --- |
| fib(39) | 786850 us | 1123443 us | 549697 us |
| skynet(8) | 553122 us | 719660 us | 557148 us |
| nqueens(14) | 681552 us | 753478 us | 371835 us |
| matmul(2048) | 91774 us | 92967 us | 90775 us |

@tsung-wei-huang nice improvements with the new TaskGroup. Something interesting I noticed: when I upgraded to v4.0.0 without changing the code, it ran a fair bit slower. However, when using the new TaskGroup, it's much faster. I haven't had a chance to test on any other machines yet to determine whether that is a general issue or just something with this machine.

Feel free to review the updated implementations to ensure that I've got the correct usage for best performance.


tsung-wei-huang commented Jan 2, 2026

Hi @tzcnt, thank you for the pull request! I will add it to our benchmarks too 👍 Yes, tf::TaskGroup is a better choice for recursive task parallelism. I am not sure exactly what happened to the runtime-based implementation, but I suspect it's due to the change of notifier.

I am curious whether you will see any difference when you compile with -DTF_ENABLE_ATOMIC_NOTIFIER=1 (the default is off, since some of our users observed strange behavior due to a known gcc libstdc++ bug). Let me know! For now, you can stick with the tf::TaskGroup-based implementation.


tzcnt commented Jan 9, 2026

I needed to edit this line https://github.com/taskflow/taskflow/blob/ce3a65c24aba10dbd608877d02b42b5091cc5a02/taskflow/core/worker.hpp#L37 to build with TF_ENABLE_ATOMIC_NOTIFIER (it's missing a semicolon). Unfortunately it didn't seem to make a difference in performance either way. For all tests, v4.0 was slower than v3.10 when using the old implementation of the benchmark. When updating the benchmark to use TaskGroup, it then became faster.

I use std::atomic::wait() / std::atomic::notify() as my notifier in TooManyCooks and haven't observed any issues with it using libstdc++. For these benchmarks specifically, the threads should be awake and able to find work for nearly the entire benchmark run, so it has minimal impact.


tzcnt commented Jan 9, 2026

I've updated the full benchmark results with v4.0 and the new implementations. It's faster or unchanged on every machine + benchmark combo, with one exception: on my 64-core EPYC, fib(39) became slower on v4.0 regardless of the implementation. Note that this benchmark has very high run-to-run variance regardless of implementation or version, but the variance seems to have become much worse now.

- v3.10: 250-520 ms; the mode is 250 ms, with slow runs only occasional
- v4.0: 680-1580 ms; the mode is ~800 ms
- v4.0 w/ TaskGroup: 315-380 ms; the mode is ~350 ms

This particular CPU has multiple internal latency domains with a high penalty for out-of-domain access, so I suspect the high number of atomic operations in fib is the cause.
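The contention hypothesis above can be made concrete with a small sketch (hypothetical microbenchmark, not part of the PR): many threads doing `fetch_add` on one shared atomic force its cache line to migrate between cores, and on a multi-chiplet part like the EPYC 7742 that traffic crosses latency domains, whereas sharding the counter per thread keeps each line local.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr int kThreads = 4;       // illustrative; real contention grows with cores
constexpr int kIters   = 100000;

// All threads hammer one shared atomic: the cache line ping-pongs.
size_t contended() {
  std::atomic<size_t> counter{0};
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&] {
      for (int i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& th : ts) th.join();
  return counter.load();
}

// Each thread owns a cache-line-aligned shard: no cross-core traffic
// until the final sum.
size_t sharded() {
  struct alignas(64) Shard { std::atomic<size_t> v{0}; };
  std::vector<Shard> shards(kThreads);
  std::vector<std::thread> ts;
  for (int t = 0; t < kThreads; ++t)
    ts.emplace_back([&shards, t] {
      for (int i = 0; i < kIters; ++i)
        shards[t].v.fetch_add(1, std::memory_order_relaxed);
    });
  for (auto& th : ts) th.join();
  size_t total = 0;
  for (auto& s : shards) total += s.v.load();
  return total;
}
```

Both functions produce the same count; only the cache-line traffic differs, which is the kind of cost that would dominate a fine-grained benchmark like fib on a many-CCX machine.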


tzcnt commented Jan 9, 2026

Edit: I just tried running fib(39) on v3.10, on the EPYC 7742, with a modification that replaces the second call to rt.silent_async with an inline call, as per your TaskGroup documentation, resulting in this:

  size_t x, y;
  // Spawn only the first recursive call; run the second inline.
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  y = fib(n - 2, rt);
  rt.corun_all();  // wait for the spawned task to finish

On v3.10 that reduces the runtime from ~250ms down to ~145ms, which is a huge win. However, when I run the same code on v4.0.0 the runtime blows up to 800-1200ms, whereas the TaskGroup-based implementation runs at 350ms.
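The "spawn one half, compute the other inline" shape above can be sketched outside Taskflow using plain `std::async` (a stand-in for `rt.silent_async`, used here purely for illustration; the real code passes `tf::Runtime&` down the recursion). Forking only the first recursive call roughly halves the number of spawned tasks, which is where the speedup comes from.

```cpp
#include <cstddef>
#include <future>

// Recursive fib with the tail-optimized fork/join shape discussed above.
size_t fib(size_t n) {
  if (n < 2) return n;
  // Fork the first recursive call onto another thread...
  auto x = std::async(std::launch::async, fib, n - 1);
  // ...and compute the second inline on the current thread.
  size_t y = fib(n - 2);
  return x.get() + y;  // join, analogous to rt.corun_all()
}
```

Note `std::async` spawns a thread per call, so this is only sensible for small `n`; a work-stealing runtime like Taskflow avoids that overhead, which is the whole point of the benchmark.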


tzcnt commented Jan 9, 2026

Update: I tried again with -DTF_ENABLE_ATOMIC_NOTIFIER on v4.0.0, and this reduced the runtime of the TaskGroup + inline-call fib down to ~200ms, although it did not resolve the regression in the runtime + inline-call implementation.

I'll re-run all the benchmarks with -DTF_ENABLE_ATOMIC_NOTIFIER enabled on this machine then...

@tsung-wei-huang

Thank you for the great feedback! Yes, the tail optimization makes a big difference:

  size_t x, y;
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  y = fib(n - 2, rt);
  rt.corun_all();


tzcnt commented Jan 9, 2026

I've completed the update and re-bench. -DTF_ENABLE_ATOMIC_NOTIFIER is now the default bench configuration at https://fleetcode.com/runtime-benchmarks/ .

You can view a comparison of 3.10 vs 4.0, with and without TF_ENABLE_ATOMIC_NOTIFIER, here (sorry about the structure - my infra isn't set up to show multiple variants of the same runtime on the same graph at the moment, so you'll have to tab through the "machines" instead): https://fleetcode.com/runtime-benchmarks/bench_tf/

@tzcnt tzcnt merged commit 8d1934c into main Jan 9, 2026