Update taskflow and benchmark implementations for v4.0 #5
Conversation
Hi @tzcnt, thank you for the pull request! I will update our benchmarks with it too 👍 Yes, I am curious whether you will see any difference when you compile with TF_ENABLE_ATOMIC_NOTIFIER.
I needed to edit this line https://github.com/taskflow/taskflow/blob/ce3a65c24aba10dbd608877d02b42b5091cc5a02/taskflow/core/worker.hpp#L37 to build with TF_ENABLE_ATOMIC_NOTIFIER (it's missing a semicolon). Unfortunately, it didn't seem to make a difference in performance either way. For all tests, v4.0 was slower than v3.10 when using the old implementation of the benchmark; after updating the benchmark to use TaskGroup, it became faster. I use …
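For reference, the "old implementation" here spawns both recursive calls as async tasks. A minimal sketch, assuming the same two-argument fib signature and tf::Runtime API that appear in the snippets later in this thread (not the exact benchmark code):

```cpp
#include <taskflow/taskflow.hpp>

// Sketch of the baseline recursive fib: both children are spawned via
// tf::Runtime::silent_async and joined with corun_all before combining.
size_t fib(size_t n, tf::Runtime& rt) {
  if (n < 2) return n;
  size_t x, y;
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  rt.silent_async([&y, n](tf::Runtime& s) { y = fib(n - 2, s); });
  rt.corun_all();  // wait for both children
  return x + y;
}
```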
I've updated the full benchmark results with v4.0 and the new implementations. It's faster or unchanged on every machine + benchmark combo, with one exception: on my 64-core EPYC, fib(39) became slower on v4.0 regardless of the implementation. Note that this benchmark has very high run-to-run variance regardless of implementation or version, but the variance seems to have become much worse now. On v3.10 it ranges from 250-520ms, but the mode is 250ms, with slow runs only occasional. This particular CPU has multiple internal latency domains with a high penalty for out-of-domain access, so I suspect that the high number of atomic operations in fib is causing the issue.
Edit: I just tried running fib(39) on v3.10, on the EPYC 7742, with a modification that replaces the 2nd silent_async call with a plain inline call:

```cpp
size_t x, y;
rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
y = fib(n - 2, rt);
rt.corun_all();
```

And on v3.10 that reduces the runtime from ~250ms down to ~145ms. This is a huge win. However, when I run the same code on v4.0.0, the runtime blows up to 800-1200ms, whereas the TaskGroup-based implementation runs at 350ms.
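For context, a complete version of that variant might look like the sketch below; the base case and the final return are my reconstruction, not code from this PR. Spawning only the first child as a task and computing the second inline roughly halves the number of spawned tasks per level, and with them the atomic scheduling operations:

```cpp
#include <taskflow/taskflow.hpp>

// Hypothetical complete form of the tail-inlined fib variant above.
// Only fib(n - 1) is spawned as an async task; fib(n - 2) runs inline
// on the current worker, so each level spawns one task instead of two.
size_t fib(size_t n, tf::Runtime& rt) {
  if (n < 2) return n;  // base case (assumed)
  size_t x, y;
  rt.silent_async([&x, n](tf::Runtime& s) { x = fib(n - 1, s); });
  y = fib(n - 2, rt);   // inline recursive call, no task spawn
  rt.corun_all();       // join the outstanding async child
  return x + y;
}
```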
Update: I tried again with -DTF_ENABLE_ATOMIC_NOTIFIER on v4.0.0, and this reduced the runtime of the TaskGroup + inline-call based fib down to ~200ms, although it did not resolve the regression in the runtime + inline-call implementation. I'll re-run all the benchmarks with -DTF_ENABLE_ATOMIC_NOTIFIER enabled on this machine then...
Thank you for the great feedback! Yes, the tail optimization brings a big difference.
I've completed the update and the re-bench. The -DTF_ENABLE_ATOMIC_NOTIFIER build is now the default bench result at https://fleetcode.com/runtime-benchmarks/ . You can view a comparison of 3.10 vs 4.0 with and without TF_ENABLE_ATOMIC_NOTIFIER here (sorry about the structure - my infra isn't set up to show multiple variants of the same runtime on the same graph at the moment, so you'll have to tab through the "machines" instead): https://fleetcode.com/runtime-benchmarks/bench_tf/
Numbers are from my 13600k machine, using 14 threads.
@tsung-wei-huang nice improvements with the new TaskGroup. Something interesting I noticed: when I upgraded to v4.0.0 without changing the code, it ran a fair bit slower; however, with the new TaskGroup, it's much faster. I haven't had a chance to test on other machines yet to determine whether that is a general issue or something specific to this machine.
Feel free to review the updated implementations to ensure that I've got the correct usage for best performance.