Understanding differences in benchmarks on different machines #553
-
Thanks for bringing this topic up. Let's focus on the Clang binary first, where I've just built the latest git version (
If one links
While using GCC as the compiler, the binary blows up to:
If I build with GCC, the linker needs to load 1.1 GiB of input sections and decompress them to 6.5 GiB of data:
However, using the Clang compiler, one gets only:
Then, if I use
One can observe the following samply profile for Wild when linking ZSTD GCC objects (it seems a build-id is generated):
Then run all the linkers with
The samply profile for Wild follows:
@davidlattimore I hope the collected data helps. Feel free to ask more questions if anything is unclear!
-
If I link Clang on my RPi5 without debug info, I get the following results:
Which aligns with your comparison, I guess!?
-
It's interesting that LLD is processing all the debug info even when
I've done a bit of profiling today and got a few small performance wins. I've switched to mostly doing off-CPU profiling, which is interesting because it shows what each thread is doing even when it's idle, e.g. if it's sleeping. Here's an example trace: https://share.firefox.dev/4kLpEpM (needs to be opened in Firefox)
This profile was generated with the following script:
```bash
#!/bin/bash
set -e
# Run once without profiling to make sure caches are warm
"$@"
# Record CPU-clock samples plus context switches, with call graphs,
# so that time spent off-CPU is visible as well
perf record -e cpu-clock,context-switches -g -F 999 -m 16M -o ~/tmp/perf.data "$@"
# Dump the recording as text, including the pid of each sample
perf script -F +pid -i ~/tmp/perf.data > ~/tmp/out.perf
```
Then open
It'd be interesting to see a similar profile from a machine with more cores.
-
I'm trying to get a better understanding of the differences that we see in benchmark results between @marxin's machine and my laptop.
Here's Martin's:
Here's my machine for clang without debug info:
And clang with debug info:
I put them in separate plots because they have quite different y-axes and I didn't want the non-debug links to get squashed.
I'm mostly interested in non-debug linker performance, since if someone wants fast link times and needs debug info, there are options like split debug info.
The main difference between Martin's machine and mine is that Martin's has, I think, 12 cores (24 threads), while my laptop has 4 cores (8 threads). My guess is that with more cores, the limiting factor is no longer compute power but memory bandwidth, so the two heavily multithreaded linkers (mold and wild) converge.
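One rough way to probe that guess is to re-run the same link pinned to 1, 2, 4, ... CPUs and see where the time stops improving. The script below is only a sketch: `cpu_list` and `cpu_counts` are hypothetical helper names, it assumes Linux with `taskset`, `nproc` and GNU `date`, and it assumes the linker sizes its thread pool from the CPUs it is allowed to run on, which is worth verifying for each linker. CPUs `0..N-1` may also be hyperthread siblings rather than separate cores, depending on the machine's topology.

```bash
#!/bin/bash
# Sketch: time a link command pinned to an increasing number of CPUs.
# Usage (hypothetical): ./scale.sh <link command and its arguments>

# Print the CPU list "0-(N-1)" that `taskset -c` expects for the first N CPUs.
cpu_list() {
    if [ "$1" -le 1 ]; then echo 0; else echo "0-$(( $1 - 1 ))"; fi
}

# Print 1, 2, 4, ... and finally MAX itself (also when MAX is not a power of two).
cpu_counts() {
    local max=$1 n=1
    while [ "$n" -lt "$max" ]; do echo "$n"; n=$(( n * 2 )); done
    echo "$max"
}

if [ "$#" -gt 0 ]; then
    "$@" > /dev/null 2>&1    # warm-up run so file caches are hot
    for n in $(cpu_counts "$(nproc)"); do
        start=$(date +%s%N)
        taskset -c "$(cpu_list "$n")" "$@" > /dev/null 2>&1
        end=$(date +%s%N)
        echo "$n CPUs: $(( (end - start) / 1000000 )) ms"
    done
fi
```

If the times flatten out well before all CPUs are in use, that would be consistent with something other than compute being the limit, though it would not by itself show that memory bandwidth is the cause.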
Actually another interesting thing is that Martin's link time for clang-non-debug with lld is about 2 seconds, whereas for me it's about half a second. Perhaps something went wrong with that in Martin's run. I guess perhaps the runs were done for the purposes of trying out the graph script, and not intended to be accurate benchmarks.
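To rule out a one-off outlier like that, it may help to time each link several times after a warm-up run and compare medians. This is a minimal sketch, not the benchmark script used for the plots: `median` and `time_runs` are hypothetical helpers, and it assumes bash with GNU `date` (for `%N`), `sort` and `awk`.

```bash
#!/bin/bash
# Sketch: time a command several times after one warm-up run.
# Usage (hypothetical): ./timeit.sh <link command and its arguments>

# Print the median of the numbers passed as arguments.
median() {
    printf '%s\n' "$@" | sort -n | awk '{ v[NR] = $1 } END {
        if (NR % 2) print v[(NR + 1) / 2]
        else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# time_runs RUNS CMD...: run CMD once to warm caches, then RUNS more times,
# printing the wall-clock time of each timed run in milliseconds.
time_runs() {
    local runs=$1; shift
    "$@" > /dev/null 2>&1    # warm-up, not timed
    local i start end
    for ((i = 0; i < runs; i++)); do
        start=$(date +%s%N)
        "$@" > /dev/null 2>&1
        end=$(date +%s%N)
        echo $(( (end - start) / 1000000 ))
    done
}

if [ "$#" -gt 0 ]; then
    times=$(time_runs 5 "$@")
    echo "runs (ms): $(echo $times)"
    echo "median (ms): $(median $times)"
fi
```

The median is used here rather than the mean so that a single slow run does not dominate the reported number.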
As another interesting data point, here are the clang non-debug link times on my RPi, which has 4 cores.
I've got my laptop building clickhouse, but it'll be some time before I can see results for that.