Understanding differences in benchmarks on different machines #553

davidlattimore · 2025-03-12T02:40:42Z

davidlattimore
Mar 12, 2025
Maintainer

I'm trying to get a better understanding of the differences that we see in benchmark results between @marxin's machine and my laptop.

Here's Martin's:

Here's my machine for clang without debug info:

And clang with debug info:

I put them in separate plots because they have quite different y-axis ranges and I didn't want the non-debug links to get squashed.

I'm mostly interested in non-debug linker performance, since if someone wants fast link times and needs debug info, there are options like split debug info.

The main difference between Martin's machine and mine is that Martin's has, I think, 12 cores (24 threads) while my laptop has 4 cores (8 threads). My guess is that with more cores, the limiting factor is no longer compute power but memory bandwidth, so the two heavily multithreaded linkers (mold and wild) converge.

Actually another interesting thing is that Martin's link time for clang-non-debug with lld is about 2 seconds, whereas for me it's about half a second. Perhaps something went wrong with that in Martin's run. I guess perhaps the runs were done for the purposes of trying out the graph script, and not intended to be accurate benchmarks.

As another interesting data point, here's the clang non-debug link times on my RPi, which has 4 cores.

I've got my laptop building clickhouse, but it'll be some time before I can see results for that.

marxin · 2025-03-15T09:31:03Z

marxin
Mar 15, 2025
Collaborator Sponsor

Thanks for bringing this topic up. Let's focus on the Clang binary first. I've just built the latest git version (2f9d94981c0eb76fe2127b09351ba7b84064471c) with both clang 19 and GCC 14.2, and it seems there are huge differences in the size of the debug info emitted by the two compilers. I configured the Clang build with the following sets of flags:

For the first experiment:

cmake -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang -G "Unix Makefiles" ../llvm

For the second one:

cmake -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_CXX_FLAGS="-gz=zstd" -DCMAKE_C_FLAGS="-gz=zstd" -G "Unix Makefiles" ../llvm

The last one just replaces -gz=zstd with -gz=zlib.

If one builds the clang binary with the Clang compiler, one gets:

❯ bloaty ../../../../bin/clang-21
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  45.3%   621Mi   0.0%       0    .debug_info
  17.3%   237Mi   0.0%       0    .debug_str
   8.0%   109Mi   0.0%       0    .debug_loclists
   6.7%  91.4Mi   0.0%       0    .debug_str_offsets
   5.8%  79.7Mi   0.0%       0    .debug_line
   5.4%  74.0Mi  52.8%  74.0Mi    .text
   3.5%  47.5Mi  33.9%  47.5Mi    .rodata
   2.5%  34.0Mi   0.0%       0    .debug_addr
   1.9%  26.1Mi   0.0%       0    .debug_rnglists
   1.3%  18.3Mi   0.0%       0    .strtab
   0.8%  10.5Mi   0.0%       0    .debug_abbrev
   0.6%  7.88Mi   5.6%  7.88Mi    .eh_frame
   0.4%  5.53Mi   4.0%  5.53Mi    .rela.dyn
   0.3%  4.26Mi   0.0%       0    .symtab
   0.2%  3.14Mi   2.2%  3.14Mi    .data.rel.ro
   0.1%  1.17Mi   0.8%  1.17Mi    .eh_frame_hdr
   0.0%       0   0.4%   604Ki    .bss
   0.0%   515Ki   0.0%       0    .debug_line_str
   0.0%   293Ki   0.2%   293Ki    .data
   0.0%  29.8Ki   0.0%  28.9Ki    [24 Others]
   0.0%  9.01Ki   0.0%  9.01Ki    .dynstr
 100.0%  1.34Gi 100.0%   140Mi    TOTAL

While using GCC as the compiler, the binary blows up to:

❯ bloaty ../../../../bin/clang-21
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  74.6%  3.41Gi   0.0%       0    .debug_info
   9.2%   430Mi   0.0%       0    .debug_loclists
   5.3%   246Mi   0.0%       0    .debug_str
   4.8%   222Mi   0.0%       0    .debug_line
   2.0%  92.6Mi  58.3%  92.6Mi    .text
   1.7%  78.8Mi   0.0%       0    .debug_rnglists
   1.0%  46.0Mi  29.0%  46.0Mi    .rodata
   0.5%  25.1Mi   0.0%       0    .debug_abbrev
   0.4%  19.5Mi   0.0%       0    .strtab
   0.2%  9.11Mi   5.7%  9.11Mi    .eh_frame
   0.1%  5.71Mi   3.6%  5.71Mi    .rela.dyn
   0.1%  4.87Mi   0.0%       0    .symtab
   0.1%  3.83Mi   0.0%       0    .debug_aranges
   0.1%  3.15Mi   2.0%  3.15Mi    .data.rel.ro
   0.0%  1.14Mi   0.7%  1.14Mi    .eh_frame_hdr
   0.0%       0   0.5%   752Ki    .bss
   0.0%   383Ki   0.0%       0    .debug_line_str
   0.0%   303Ki   0.2%   303Ki    .data
   0.0%  23.9Ki   0.0%  23.3Ki    [22 Others]
   0.0%  15.9Ki   0.0%  15.9Ki    .got
   0.0%  8.53Ki   0.0%  8.53Ki    .dynstr
 100.0%  4.57Gi 100.0%   158Mi    TOTAL

If I build with GCC, the linker needs to load about 1.15 GB (1.07 GiB) of input sections and decompress them to about 6.53 GB (6.09 GiB) of data:
2025-03-15T09:16:27.178989Z DEBUG metrics: input_sections loaded_bytes=1151078581 loaded_compressed_bytes=1149141328 decompressed_bytes=6533849713

However, using the Clang compiler, one gets only:
2025-03-15T09:17:10.393725Z DEBUG metrics: input_sections loaded_bytes=1117322243 loaded_compressed_bytes=0 decompressed_bytes=0
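For a quick side-by-side of those two log lines, here is a minimal sketch that pulls the byte counters out of a metrics line and reports them in GiB, plus the expansion factor of the compressed sections. The function name summarize_metrics is made up for illustration; it only assumes the key=value layout visible in the log lines above.

```shell
#!/bin/bash
# Sketch: summarize a wild "metrics: input_sections" log line read from stdin.
# summarize_metrics is a hypothetical helper, not part of wild.
summarize_metrics() {
  awk '{
    # Collect every key=value token on the line; other tokens are ignored.
    for (i = 1; i <= NF; i++) {
      if (split($i, kv, "=") == 2) v[kv[1]] = kv[2]
    }
    gib = 1024 * 1024 * 1024
    printf "loaded=%.2fGiB decompressed=%.2fGiB", v["loaded_bytes"] / gib, v["decompressed_bytes"] / gib
    # Only report an expansion factor when compressed input was actually loaded.
    if (v["loaded_compressed_bytes"] + 0 > 0)
      printf " expansion=%.2fx", v["decompressed_bytes"] / v["loaded_compressed_bytes"]
    printf "\n"
  }'
}

gcc_line='2025-03-15T09:16:27.178989Z DEBUG metrics: input_sections loaded_bytes=1151078581 loaded_compressed_bytes=1149141328 decompressed_bytes=6533849713'
clang_line='2025-03-15T09:17:10.393725Z DEBUG metrics: input_sections loaded_bytes=1117322243 loaded_compressed_bytes=0 decompressed_bytes=0'

echo "$gcc_line" | summarize_metrics
echo "$clang_line" | summarize_metrics
```

So the on-disk input is nearly the same size in both cases, but the GCC objects expand by more than five times once decompressed.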

Then, using hyperfine with 2 warmup rounds, I get the following stats. Note that it makes little difference which compression algorithm is used for the debug info sections. Both mold and Wild are run in --no-fork mode:

One can observe the following samply profile for Wild when linking ZSTD GCC objects (seems build-id is generated):

Then I ran all the linkers with -Wl,--strip-debug and yes, it seems lld is much slower because it compresses the debug info sections even though they are stripped. I can confirm that with perf top:

Samply profile for Wild follows:

@davidlattimore I hope the collected data helps. Feel free to ask more questions if anything is unclear!

marxin · 2025-03-15T13:07:31Z

marxin
Mar 15, 2025
Collaborator Sponsor

> As another interesting data point, here's the clang non-debug link times on my RPi, which has 4 cores.

If I link Clang on my RPi5 w/o debug info, I get the following results:

Benchmark 1: ./run-with -B ~/Programming/wild -Wl,--no-fork
  Time (mean ± σ):     729.2 ms ±   9.7 ms    [User: 1812.4 ms, System: 790.5 ms]
  Range (min … max):   714.7 ms … 748.9 ms    10 runs
 
Benchmark 2: ./run-with -fuse-ld=lld
  Time (mean ± σ):      1.855 s ±  0.023 s    [User: 2.661 s, System: 1.204 s]
  Range (min … max):    1.820 s …  1.883 s    10 runs
 
Benchmark 3: ./run-with -fuse-ld=mold -Wl,--no-fork
  Time (mean ± σ):      1.454 s ±  0.015 s    [User: 3.794 s, System: 0.817 s]
  Range (min … max):    1.437 s …  1.476 s    10 runs
 
Summary
  ./run-with -B ~/Programming/wild -Wl,--no-fork ran
    1.99 ± 0.03 times faster than ./run-with -fuse-ld=mold -Wl,--no-fork
    2.54 ± 0.05 times faster than ./run-with -fuse-ld=lld

Which aligns with your comparison, I guess!?
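For reference, the ratios in the hyperfine summary are simply the mean times divided by each other. A small sketch that recomputes them from the means above; the speedup helper and the variable names are made up for this example:

```shell
#!/bin/bash
# Sketch: recompute the hyperfine "times faster" ratios from the mean times.
# speedup is a hypothetical helper: slower mean divided by faster mean.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# Mean times from the RPi5 run above, all converted to milliseconds.
wild_ms=729.2
mold_ms=1454
lld_ms=1855

echo "wild vs mold: $(speedup "$mold_ms" "$wild_ms")x"
echo "wild vs lld:  $(speedup "$lld_ms" "$wild_ms")x"
```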

davidlattimore · 2025-03-16T11:06:56Z

davidlattimore
Mar 16, 2025
Maintainer Author

It's interesting that LLD is processing all the debug info even when --strip-debug is passed. I'm surprised that I hadn't realised that before, but I can definitely reproduce it.

I've done a bit of profiling today and got a few small performance wins. I've switched to mostly doing off-CPU profiling, which is interesting because it shows what each thread is doing even when it's idle, e.g. if it's sleeping. Here's an example trace: https://share.firefox.dev/4kLpEpM (needs to be opened in Firefox)

This profile was generated with the following script:

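#!/bin/bash
set -e

# Run once without profiling to make sure caches are warm
"$@"

# Sample CPU time and record context switches, so off-CPU periods show up too
perf record -e cpu-clock,context-switches -g -F 999 -m 16M -o ~/tmp/perf.data "$@"
# Dump the samples as text, including the pid field
perf script -F +pid -i ~/tmp/perf.data > ~/tmp/out.perf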

Then open out.perf in the Firefox Profiler.

It'd be interesting to see a similar profile from a machine with more cores.

lqd Mar 19, 2025

It may be possible on nightly, with cargo-features = ["profile-rustflags"] and then something like

[profile.theprofile]
rustflags = ["-Cforce-frame-pointers=yes"]

marxin Mar 20, 2025
Collaborator Sponsor

> @marxin your profiles don't have symbols, can you try again with either --call-graph=dwarf added to the perf or forced frame pointers when building wild? Also make sure to use correct binary with profile opt-debug, it puts binary to a different directory.

I'm sorry. I did use --call-graph=dwarf, but I wrongly used the release profile instead of opt-debug, which includes debug info. My experience with --call-graph=dwarf has always been positive (as long as you don't forget to build the thing you measure with debug symbols).

So there's the profile with debug info for Clang (ZSTD):
https://share.firefox.dev/420veMH

And there's one with -Wl,--strip-debug:
https://share.firefox.dev/4kBZUMD

mati865 Mar 20, 2025
Collaborator Sponsor

> I tried --call-graph=dwarf yesterday and it made profile collection so slow that I ended up cancelling it and turning frame pointers back on.

Interesting, with warm cache I don't see this problem myself.
Before running perf, I'm configuring the system with:

echo '128000' | sudo tee /proc/sys/kernel/perf_event_mlock_kb
echo 0 | sudo tee /proc/sys/kernel/kptr_restrict
echo '1' | sudo tee /proc/sys/kernel/perf_event_paranoid

This results in these timings for me:

Variant                                                 Time
Regular opt-debug                                       713 ms
Frame pointers opt-debug                                731 ms
Perf with frame pointers                                1.09 s
Perf with --call-graph=dwarf, without frame pointers    1.20 s
Perf with --call-graph=dwarf and frame pointers         1.21 s

(times gathered with time command, cache was warm)
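To put those timings in relative terms, here is a rough sketch that computes each variant's overhead against the first row. It assumes tab-separated "variant, time" rows with a unit of either ms or s; the overhead_table helper is made up for this example.

```shell
#!/bin/bash
# Sketch: print each variant's overhead relative to the first (baseline) row.
# overhead_table is a hypothetical helper reading "variant<TAB>time unit" rows.
overhead_table() {
  awk -F'\t' '
    # Convert "713 ms" or "1.09 s" to milliseconds.
    function to_ms(t,   parts) {
      split(t, parts, " ")
      return (parts[2] == "s") ? parts[1] * 1000 : parts[1]
    }
    NR == 1 { base = to_ms($2) }
    { printf "%s: %+.1f%%\n", $1, (to_ms($2) / base - 1) * 100 }
  '
}

# Timings copied from the table above.
timings=$(printf '%s\t%s\n' \
  'Regular opt-debug' '713 ms' \
  'Frame pointers opt-debug' '731 ms' \
  'Perf with frame pointers' '1.09 s' \
  'Perf with --call-graph=dwarf, without frame pointers' '1.20 s' \
  'Perf with --call-graph=dwarf and frame pointers' '1.21 s')

printf '%s\n' "$timings" | overhead_table
```

On these numbers, frame pointers alone cost a couple of percent, while recording with perf adds roughly 50 to 70 percent depending on the unwinding mode.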

perf script takes 7.5 s with DWARF because it has to process about 200 MiB of DWARF frames. I'm surprised it's so slow for you.

davidlattimore Mar 21, 2025
Maintainer Author

It turned out that it's actually perf script that's slow when --call-graph=dwarf is used on my machine, not perf record. I noticed that it was shelling out to addr2line and did a search for related info. This blog post does a pretty good job of explaining the slowness.

I raised an issue on cargo asking if they'd accept a profile option to set force-frame-pointers, since I don't want to add nightly-only options in our Cargo.toml.

I remembered reading an article in favour of frame pointers some time ago and just tracked it down again.

Thanks for the updated profiles @marxin. However, I notice that they don't have kernel symbols. I wonder if that's related to the commands mentioned above (kptr_restrict etc). I don't run those, but it's possible that either I did something permanent that set those up, or that my distro just has different defaults.
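One way to check whether the settings differ between machines is to print the current values without changing anything. A minimal sketch, assuming a Linux system where these files live under /proc/sys/kernel; the show_perf_settings function name is made up for this example:

```shell
#!/bin/bash
# Sketch: print the current perf-related kernel settings without modifying them.
# show_perf_settings is a hypothetical helper; it reports "unavailable" when
# a file cannot be read (for example on a non-Linux system).
show_perf_settings() {
  local name path
  for name in perf_event_mlock_kb kptr_restrict perf_event_paranoid; do
    path="/proc/sys/kernel/$name"
    if [ -r "$path" ]; then
      printf '%s=%s\n' "$name" "$(cat "$path")"
    else
      printf '%s=unavailable\n' "$name"
    fi
  done
}

show_perf_settings
```

Comparing this output from both machines would show whether the distro defaults already match the values set by the commands above.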

marxin Mar 21, 2025
Collaborator Sponsor

> It turned out that it's actually perf script that's slow when --call-graph=dwarf is used on my machine, not perf record. I noticed that it was shelling out to addr2line and did a search for related info. This blog post does a pretty good job of explaining the slowness.

Btw, starting with my change to perf (6.12+) (torvalds/linux@e6b56ae), you can use the Rust addr2line implementation for perf script, which is blazingly fast:
https://github.com/gimli-rs/addr2line?tab=readme-ov-file#performance
