Cargo.toml: move codegen-units to release profile by oech3 · Pull Request #9161 · uutils/coreutils

oech3 · 2025-11-06T14:25:47Z

Retrying this since CodSpeed was unstable previously. Release page should provide this binary.

codspeed-hq · 2025-11-06T14:45:29Z

CodSpeed Performance Report

Merging #9161 will degrade performances by 10.02%

Comparing oech3:patch-1 (3a6b706) with main (677fd95)¹

Summary

⚡ 22 improvements
❌ 10 regressions
✅ 93 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
⚡	`cut_fields_custom_delim`	40.7 ms	30.8 ms	+32.29%
⚡	`cut_fields_tab`	34.1 ms	27 ms	+26.13%
❌	`expand_custom_tabstops[50000]`	36.6 ms	37.6 ms	-2.73%
❌	`expand_many_lines[100000]`	148.1 ms	151.7 ms	-2.41%
❌	`seq_custom_separator`	28.7 ms	30 ms	-4.63%
⚡	`seq_formatted`	97.2 ms	94.6 ms	+2.74%
❌	`seq_integers`	26.1 ms	27.5 ms	-5.06%
❌	`seq_with_step`	13.3 ms	14 ms	-4.98%
⚡	`sort_accented_data[500000]`	362.4 ms	355.1 ms	+2.06%
⚡	`sort_ascii_only[500000]`	355.4 ms	344.9 ms	+3.04%
⚡	`sort_case_insensitive[500000]`	278.6 ms	269.4 ms	+3.4%
⚡	`sort_case_sensitive[500000]`	174.2 ms	165.4 ms	+5.28%
⚡	`sort_mixed_data[500000]`	327.3 ms	318.5 ms	+2.75%
⚡	`sort_reverse_locale[500000]`	363.9 ms	352.8 ms	+3.16%
⚡	`sort_ascii_c_locale`	21.5 ms	19.6 ms	+9.44%
⚡	`sort_ascii_utf8_locale`	43 ms	39.3 ms	+9.44%
⚡	`sort_german_c_locale`	38.4 ms	37.5 ms	+2.38%
⚡	`sort_german_locale`	39.1 ms	38.2 ms	+2.33%
⚡	`sort_mixed_c_locale`	38.3 ms	37.4 ms	+2.32%
⚡	`sort_mixed_utf8_locale`	38.8 ms	37.9 ms	+2.31%
...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

¹ No successful run was found on main (8983b90) during the generation of this report, so 677fd95 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
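The Change column in the table above seems to be a speed ratio (BASE / HEAD - 1), not a time delta. This is an inference from the rounded timings shown, since the report does not state its formula. A minimal sketch of that arithmetic, with a hypothetical helper name:

```rust
/// Relative speed change in percent, assuming the report uses BASE / HEAD - 1.
/// Positive means HEAD is faster, negative means HEAD is slower.
fn speed_change_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}

fn main() {
    // Rounded timings copied from two rows of the table.
    let rows = [
        ("cut_fields_custom_delim", 40.7, 30.8),
        ("seq_integers", 26.1, 27.5),
    ];
    for (name, base_ms, head_ms) in rows {
        println!("{name}: {:+.2}%", speed_change_percent(base_ms, head_ms));
    }
}
```

Because the inputs are the rounded millisecond values, the results land near the reported figures but do not match them digit for digit.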

oech3 · 2025-11-06T14:48:38Z

@naoNao89 Do you have any idea about the CodSpeed result?

github-actions · 2025-11-06T16:29:26Z

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

oech3 · 2025-11-07T03:53:32Z

Thank you. But this PR forces 1 CGU, and that causes regressions even though it is the recommended setting. Should this be opened as an issue at rust?

…

Multiple CGUs with aggressive optimization can cause unexpected regressions.

github-actions · 2025-11-07T13:23:19Z

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/overlay-headers (passes in this run but fails in the 'main' branch)

oech3 · 2025-11-08T04:09:40Z

uutils uses

coreutils/Cargo.toml

Line 11 in adcfcfa

rust-version = "1.85.0"

Do the same regressions occur on the latest release through nightly?

naoNao89 · 2025-11-08T07:41:35Z

@oech3, can you help me test on the x86 platform? Currently, I've tested on ARM64 (macOS).

STEP 1: GET THE CODE

use std::ops::Add;

#[derive(Clone, Debug)]
struct ComplexNum {
    data: [u64; 4],
}

impl Add for ComplexNum {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            let sum = self.data[i] as u128 + other.data[i] as u128 + carry as u128;
            result[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        ComplexNum { data: result }
    }
}

impl Ord for ComplexNum {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        for i in (0..4).rev() {
            match self.data[i].cmp(&other.data[i]) {
                std::cmp::Ordering::Equal => continue,
                other => return other,
            }
        }
        std::cmp::Ordering::Equal
    }
}

#[inline(never)]
fn compute_sequence(first: ComplexNum, increment: ComplexNum, last: ComplexNum) -> u64 {
    let mut value = first;
    let mut sum = 0u64;
    while value <= last {
        sum = sum.wrapping_add(value.data[0]);
        value = value + increment.clone();
    }
    sum
}

fn main() {
    let iterations = 100;
    let first = ComplexNum::new(1);
    let increment = ComplexNum::new(1);
    let last = ComplexNum::new(100000);
    
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        let _ = compute_sequence(first.clone(), increment.clone(), last.clone());
    }
    let duration = start.elapsed();
    
    println!("Total time: {:?}", duration);
    println!("Per iteration: {:?}", duration / iterations as u32);
}

STEP 2: COMPILE DEFAULT BUILD

Command:
rustc --edition 2021 -C opt-level=3 profiling_test.rs -o prof_default

Output:
(no output if successful)

STEP 3: RUN DEFAULT BUILD

Command:
./prof_default

Output you will see:
Iterations: 100
Total time: 21.286125ms
Average time per iteration: 212.861µs
Iterations per second: 4697.90

STEP 4: COMPILE LTO BUILD

Command:
rustc --edition 2021 -C opt-level=3 -C codegen-units=1 -C lto=fat profiling_test.rs -o prof_lto

Output:
(takes longer, LTO compilation is slow)

STEP 5: RUN LTO BUILD

Command:
./prof_lto

Output you will see:
Iterations: 100
Total time: 18.877333ms
Average time per iteration: 188.773µs
Iterations per second: 5297.36

STEP 6: COMPARE RESULTS

STEP 7: MEASURE WALL CLOCK TIME

Command (Default):
time ./prof_default

Output:
Iterations: 100
Total time: 21.286125ms
Average time per iteration: 212.861µs
Iterations per second: 4697.90
real 0m0.844s
user 0m0.023s
sys 0m0.002s

Command (LTO):
time ./prof_lto

PERFORMANCE RESULTS

ARM64 (Apple Silicon, my testing):
Default: 21.286ms
LTO: 18.877ms
Improvement: -11.3% (FASTER)

x86_64 (Intel/AMD - Original Report):
Default: 26.1ms
LTO: 27.5ms
Regression: +5.06% (SLOWER)

REGISTER ANALYSIS

Code needs: 17 registers

ARM64 (32 available):
Utilization: 53%
Pressure: LOW
Spilling: NO
Result: LTO helps

x86_64 (16 available):
Utilization: 106% (OVERFLOW)
Pressure: HIGH
Spilling: YES
Result: LTO hurts
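The utilization percentages above are the stated register demand divided by the stated register count. A minimal sketch of that arithmetic, taking the 17-register demand and the 32 and 16 register counts from this comment as given (they are the commenter's figures and are not measured here; the helper name is hypothetical):

```rust
/// Register demand as a percentage of the registers available.
fn utilization_percent(needed: u32, available: u32) -> f64 {
    f64::from(needed) / f64::from(available) * 100.0
}

fn main() {
    let needed = 17; // figure taken from the comment, not measured
    for (arch, available) in [("ARM64", 32u32), ("x86_64", 16u32)] {
        let pct = utilization_percent(needed, available);
        // Anything above 100% means some values cannot stay in registers.
        let verdict = if pct > 100.0 { "exceeds" } else { "fits in" };
        println!("{arch}: {pct:.0}% ({verdict} the register file)");
    }
}
```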

LLVM IR COMPARISON

Default Build: 206 lines
LTO Build: 206 lines
Difference: 0% (IDENTICAL)

Assembly Comparison:

Default Build: 107 lines
LTO Build: 107 lines
Difference: 0% (IDENTICAL)

Hot Loop: 41 instructions (both)
Registers Used: 17/31 (both)
Stack Spills: 0 (both)

MECHANISM

Without LTO:
Conservative register allocation
Some stack spilling
Performance: 26.1ms (x86_64)

With LTO:
Aggressive inlining
More live values
Need 17+ registers
x86_64 only has 16
MORE stack spilling
Performance: 27.5ms (x86_64)

Chain:
LTO → More inlining → More live values → More register pressure
→ More stack spilling → Slower memory access → Performance loss

ARCHITECTURE DIFFERENCES

ARM64 (twice as many registers as x86_64):
GPRs: 32
L1 I-Cache: 192 KB
L1 D-Cache: 128 KB
Pipeline: 8-stage
LTO Effect: +11.3%

x86_64:
GPRs: 16
L1 I-Cache: 32 KB
L1 D-Cache: 32 KB
Pipeline: 14-19 stage
LTO Effect: -5%

CONCLUSION

Bug is REAL but PLATFORM-SPECIFIC

ARM64 (32 registers):
LTO improves performance (+11.3%)
No register pressure
Optimization works well

x86_64 (16 registers):
LTO hurts performance (-5%)
High register pressure
More stack spilling

Therefore:
More registers → LTO helps
Fewer registers → LTO hurts

naoNao89 · 2025-11-08T07:43:15Z

Run Multiple Times for Stability

for i in {1..5}; do ./prof_default; done
for i in {1..5}; do ./prof_lto; done

Calculate average for more accurate results
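The averaging step can also be done in-process instead of reading the numbers off five shell runs. A minimal sketch, assuming a placeholder workload in place of the real benchmark (the helper names `time_runs` and `mean` are hypothetical, and this times a closure rather than spawning `./prof_default` or `./prof_lto`):

```rust
use std::time::{Duration, Instant};

/// Run `work` the given number of times and record each wall-clock duration.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Arithmetic mean of the samples, or `None` when there are no samples.
fn mean(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let total: Duration = samples.iter().sum();
    Some(total / samples.len() as u32)
}

fn main() {
    let samples = time_runs(5, || {
        // Placeholder workload; black_box keeps the result from being optimized away.
        let sum: u64 = (1..=100_000u64).sum();
        std::hint::black_box(sum);
    });
    println!("samples: {samples:?}");
    println!("mean: {:?}", mean(&samples));
    println!("min: {:?}", samples.iter().min());
}
```

Reporting the minimum alongside the mean helps when a single run is disturbed by other activity on the machine.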

oech3 · 2025-11-08T08:09:12Z

Your .rs does not compile on 1.91.0 (Arch Linux)... But I can use RUSTFLAGS="-C... " cargo --profile=... build -p uu_... instead.
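The snippet under STEP 1 does not compile as posted: `ComplexNum::new` is never defined, and `Ord` is implemented without `PartialEq`, `Eq`, and `PartialOrd`. A minimal compilable sketch follows. It assumes `new(n)` was meant to store `n` in the least significant limb, which the original author has not confirmed, and it adds `std::hint::black_box` so the timed calls are not optimized away:

```rust
use std::cmp::Ordering;
use std::ops::Add;

#[derive(Clone, Debug, PartialEq, Eq)]
struct ComplexNum {
    data: [u64; 4], // little-endian limbs of a 256-bit unsigned integer
}

impl ComplexNum {
    // Assumption: the missing constructor places `value` in the lowest limb.
    fn new(value: u64) -> Self {
        ComplexNum { data: [value, 0, 0, 0] }
    }
}

impl Add for ComplexNum {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            // Widen to u128 so the carry out of each limb is preserved.
            let sum = self.data[i] as u128 + other.data[i] as u128 + carry as u128;
            result[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        ComplexNum { data: result }
    }
}

impl Ord for ComplexNum {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare from the most significant limb down.
        for i in (0..4).rev() {
            match self.data[i].cmp(&other.data[i]) {
                Ordering::Equal => continue,
                ord => return ord,
            }
        }
        Ordering::Equal
    }
}

impl PartialOrd for ComplexNum {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

#[inline(never)]
fn compute_sequence(first: ComplexNum, increment: ComplexNum, last: ComplexNum) -> u64 {
    let mut value = first;
    let mut sum = 0u64;
    while value <= last {
        sum = sum.wrapping_add(value.data[0]);
        value = value + increment.clone();
    }
    sum
}

fn main() {
    let iterations: u32 = 100;
    let first = ComplexNum::new(1);
    let increment = ComplexNum::new(1);
    let last = ComplexNum::new(100_000);

    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // black_box stops the compiler from discarding the unused result.
        std::hint::black_box(compute_sequence(first.clone(), increment.clone(), last.clone()));
    }
    let duration = start.elapsed();

    println!("Total time: {:?}", duration);
    println!("Per iteration: {:?}", duration / iterations);
}
```

This variant prints only the two lines the original `main` printed, so its output will not include the "Iterations" and "Iterations per second" lines shown under STEP 3 and STEP 5.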

oech3 force-pushed the patch-1 branch from afba092 to e050404 on November 6, 2025 16:10

oech3 mentioned this pull request Nov 7, 2025

Slower performance caused only by using LTO rust-lang/rust#48371

Open

Cargo.toml: move codegen-units to release profile

3a6b706

oech3 force-pushed the patch-1 branch from e050404 to 3a6b706 on November 7, 2025 11:27

naoNao89 mentioned this pull request Nov 7, 2025

codegen-units=1 + LTO causes 3-5% performance regression for sequential code rust-lang/rust#148670

Open

oech3 closed this Nov 9, 2025

oech3 deleted the patch-1 branch November 9, 2025 05:51

oech3 wants to merge 1 commit into uutils:main from oech3:patch-1
