Cargo.toml: move codegen-units to release profile #9161

Closed
oech3 wants to merge 1 commit into uutils:main from oech3:patch-1
Conversation

@oech3
Contributor

@oech3 oech3 commented Nov 6, 2025

Retrying this since CodSpeed was unstable previously. Release page should provide this binary.

@codspeed-hq

codspeed-hq bot commented Nov 6, 2025

CodSpeed Performance Report

Merging #9161 will degrade performances by 10.02%

Comparing oech3:patch-1 (3a6b706) with main (677fd95) [1]

Summary

⚡ 22 improvements
❌ 10 regressions
✅ 93 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| cut_fields_custom_delim | 40.7 ms | 30.8 ms | +32.29% |
| cut_fields_tab | 34.1 ms | 27 ms | +26.13% |
| expand_custom_tabstops[50000] | 36.6 ms | 37.6 ms | -2.73% |
| expand_many_lines[100000] | 148.1 ms | 151.7 ms | -2.41% |
| seq_custom_separator | 28.7 ms | 30 ms | -4.63% |
| seq_formatted | 97.2 ms | 94.6 ms | +2.74% |
| seq_integers | 26.1 ms | 27.5 ms | -5.06% |
| seq_with_step | 13.3 ms | 14 ms | -4.98% |
| sort_accented_data[500000] | 362.4 ms | 355.1 ms | +2.06% |
| sort_ascii_only[500000] | 355.4 ms | 344.9 ms | +3.04% |
| sort_case_insensitive[500000] | 278.6 ms | 269.4 ms | +3.4% |
| sort_case_sensitive[500000] | 174.2 ms | 165.4 ms | +5.28% |
| sort_mixed_data[500000] | 327.3 ms | 318.5 ms | +2.75% |
| sort_reverse_locale[500000] | 363.9 ms | 352.8 ms | +3.16% |
| sort_ascii_c_locale | 21.5 ms | 19.6 ms | +9.44% |
| sort_ascii_utf8_locale | 43 ms | 39.3 ms | +9.44% |
| sort_german_c_locale | 38.4 ms | 37.5 ms | +2.38% |
| sort_german_locale | 39.1 ms | 38.2 ms | +2.33% |
| sort_mixed_c_locale | 38.3 ms | 37.4 ms | +2.32% |
| sort_mixed_utf8_locale | 38.8 ms | 37.9 ms | +2.31% |
| ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Footnotes

  [1] No successful run was found on main (8983b90) during the generation of this report, so 677fd95 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@oech3
Contributor Author

oech3 commented Nov 6, 2025

@naoNao89 Do you have any idea what is behind the CodSpeed result?

@github-actions

github-actions bot commented Nov 6, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/tail/overlay-headers (fails in this run but passes in the 'main' branch)

@oech3
Contributor Author

oech3 commented Nov 7, 2025 via email

@github-actions

github-actions bot commented Nov 7, 2025

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/tail/overlay-headers (passes in this run but fails in the 'main' branch)

@oech3
Contributor Author

oech3 commented Nov 8, 2025

uutils uses

rust-version = "1.85.0"

Do the same regressions show up on the latest stable through nightly toolchains?

@naoNao89
Contributor

naoNao89 commented Nov 8, 2025

@oech3, can you help me test on the x86 platform? Currently, I've tested on ARM64 (macOS).

STEP 1: GET THE CODE

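// profiling_test.rs
use std::cmp::Ordering;
use std::hint::black_box;
use std::ops::Add;

// A 256-bit unsigned value stored as four u64 limbs, least significant first.
// PartialEq/Eq are derived because `Ord` requires them.
#[derive(Clone, Debug, PartialEq, Eq)]
struct ComplexNum {
    data: [u64; 4],
}

impl ComplexNum {
    // Not in the original post; added so that `ComplexNum::new` in `main` resolves.
    fn new(value: u64) -> Self {
        ComplexNum { data: [value, 0, 0, 0] }
    }
}

impl Add for ComplexNum {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        // Limb-wise addition with carry; a carry out of the top limb is dropped.
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            let sum = self.data[i] as u128 + other.data[i] as u128 + carry as u128;
            result[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        ComplexNum { data: result }
    }
}

// Not in the original post; `Ord` cannot be implemented without `PartialOrd`.
impl PartialOrd for ComplexNum {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for ComplexNum {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare from the most significant limb down.
        for i in (0..4).rev() {
            match self.data[i].cmp(&other.data[i]) {
                Ordering::Equal => continue,
                ord => return ord,
            }
        }
        Ordering::Equal
    }
}

#[inline(never)]
fn compute_sequence(first: ComplexNum, increment: ComplexNum, last: ComplexNum) -> u64 {
    let mut value = first;
    let mut sum = 0u64;
    while value <= last {
        sum = sum.wrapping_add(value.data[0]);
        value = value + increment.clone();
    }
    sum
}

fn main() {
    let iterations: u32 = 100;
    let first = ComplexNum::new(1);
    let increment = ComplexNum::new(1);
    let last = ComplexNum::new(100000);

    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(compute_sequence(first.clone(), increment.clone(), last.clone()));
    }
    let duration = start.elapsed();

    // Prints the same four lines as the sample output in the steps below.
    println!("Iterations: {}", iterations);
    println!("Total time: {:?}", duration);
    println!("Average time per iteration: {:?}", duration / iterations);
    println!("Iterations per second: {:.2}", f64::from(iterations) / duration.as_secs_f64());
}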

STEP 2: COMPILE DEFAULT BUILD

Command:
rustc --edition 2021 -C opt-level=3 profiling_test.rs -o prof_default

Output:
(no output if successful)

STEP 3: RUN DEFAULT BUILD

Command:
./prof_default

Output you will see:
Iterations: 100
Total time: 21.286125ms
Average time per iteration: 212.861µs
Iterations per second: 4697.90

STEP 4: COMPILE LTO BUILD

Command:
rustc --edition 2021 -C opt-level=3 -C codegen-units=1 -C lto=fat profiling_test.rs -o prof_lto

Output:
(takes longer, LTO compilation is slow)

STEP 5: RUN LTO BUILD

Command:
./prof_lto

Output you will see:
Iterations: 100
Total time: 18.877333ms
Average time per iteration: 188.773µs
Iterations per second: 5297.36

STEP 6: COMPARE RESULTS
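
Step 6 gives no procedure, so the following is only a sketch of one way to do the comparison, using the totals printed in steps 3 and 5. The function name `time_change_percent` is made up for illustration; a negative result means the second build spent less time.

```rust
/// Relative change in run time going from `base_ms` to `new_ms`, in percent.
/// Negative means the new build takes less time (faster).
fn time_change_percent(base_ms: f64, new_ms: f64) -> f64 {
    (new_ms - base_ms) / base_ms * 100.0
}

fn main() {
    // Totals reported in steps 3 and 5 (ARM64, a single run each).
    let default_ms = 21.286125;
    let lto_ms = 18.877333;
    println!(
        "LTO vs default: {:+.1}% run time",
        time_change_percent(default_ms, lto_ms)
    );
}
```

With the two totals above this prints a change of -11.3%, which matches the figure quoted under PERFORMANCE RESULTS.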

STEP 7: MEASURE WALL CLOCK TIME

Command (Default):
time ./prof_default

Output:
Iterations: 100
Total time: 21.286125ms
Average time per iteration: 212.861µs
Iterations per second: 4697.90
real 0m0.844s
user 0m0.023s
sys 0m0.002s

Command (LTO):
time ./prof_lto

PERFORMANCE RESULTS

ARM64 (Apple Silicon, my testing):
Default: 21.286ms
LTO: 18.877ms
Improvement: 11.3% less run time (FASTER)

x86_64 (Intel/AMD, seq_integers from the CodSpeed report above):
Default: 26.1ms
LTO: 27.5ms
Regression: 5.06% as reported by CodSpeed (SLOWER)

REGISTER ANALYSIS

Code needs: 17 registers

ARM64 (31 available, x0-x30):
Utilization: 55%
Pressure: LOW
Spilling: NO
Result: LTO helps

x86_64 (16 available):
Utilization: 106% (OVERFLOW)
Pressure: HIGH
Spilling: YES
Result: LTO hurts
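
The utilization figures are a plain ratio of values needed to registers available. A minimal sketch of that arithmetic for the x86_64 case follows; the function names are hypothetical, and treating every architectural register as allocatable is a simplifying assumption (the stack pointer, for one, is not).

```rust
/// Percentage of the available registers that `needed` live values would occupy.
fn utilization_percent(needed: u32, available: u32) -> f64 {
    f64::from(needed) / f64::from(available) * 100.0
}

/// How many live values exceed the register file (0 if everything fits).
fn shortfall(needed: u32, available: u32) -> u32 {
    needed.saturating_sub(available)
}

fn main() {
    // 17 values needed, 16 general-purpose registers on x86_64 (figures from the comment).
    let (needed, available) = (17, 16);
    println!(
        "utilization: {:.0}%, values that do not fit: {}",
        utilization_percent(needed, available),
        shortfall(needed, available)
    );
}
```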

LLVM IR COMPARISON

Default Build: 206 lines
LTO Build: 206 lines
Difference: 0% (IDENTICAL)

Assembly Comparison:

Default Build: 107 lines
LTO Build: 107 lines
Difference: 0% (IDENTICAL)

Hot Loop: 41 instructions (both)
Registers Used: 17/31 (both)
Stack Spills: 0 (both)

MECHANISM

Without LTO:
Conservative register allocation
Some stack spilling
Performance: 26.1ms (x86_64)

With LTO:
Aggressive inlining
More live values
Need 17+ registers
x86_64 only has 16
MORE stack spilling
Performance: 27.5ms (x86_64)

Chain:
LTO → More inlining → More live values → More register pressure
→ More stack spilling → Slower memory access → Performance loss
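
One way to probe this chain, offered here as an assumption and not something done in this thread, is to build the same loop twice: once with the addition forced inline and once with it forced out of line, then compare timings and the generated assembly on each platform. All names below are hypothetical.

```rust
use std::hint::black_box;
use std::time::Instant;

type Limbs = [u64; 4];

/// 256-bit wrapping add over four u64 limbs, least significant first.
#[inline(always)]
fn add_limbs(a: Limbs, b: Limbs) -> Limbs {
    let mut out = [0u64; 4];
    let mut carry = false;
    for i in 0..4 {
        let (s1, c1) = a[i].overflowing_add(b[i]);
        let (s2, c2) = s1.overflowing_add(carry as u64);
        out[i] = s2;
        carry = c1 || c2;
    }
    out
}

/// Same arithmetic, but the call itself is kept out of line.
#[inline(never)]
fn add_limbs_outlined(a: Limbs, b: Limbs) -> Limbs {
    add_limbs(a, b)
}

/// Sums the low limb of 1, 2, ..., n using the supplied add function.
/// Generic over `F` so each variant gets its own monomorphized loop.
fn run<F: Fn(Limbs, Limbs) -> Limbs>(n: u64, add: F) -> u64 {
    let one = [1, 0, 0, 0];
    let mut value = one;
    let mut sum = 0u64;
    for _ in 0..n {
        sum = sum.wrapping_add(value[0]);
        value = add(value, one);
    }
    sum
}

fn main() {
    let n = black_box(100_000u64);

    let start = Instant::now();
    let inlined = black_box(run(n, add_limbs));
    println!("inlined:  {:?}", start.elapsed());

    let start = Instant::now();
    let outlined = black_box(run(n, add_limbs_outlined));
    println!("outlined: {:?}", start.elapsed());

    // Both variants must compute the same sum.
    assert_eq!(inlined, outlined);
}
```

If the inlined variant is slower only on x86_64, that would be consistent with the register-pressure explanation; a single timing per variant is noisy, so repeat the measurement before drawing conclusions.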

ARCHITECTURE DIFFERENCES

ARM64 (nearly 2x as many general-purpose registers as x86_64):
GPRs: 31
L1 I-Cache: 192 KB
L1 D-Cache: 128 KB
Pipeline: 8-stage
LTO Effect: +11.3%

x86_64:
GPRs: 16
L1 I-Cache: 32 KB
L1 D-Cache: 32 KB
Pipeline: 14-19 stage
LTO Effect: -5%

CONCLUSION

Bug is REAL but PLATFORM-SPECIFIC

ARM64 (31 registers):
LTO improves performance (+11.3%)
No register pressure
Optimization works well

x86_64 (16 registers):
LTO hurts performance (-5%)
High register pressure
More stack spilling

Therefore:
More registers → LTO helps
Fewer registers → LTO hurts

@naoNao89
Contributor

naoNao89 commented Nov 8, 2025

Run Multiple Times for Stability

for i in {1..5}; do ./prof_default; done
for i in {1..5}; do ./prof_lto; done

Calculate average for more accurate results
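
The averaging can also be done in-process. The sketch below times a closure several times and reports the mean and sample standard deviation; the helper names and the placeholder workload are assumptions for illustration, not part of the test program above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns each run's wall-clock duration.
fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Mean and sample standard deviation in milliseconds.
/// Returns NaN for the mean if `samples` is empty.
fn mean_and_stddev_ms(samples: &[Duration]) -> (f64, f64) {
    let ms: Vec<f64> = samples.iter().map(|d| d.as_secs_f64() * 1e3).collect();
    let n = ms.len() as f64;
    let mean = ms.iter().sum::<f64>() / n;
    let variance = if ms.len() > 1 {
        ms.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
    } else {
        0.0
    };
    (mean, variance.sqrt())
}

fn main() {
    // Placeholder workload; substitute the code under test.
    let samples = time_runs(5, || {
        let s = (0..1_000_000u64).fold(0u64, |acc, x| acc.wrapping_add(black_box(x)));
        black_box(s);
    });
    let (mean, stddev) = mean_and_stddev_ms(&samples);
    println!("mean {:.3} ms, stddev {:.3} ms over {} runs", mean, stddev, samples.len());
}
```

A standard deviation that is large relative to the gap between the two builds suggests the difference is within noise.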

@oech3
Contributor Author

oech3 commented Nov 8, 2025

Your .rs as posted does not compile for me on 1.91.0 (Arch Linux)... But I can run RUSTFLAGS="-C... " cargo --profile=... build -p uu_... instead.

@oech3 oech3 closed this Nov 9, 2025
@oech3 oech3 deleted the patch-1 branch November 9, 2025 05:51