Cargo.toml: move codegen-units to release profile #9161
oech3 wants to merge 1 commit into uutils:main
CodSpeed Performance Report: Merging #9161 will degrade performance by 10.02%. |
|
@naoNao89 Do you have any idea what is behind the CodSpeed result? |
|
GNU testsuite comparison: |
|
Thank you. But this PR forces 1 CGU, which causes regressions even though that setting is recommended. Should this be opened as an issue against Rust?
… Multiple CGUs with aggressive optimization can cause unexpected regressions.
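For context, the setting under discussion is a Cargo profile key. A minimal sketch of what "forcing 1 cgu" in the release profile could look like follows; the value `1` is an assumption taken from the comment above, and this is not the PR's actual diff.

```toml
# Hypothetical Cargo.toml fragment, not copied from the PR.
[profile.release]
# Assumed value: compile each crate as a single codegen unit.
# Fewer units give LLVM more cross-function optimization scope
# at the cost of parallel compile time.
codegen-units = 1
```

Keys under `[profile.release]` only affect release builds, so dev builds keep Cargo's default number of codegen units.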
|
|
GNU testsuite comparison: |
|
uutils uses: Line 11 in adcfcfa. Do the same regressions occur on latest through nightly? |
|
@oech3, can you help me test on the x86 platform? So far, I've only tested on ARM64 (macOS).
STEP 1: GET THE CODE
use std::ops::Add;

#[derive(Clone, Debug)]
struct ComplexNum {
    data: [u64; 4],
}

impl Add for ComplexNum {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            let sum = self.data[i] as u128 + other.data[i] as u128 + carry as u128;
            result[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        ComplexNum { data: result }
    }
}

impl Ord for ComplexNum {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        for i in (0..4).rev() {
            match self.data[i].cmp(&other.data[i]) {
                std::cmp::Ordering::Equal => continue,
                other => return other,
            }
        }
        std::cmp::Ordering::Equal
    }
}

#[inline(never)]
fn compute_sequence(first: ComplexNum, increment: ComplexNum, last: ComplexNum) -> u64 {
    let mut value = first;
    let mut sum = 0u64;
    while value <= last {
        sum = sum.wrapping_add(value.data[0]);
        value = value + increment.clone();
    }
    sum
}

fn main() {
    let iterations = 100;
    let first = ComplexNum::new(1);
    let increment = ComplexNum::new(1);
    let last = ComplexNum::new(100000);
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        let _ = compute_sequence(first.clone(), increment.clone(), last.clone());
    }
    let duration = start.elapsed();
    println!("Total time: {:?}", duration);
    println!("Per iteration: {:?}", duration / iterations as u32);
}
STEP 2: COMPILE DEFAULT BUILD
Command:
Output:
STEP 3: RUN DEFAULT BUILD
Command:
Output you will see:
STEP 4: COMPILE LTO BUILD
Command:
Output:
STEP 5: RUN LTO BUILD
Command:
Output you will see:
STEP 6: COMPARE RESULTS
STEP 7: MEASURE WALL CLOCK TIME
Command (Default):
Output:
Command (LTO):
PERFORMANCE RESULTS
ARM64 (Apple Silicon) (my testing)
x86_64 (Intel/AMD - Original Report):
REGISTER ANALYSIS
Code needs: 17 registers
ARM64 (32 available):
x86_64 (16 available):
LLVM IR COMPARISON
Default Build: 206 lines
Assembly Comparison:
Default Build: 107 lines
Hot Loop: 41 instructions (both)
MECHANISM
Without LTO:
With LTO:
Chain:
ARCHITECTURE DIFFERENCES
ARM64 (twice as many registers as x86_64):
x86_64:
CONCLUSION
Bug is REAL but PLATFORM-SPECIFIC
ARM64 (32 registers):
x86_64 (16 registers):
Therefore: |
|
Run Multiple Times for Stability
Calculate the average over the runs for more accurate results. |
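One way to follow this advice without extra tooling is to collect one wall-clock sample per run and report the mean alongside the spread. The sketch below is only an illustration: `time_runs`, `mean`, and the placeholder workload are hypothetical names and are not taken from the PR or from CodSpeed.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Arithmetic mean of a non-empty slice of timing samples.
fn mean(samples: &[Duration]) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    let total: Duration = samples.iter().sum();
    total / samples.len() as u32
}

/// Run `work` `runs` times and record one wall-clock sample per run.
/// `black_box` keeps the optimizer from discarding the result.
fn time_runs<T>(runs: usize, mut work: impl FnMut() -> T) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            black_box(work());
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Placeholder workload (an assumption); substitute the code under test.
    let samples = time_runs(10, || {
        (0..1_000_000u64).fold(0u64, |acc, x| acc.wrapping_add(black_box(x)))
    });
    let min = samples.iter().min().unwrap();
    let max = samples.iter().max().unwrap();
    println!("mean: {:?}, min: {:?}, max: {:?}", mean(&samples), min, max);
}
```

Reporting the minimum and maximum next to the mean makes it easier to tell a real regression from run-to-run noise.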
|
Not compilable your |
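The reproduction program posted in this thread calls `ComplexNum::new`, which is never defined, and implements `Ord` without `PartialOrd`, `PartialEq`, or `Eq`, which would explain a compile failure. A minimal sketch of a variant that compiles is shown below, under two assumptions that the thread does not confirm: `new` stores its argument in the lowest limb, and the missing comparison traits are meant to agree with the posted `cmp`.

```rust
use std::cmp::Ordering;
use std::ops::Add;

/// Four 64-bit limbs, least significant first.
#[derive(Clone, Debug, PartialEq, Eq)]
struct ComplexNum {
    data: [u64; 4],
}

impl ComplexNum {
    // Assumed constructor: not shown in the thread.
    fn new(value: u64) -> Self {
        ComplexNum { data: [value, 0, 0, 0] }
    }
}

impl Add for ComplexNum {
    type Output = Self;
    #[inline]
    fn add(self, other: Self) -> Self {
        let mut result = [0u64; 4];
        let mut carry = 0u64;
        for i in 0..4 {
            // Widen to u128 so the limb sum plus carry cannot overflow.
            let sum = self.data[i] as u128 + other.data[i] as u128 + carry as u128;
            result[i] = sum as u64;
            carry = (sum >> 64) as u64;
        }
        ComplexNum { data: result }
    }
}

impl Ord for ComplexNum {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare from the most significant limb down.
        for i in (0..4).rev() {
            match self.data[i].cmp(&other.data[i]) {
                Ordering::Equal => continue,
                ord => return ord,
            }
        }
        Ordering::Equal
    }
}

// Added so that `value <= last` type-checks; delegates to `cmp`.
impl PartialOrd for ComplexNum {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

#[inline(never)]
fn compute_sequence(first: ComplexNum, increment: ComplexNum, last: ComplexNum) -> u64 {
    let mut value = first;
    let mut sum = 0u64;
    while value <= last {
        sum = sum.wrapping_add(value.data[0]);
        value = value + increment.clone();
    }
    sum
}

fn main() {
    let iterations: u32 = 100;
    let (first, increment, last) =
        (ComplexNum::new(1), ComplexNum::new(1), ComplexNum::new(100_000));
    let start = std::time::Instant::now();
    for _ in 0..iterations {
        // black_box keeps the call from being optimized away.
        std::hint::black_box(compute_sequence(first.clone(), increment.clone(), last.clone()));
    }
    let duration = start.elapsed();
    println!("Total time: {:?}", duration);
    println!("Per iteration: {:?}", duration / iterations);
}
```

Timings from this variant are not guaranteed to match the figures reported earlier, since the original constructor may have differed.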
Retrying this since CodSpeed was unstable previously. The release page should provide this binary.