Description
https://godbolt.org/z/ejhrxofdb
When multiplying an unsigned __int128 by certain constants on AArch64, GCC generates faster and/or smaller code than LLVM.
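For readers who cannot open the Godbolt link, the source is presumably along these lines. This is a hypothetical reproducer: only the function names and the `unsigned __int128` parameter type appear in the listings, so the return types and bodies are assumptions.

```cpp
// Hypothetical reproducer. Names and parameter types are taken from the
// assembly listings; return types and bodies are assumed.
unsigned __int128 mul_3(unsigned __int128 x) { return x * 3; }

unsigned __int128 mul_10(unsigned __int128 x) { return x * 10; }
```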
Example 1
E.g. for x * 3, GCC generates code that is both smaller and faster:
LLVM
mul_3(unsigned __int128):
mov w8, #3
add x9, x1, x1, lsl #1
umulh x8, x0, x8
add x0, x0, x0, lsl #1
add x1, x8, x9
ret
Iterations: 100
Instructions: 600
Total Cycles: 602
Total uOps: 600
Dispatch Width: 3
uOps Per Cycle: 1.00
IPC: 1.00
Block RThroughput: 2.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 mov w8, #3
1 2 0.33 add x9, x1, x1, lsl #1
1 5 2.00 umulh x8, x0, x8
1 2 0.33 add x0, x0, x0, lsl #1
1 1 0.33 add x1, x8, x9
1 1 1.00 U ret
GCC
mul_3(unsigned __int128):
lsl x2, x0, 1
extr x3, x1, x0, 63
adds x0, x2, x0
adc x1, x3, x1
ret
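To make the difference between the two sequences easier to follow, here is a rough C++ model of both on 64-bit halves (`lo` for x0, `hi` for x1). The helper names `mul3_umulh` and `mul3_shift_add` are made up for illustration, and the mapping from statements to instructions is my reading of the listings, not compiler output.

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;

// Model of the LLVM sequence: multiply the low half by 3 and take the
// upper 64 bits of that product (umulh) as the carry into the high half.
inline u128 mul3_umulh(uint64_t lo, uint64_t hi) {
    uint64_t hi3   = hi + (hi << 1);                    // add x9, x1, x1, lsl #1
    uint64_t carry = (uint64_t)(((u128)lo * 3) >> 64);  // mov w8, #3 ; umulh x8, x0, x8
    uint64_t lo3   = lo + (lo << 1);                    // add x0, x0, x0, lsl #1
    uint64_t res_hi = carry + hi3;                      // add x1, x8, x9
    return ((u128)res_hi << 64) | lo3;
}

// Model of the GCC sequence: x * 3 == (x << 1) + x, done as a 128-bit
// shift (lsl + extr) followed by a 128-bit add (adds + adc).
inline u128 mul3_shift_add(uint64_t lo, uint64_t hi) {
    uint64_t lo2 = lo << 1;                   // lsl  x2, x0, 1
    uint64_t hi2 = (hi << 1) | (lo >> 63);    // extr x3, x1, x0, 63
    uint64_t sum_lo = lo2 + lo;               // adds x0, x2, x0
    uint64_t carry  = sum_lo < lo2 ? 1 : 0;   // carry flag set by adds
    uint64_t sum_hi = hi2 + hi + carry;       // adc  x1, x3, x1
    return ((u128)sum_hi << 64) | sum_lo;
}
```

Both functions should agree with a plain `x * 3` (modulo 2^128) for every input; the second one avoids the multiplier entirely, which is where the latency difference in the reports comes from.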
Iterations: 100
Instructions: 500
Total Cycles: 302
Total uOps: 500
Dispatch Width: 3
uOps Per Cycle: 1.66
IPC: 1.66
Block RThroughput: 1.7
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 2 0.33 lsl x2, x0, #1
1 2 0.33 extr x3, x1, x0, #63
1 1 0.33 adds x0, x2, x0
1 1 0.33 adc x1, x3, x1
1 1 1.00 U ret
Example 2
E.g. for x * 10, GCC generates code that is longer than LLVM's, but faster:
LLVM
mul_10(unsigned __int128):
mov w8, #10
umulh x9, x0, x8
madd x1, x1, x8, x9
add x8, x0, x0, lsl #2
lsl x0, x8, #1
ret
Iterations: 100
Instructions: 600
Total Cycles: 1002
Total uOps: 600
Dispatch Width: 3
uOps Per Cycle: 0.60
IPC: 0.60
Block RThroughput: 4.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 1 0.33 mov w8, #10
1 5 2.00 umulh x9, x0, x8
1 5 2.00 madd x1, x1, x8, x9
1 2 0.33 add x8, x0, x0, lsl #2
1 2 0.33 lsl x0, x8, #1
1 1 1.00 U ret
GCC
mul_10(unsigned __int128):
lsl x2, x0, 2
extr x3, x1, x0, 62
adds x2, x2, x0
adc x1, x3, x1
lsl x0, x2, 1
extr x1, x1, x2, 63
ret
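The GCC sequence for x * 10 appears to compute ((x << 2) + x) << 1 across the two 64-bit halves. A minimal C++ sketch of that decomposition follows; the name `mul10_shift_add` is hypothetical and the instruction comments reflect my reading of the listing.

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;

// Model of the GCC sequence: x * 10 == ((x << 2) + x) << 1, using only
// shifts, extr and add-with-carry on the low (x0) and high (x1) halves.
inline u128 mul10_shift_add(uint64_t lo, uint64_t hi) {
    uint64_t lo4 = lo << 2;                   // lsl  x2, x0, 2
    uint64_t hi4 = (hi << 2) | (lo >> 62);    // extr x3, x1, x0, 62
    uint64_t lo5 = lo4 + lo;                  // adds x2, x2, x0
    uint64_t carry = lo5 < lo4 ? 1 : 0;       // carry flag set by adds
    uint64_t hi5 = hi4 + hi + carry;          // adc  x1, x3, x1
    uint64_t lo10 = lo5 << 1;                 // lsl  x0, x2, 1
    uint64_t hi10 = (hi5 << 1) | (lo5 >> 63); // extr x1, x1, x2, 63
    return ((u128)hi10 << 64) | lo10;
}
```

The result should match a plain `x * 10` modulo 2^128. The sequence is one instruction longer than LLVM's, but none of its instructions go through the multiplier.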
Iterations: 100
Instructions: 700
Total Cycles: 502
Total uOps: 700
Dispatch Width: 3
uOps Per Cycle: 1.39
IPC: 1.39
Block RThroughput: 2.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 2 0.33 lsl x2, x0, #2
1 2 0.33 extr x3, x1, x0, #62
1 1 0.33 adds x2, x2, x0
1 1 0.33 adc x1, x3, x1
1 2 0.33 lsl x0, x2, #1
1 2 0.33 extr x1, x1, x2, #63
1 1 1.00 U ret