-
Notifications
You must be signed in to change notification settings - Fork 15.4k
Description
In a tight while-loop iterating over a large buffer, clang++ inserts a series of addl instructions inside the loop that hurts performance alot.
This is a problem in many kinds of code that runs over large buffers, such as data compression, etc.
Assume we have this construction:
while (*(src + b) == *(buf + b) && src + b < last) {
b++;
}
if ((b <= 6)) {
ret = (b << 13);
}
else if (b <= 14) {
ret = (b << 18);
}
else if (b <= 62) {
ret = (b << 22);
}
else {
ret = (b << 50);
}
The while-loop disassembles to this:
.LBB1_4: # Parent Loop BB1_1 Depth=1
leaq (%rax,%r8), %rdx
leaq 1(%r8), %rsi
addl $8192, %ebp # imm = 0x2000
addl $262144, %edi # imm = 0x40000
addl $4194304, %ecx # imm = 0x400000
cmpq $99999999, %rdx # imm = 0x5F5E0FF
jg .LBB1_6
# %bb.5: # in Loop: Header=BB1_4 Depth=3
movzbl (%r8,%r12), %r9d
movq %rsi, %r8
cmpb %r9b, (%r12,%rdx)
je .LBB1_4
.LBB1_6:
It looks like clang++ attempts to optimize for the later if-else statements by precomputing the left-shifted values of b.
If we place the while loop in a separate function we can avoid this and get following timings:
With addl: 43 milliseconds
Without addl: 23 milliseconds
GCC and Visual C++ don't have this problem and run at around 23 milliseconds in both cases.
Compile command line is: clang++ -O3 -std=c++20 demo.cpp.
Version:
Ubuntu clang version 20.1.2 (0ubuntu1)
Target: x86_64-pc-linux-gnu
Thread model: posix
Sample code that includes a benchmark: demo.txt