Poor optimization of switch statement in Clang 19.1.0 compared to Clang 18.1.0 #127365

@inicula

I was benchmarking switch statements against array lookups and found this change in code generation from Clang 18.1.0 to Clang 19.1.0 (and current trunk):

Clang 18.1.0 compiles the big switch into a constant lookup table:

        lea     rcx, [rip + .Lswitch.table.main]
.LBB0_12:
        movzx   edx, byte ptr [rbx + rax - 3]
        xor     edx, 128
        mov     rdx, qword ptr [rcx + 8*rdx]
        inc     byte ptr [rsp + rdx + 112]

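On the other hand, Clang 19.1.0 generates a separate block for each switch case. Each block computes the address of the counter and jumps to a shared block, which performs the increment and then dispatches to the next case through a jump table: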

.LBB0_28:
        lea     rcx, [rsp + 665] ; case 1: return 553;
        jmp     .LBB0_283
.LBB0_29:
        lea     rcx, [rsp + 653] ; case 2: return 541;
        jmp     .LBB0_283
; ..............................
.LBB0_283:
        inc     byte ptr [rcx]
        inc     rbp
        cmp     rbp, 300000000
        je      .LBB0_20
        movzx   ecx, byte ptr [rbx + rbp]
        movsxd  rdx, dword ptr [rax + 4*rcx]
        add     rdx, rax
        mov     rcx, r13
        jmp     rdx

This can tank performance when the branch predictor can't accurately predict the target of the indirect jump on each iteration. In my example the indexes are random, and with perf stat I'm seeing almost 300 million branch misses, roughly one for each of the 300,000,000 increment() invocations.
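
For reference, here is a minimal sketch of the two forms being compared: a dense switch that maps a byte to a slot index, and the equivalent hand-written constant table. The names (`slot_switch`, `slot_lookup`, `histogram_switch`, `histogram_table`), the eight-case mapping, and the counter type are made up for illustration; the actual reproducer is the Godbolt source, which has far more cases and different values. The sketch only shows that both forms compute the same result. It does not claim any particular code generation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the switch in the reproducer:
   maps a byte to a slot index (placeholder values). */
static inline size_t slot_switch(uint8_t key)
{
    switch (key) {
    case 0: return 7;
    case 1: return 3;
    case 2: return 5;
    case 3: return 1;
    case 4: return 6;
    case 5: return 0;
    case 6: return 2;
    case 7: return 4;
    default: return 0;
    }
}

/* The same mapping written as an explicit constant table. */
static const uint8_t slot_table[8] = { 7, 3, 5, 1, 6, 0, 2, 4 };

static inline size_t slot_lookup(uint8_t key)
{
    return key < 8 ? slot_table[key] : 0;
}

/* Histogram loop using the switch; the compiler is free to inline it. */
void histogram_switch(const uint8_t *keys, size_t n, uint32_t counts[8])
{
    for (size_t i = 0; i < n; i++)
        counts[slot_switch(keys[i])]++;
}

/* Histogram loop using the explicit table: one load, no branch on the key
   other than the bounds check. */
void histogram_table(const uint8_t *keys, size_t n, uint32_t counts[8])
{
    for (size_t i = 0; i < n; i++)
        counts[slot_lookup(keys[i])]++;
}
```

With random keys, the table form has no data-dependent indirect jump for the predictor to miss, which is the behavior the Clang 18 output above has.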

Assuming that this change isn't an intentional trade-off made for a benefit in some other use cases, this is a regression.

On my machine, the results of running that binary (same source code as the one in Godbolt) compiled with Clang 18 vs Clang 19 are as follows:

clang 18 = elapsed: 254ms sum: 28928
clang 19 = elapsed: 2813ms sum: 29184

So the binary generated by Clang 19 is about 11 times slower.

NOTE: The issue seems to be related to inlining, because if I add __attribute__((noinline)) to the increment() function, then Clang 19 optimizes it with a lookup table, just like Clang 18, and the result is much faster than what I get with inlining allowed:

elapsed: 548ms sum: 28928
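
The workaround could look roughly like the sketch below. The signature and body of `increment()` here are assumptions (the real function is in the Godbolt source), and `run` is a hypothetical driver loop. Only the placement of `__attribute__((noinline))`, a GCC/Clang extension, is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of increment(): bump the counter selected by a switch.
   Case values are placeholders, not the ones from the reproducer. */
__attribute__((noinline))
void increment(uint8_t *counts, uint8_t key)
{
    size_t slot;
    switch (key) {
    case 0: slot = 2; break;
    case 1: slot = 0; break;
    case 2: slot = 3; break;
    case 3: slot = 1; break;
    default: slot = 0; break;
    }
    counts[slot]++;
}

/* Hypothetical driver: increment() is now called once per key
   instead of being inlined into the loop body. */
void run(uint8_t *counts, const uint8_t *keys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        increment(counts, keys[i]);
}
```

This trades a call per iteration for the table lookup. The 548ms figure above is still slower than the 254ms Clang 18 result with inlining.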
