
[AArch64] Suboptimal code for 128bit multiplication by constants #161451

@Kmeakin

Description


https://godbolt.org/z/ejhrxofdb

When multiplying an unsigned 128-bit integer by certain constants on AArch64, GCC generates faster and/or smaller code than LLVM.
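For reference, the functions being compiled are presumably plain multiplications of an unsigned __int128 by a constant. The sketch below is a reconstruction from the function signatures shown in the listings (mul_3(unsigned __int128), mul_10(unsigned __int128)); the actual source is only in the godbolt link. Note that unsigned __int128 is a GCC/Clang extension available on 64-bit targets such as AArch64.

```c
/* Presumed test cases (reconstructed from the mangled-name signatures,
   not copied from the godbolt link). */
unsigned __int128 mul_3(unsigned __int128 x) { return x * 3; }
unsigned __int128 mul_10(unsigned __int128 x) { return x * 10; }
```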

Example 1

E.g. for x * 3, GCC generates code that is both smaller and faster:

LLVM

mul_3(unsigned __int128):
        mov     w8, #3
        add     x9, x1, x1, lsl #1
        umulh   x8, x0, x8
        add     x0, x0, x0, lsl #1
        add     x1, x8, x9
        ret
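To make the LLVM sequence easier to follow, here is a C model of it, one statement per instruction. It treats the 128-bit value as two 64-bit limbs (low half in x0, high half in x1, as in the listing): the low limb is x0 * 3, and the high limb is x1 * 3 plus the high 64 bits of x0 * 3, which is what umulh produces. The names u128_limbs, umulh and llvm_mul_3 are hypothetical helpers for illustration, and umulh is emulated with a 128-bit multiply rather than the real instruction.

```c
#include <stdint.h>

/* Hypothetical: a 128-bit value split into its 64-bit halves (x0 = lo, x1 = hi). */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Emulates the AArch64 umulh instruction: high 64 bits of a 64x64 product. */
static uint64_t umulh(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Model of LLVM's mul_3 output, instruction by instruction. */
static u128_limbs llvm_mul_3(u128_limbs x) {
    uint64_t w8 = 3;                     /* mov   w8, #3             */
    uint64_t x9 = x.hi + (x.hi << 1);    /* add   x9, x1, x1, lsl #1 */
    uint64_t x8 = umulh(x.lo, w8);       /* umulh x8, x0, x8         */
    u128_limbs r;
    r.lo = x.lo + (x.lo << 1);           /* add   x0, x0, x0, lsl #1 */
    r.hi = x8 + x9;                      /* add   x1, x8, x9         */
    return r;
}
```

The umulh is the expensive part: in the llvm-mca table that follows it has a latency of 5 and a reciprocal throughput of 2.0, versus 1 to 2 cycles for the shifted adds.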

Iterations:        100
Instructions:      600
Total Cycles:      602
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    1.00
IPC:               1.00
Block RThroughput: 2.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov	w8, #3
 1      2     0.33                        add	x9, x1, x1, lsl #1
 1      5     2.00                        umulh	x8, x0, x8
 1      2     0.33                        add	x0, x0, x0, lsl #1
 1      1     0.33                        add	x1, x8, x9
 1      1     1.00                  U     ret

GCC

mul_3(unsigned __int128):
        lsl     x2, x0, 1
        extr    x3, x1, x0, 63
        adds    x0, x2, x0
        adc     x1, x3, x1
        ret
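GCC instead computes 3 * x as (x << 1) + x entirely with shifts and a carry-propagating add, so no multiplier is involved. In the C model below (hypothetical names again), extr x3, x1, x0, 63 extracts 64 bits starting at bit 63 of the pair x1:x0, i.e. (x1 << 1) | (x0 >> 63), which is the high limb of x << 1. The adds/adc pair then performs a 128-bit addition, with the carry out of the low limb recovered by an unsigned comparison.

```c
#include <stdint.h>

/* Hypothetical: a 128-bit value split into its 64-bit halves (x0 = lo, x1 = hi). */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Model of GCC's mul_3 output: 3*x = (x << 1) + x across two limbs. */
static u128_limbs gcc_mul_3(u128_limbs x) {
    uint64_t x2 = x.lo << 1;                    /* lsl  x2, x0, 1          */
    uint64_t x3 = (x.hi << 1) | (x.lo >> 63);   /* extr x3, x1, x0, 63     */
    u128_limbs r;
    r.lo = x2 + x.lo;                           /* adds x0, x2, x0         */
    uint64_t carry = r.lo < x2;                 /* carry flag set by adds  */
    r.hi = x3 + x.hi + carry;                   /* adc  x1, x3, x1         */
    return r;
}
```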

Iterations:        100
Instructions:      500
Total Cycles:      302
Total uOps:        500

Dispatch Width:    3
uOps Per Cycle:    1.66
IPC:               1.66
Block RThroughput: 1.7


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl	x2, x0, #1
 1      2     0.33                        extr	x3, x1, x0, #63
 1      1     0.33                        adds	x0, x2, x0
 1      1     0.33                        adc	x1, x3, x1
 1      1     1.00                  U     ret
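Dividing the llvm-mca totals by the 100 simulated iterations gives the per-iteration estimate for Example 1. These are static llvm-mca estimates for whichever CPU model the godbolt link selects, not hardware measurements:

$$
\text{LLVM: } \frac{602}{100} \approx 6.0 \ \text{cycles/iteration}, \qquad
\text{GCC: } \frac{302}{100} \approx 3.0 \ \text{cycles/iteration}, \qquad
\frac{602}{302} \approx 2.0
$$

So GCC's sequence is estimated at roughly twice the throughput while also being one instruction shorter (5 vs 6, including ret).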

Example 2

E.g. for x * 10, GCC generates code that is longer but faster than LLVM's:

LLVM

mul_10(unsigned __int128):
        mov     w8, #10
        umulh   x9, x0, x8
        madd    x1, x1, x8, x9
        add     x8, x0, x0, lsl #2
        lsl     x0, x8, #1
        ret
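As with x * 3, LLVM splits the product into limbs: the high limb is x1 * 10 plus the high 64 bits of x0 * 10 (umulh feeding madd), and the low limb is built as (x0 + (x0 << 2)) << 1. The C model below uses hypothetical helper names and emulates umulh with a 128-bit multiply. Note that madd depends on the result of umulh, so two 5-cycle instructions sit on the same dependency chain.

```c
#include <stdint.h>

/* Hypothetical: a 128-bit value split into its 64-bit halves (x0 = lo, x1 = hi). */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Emulates the AArch64 umulh instruction: high 64 bits of a 64x64 product. */
static uint64_t umulh(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Model of LLVM's mul_10 output, instruction by instruction. */
static u128_limbs llvm_mul_10(u128_limbs x) {
    uint64_t w8 = 10;                    /* mov   w8, #10            */
    uint64_t x9 = umulh(x.lo, w8);       /* umulh x9, x0, x8         */
    u128_limbs r;
    r.hi = x.hi * w8 + x9;               /* madd  x1, x1, x8, x9     */
    uint64_t x8 = x.lo + (x.lo << 2);    /* add   x8, x0, x0, lsl #2 */
    r.lo = x8 << 1;                      /* lsl   x0, x8, #1         */
    return r;
}
```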

Iterations:        100
Instructions:      600
Total Cycles:      1002
Total uOps:        600

Dispatch Width:    3
uOps Per Cycle:    0.60
IPC:               0.60
Block RThroughput: 4.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.33                        mov	w8, #10
 1      5     2.00                        umulh	x9, x0, x8
 1      5     2.00                        madd	x1, x1, x8, x9
 1      2     0.33                        add	x8, x0, x0, lsl #2
 1      2     0.33                        lsl	x0, x8, #1
 1      1     1.00                  U     ret

GCC

mul_10(unsigned __int128):
        lsl     x2, x0, 2
        extr    x3, x1, x0, 62
        adds    x2, x2, x0
        adc     x1, x3, x1
        lsl     x0, x2, 1
        extr    x1, x1, x2, 63
        ret
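GCC factors 10 as (4 + 1) * 2: it first computes 5 * x = (x << 2) + x with a two-limb shift (lsl/extr) and an adds/adc pair, then doubles the result with another two-limb shift. The C model below mirrors that, using hypothetical names; as in the x * 3 case, each extr is modeled as (hi << n) | (lo >> (64 - n)).

```c
#include <stdint.h>

/* Hypothetical: a 128-bit value split into its 64-bit halves (x0 = lo, x1 = hi). */
typedef struct { uint64_t lo, hi; } u128_limbs;

/* Model of GCC's mul_10 output: 10*x = ((x << 2) + x) << 1. */
static u128_limbs gcc_mul_10(u128_limbs x) {
    /* Step 1: 5*x = (x << 2) + x */
    uint64_t x2 = x.lo << 2;                      /* lsl  x2, x0, 2        */
    uint64_t x3 = (x.hi << 2) | (x.lo >> 62);     /* extr x3, x1, x0, 62   */
    uint64_t lo5 = x2 + x.lo;                     /* adds x2, x2, x0       */
    uint64_t carry = lo5 < x2;                    /* carry flag from adds  */
    uint64_t hi5 = x3 + x.hi + carry;             /* adc  x1, x3, x1       */
    /* Step 2: 10*x = (5*x) << 1 */
    u128_limbs r;
    r.lo = lo5 << 1;                              /* lsl  x0, x2, 1        */
    r.hi = (hi5 << 1) | (lo5 >> 63);              /* extr x1, x1, x2, 63   */
    return r;
}
```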

Iterations:        100
Instructions:      700
Total Cycles:      502
Total uOps:        700

Dispatch Width:    3
uOps Per Cycle:    1.39
IPC:               1.39
Block RThroughput: 2.3


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      2     0.33                        lsl	x2, x0, #2
 1      2     0.33                        extr	x3, x1, x0, #62
 1      1     0.33                        adds	x2, x2, x0
 1      1     0.33                        adc	x1, x3, x1
 1      2     0.33                        lsl	x0, x2, #1
 1      2     0.33                        extr	x1, x1, x2, #63
 1      1     1.00                  U     ret
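The same per-iteration arithmetic for Example 2 (again llvm-mca estimates, not hardware measurements):

$$
\text{LLVM: } \frac{1002}{100} \approx 10.0 \ \text{cycles/iteration}, \qquad
\text{GCC: } \frac{502}{100} \approx 5.0 \ \text{cycles/iteration}, \qquad
\frac{1002}{502} \approx 2.0
$$

So GCC's sequence is estimated at roughly twice the throughput even though it is one instruction longer (7 vs 6, including ret).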
