AArch64: folly::ConcurrentHashMap codegen could be better / LLVM is conservative around atomics #160733

@MatzeB

We noticed that codegen in the context of folly::ConcurrentHashMap (i.e. code with a bunch of std::atomic) could be better. I am told this is a reduction of an important function that a colleague tried to vectorize:

#include <arm_neon.h>
#include <atomic>

struct __attribute__((packed)) mystruct {
    std::atomic<uint64_t> low_;
    std::atomic<uint64_t> hi_;
};

uint64_t occupiedMask(mystruct& tags_, uint64_t kFullMask) {
    uint64x2_t vec;
    vec[0] = tags_.low_.load(std::memory_order_relaxed);
    vec[1] = tags_.hi_.load(std::memory_order_relaxed);
    // signed shift extends top bit to all bits
    auto occupiedV =
        vreinterpretq_u8_s8(vshrq_n_s8(vreinterpretq_s8_u64(vec), 7));
    uint8x8_t maskV = vshrn_n_u16(vreinterpretq_u16_u8(occupiedV), 4);
    return vget_lane_u64(vreinterpret_u64_u8(maskV), 0) & kFullMask;
}

Currently produces:

occupiedMask(mystruct&, unsigned long):
        ldr     x8, [x0]
        ldr     x9, [x0, #8]
        fmov    d0, x8
        mov     v0.d[1], x9
        cmlt    v0.16b, v0.16b, #0
        shrn    v0.8b, v0.8h, #4
        fmov    x8, d0
        and     x0, x8, x1
        ret

(godbolt equivalent: https://godbolt.org/z/xobWMhe7W )

but we think this could ideally be the following (equivalent to the code you currently get when removing the atomics):

        ldr     q0, [x0]
        cmlt    v0.16b, v0.16b, #0
        shrn    v0.8b, v0.8h, #4
        fmov    x8, d0
        and     x0, x8, x1
        ret
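
For readers less familiar with NEON, here is a rough scalar model of the value occupiedMask computes. It is only an illustrative sketch, not code from folly or LLVM: the name occupiedMaskScalar and the plain uint64_t parameters are hypothetical, and it assumes a little-endian lane layout (the usual AArch64 configuration). Each of the 16 tag bytes contributes one 4-bit nibble to the result, set to 0xF when the byte's top bit is set.

```cpp
#include <cstdint>

// Hypothetical scalar model of the NEON sequence (cmlt + shrn #4).
// Assumes little-endian layout: bytes 0..7 come from `low`, 8..15 from `hi`.
uint64_t occupiedMaskScalar(uint64_t low, uint64_t hi, uint64_t kFullMask) {
    uint64_t mask = 0;
    for (int j = 0; j < 16; ++j) {
        // Pick tag byte j out of the two 64-bit halves.
        uint64_t word = (j < 8) ? low : hi;
        uint8_t tag = static_cast<uint8_t>(word >> (8 * (j % 8)));
        // The signed shift by 7 turns a byte with its top bit set into 0xFF;
        // the narrowing shift by 4 then keeps one nibble per byte.
        if (tag & 0x80) {
            mask |= uint64_t{0xF} << (4 * j);
        }
    }
    return mask & kFullMask;
}
```

Nothing in this model depends on the two halves being loaded separately, which is why a single 128-bit load would suffice when no atomicity is required.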
