We noticed that codegen for code around folly::ConcurrentHashMap (that is, code with a lot of std::atomic) could be better. I am told this is a reduction of an important function that a colleague tried to vectorize:
#include <arm_neon.h>
#include <atomic>

struct __attribute__((packed)) mystruct {
  std::atomic<uint64_t> low_;
  std::atomic<uint64_t> hi_;
};

uint64_t occupiedMask(mystruct& tags_, uint64_t kFullMask) {
  uint64x2_t vec;
  vec[0] = tags_.low_.load(std::memory_order_relaxed);
  vec[1] = tags_.hi_.load(std::memory_order_relaxed);
  // signed shift extends top bit to all bits
  auto occupiedV =
      vreinterpretq_u8_s8(vshrq_n_s8(vreinterpretq_s8_u64(vec), 7));
  uint8x8_t maskV = vshrn_n_u16(vreinterpretq_u16_u8(occupiedV), 4);
  return vget_lane_u64(vreinterpret_u64_u8(maskV), 0) & kFullMask;
}
Currently produces:
occupiedMask(mystruct&, unsigned long):
        ldr     x8, [x0]
        ldr     x9, [x0, #8]
        fmov    d0, x8
        mov     v0.d[1], x9
        cmlt    v0.16b, v0.16b, #0
        shrn    v0.8b, v0.8h, #4
        fmov    x8, d0
        and     x0, x8, x1
        ret
(godbolt equivalent: https://godbolt.org/z/xobWMhe7W )
but we think this could ideally be (equivalent to the code you currently get when removing the atomics):
        ldr     q0, [x0]
        cmlt    v0.16b, v0.16b, #0
        shrn    v0.8b, v0.8h, #4
        fmov    x8, d0
        and     x0, x8, x1
        ret
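For anyone checking that both instruction sequences compute the same value, here is a rough scalar model of what the cmlt/shrn pair in occupiedMask does. It is only a sketch, not part of the original reduction: the name occupiedMaskScalar is made up for illustration, it takes the two already-loaded 64-bit halves as plain integers, and it assumes the little-endian lane layout used on typical AArch64 targets.

```cpp
#include <cstdint>

// Hypothetical scalar reference for the NEON sequence in occupiedMask.
// Assumption: little-endian layout, so byte j of the 16-byte vector is
// byte j of `low` for j < 8 and byte (j - 8) of `hi` otherwise.
uint64_t occupiedMaskScalar(uint64_t low, uint64_t hi, uint64_t kFullMask) {
  uint64_t mask = 0;
  for (int j = 0; j < 16; ++j) {
    const uint64_t word = (j < 8) ? low : hi;
    const uint8_t byte = static_cast<uint8_t>(word >> (8 * (j % 8)));
    // cmlt #0 (the signed shift by 7) turns a byte with its top bit set
    // into 0xFF; shrn #4 then keeps one nibble per input byte.
    if (byte & 0x80) {
      mask |= uint64_t{0xF} << (4 * j);
    }
  }
  return mask & kFullMask;
}
```

In this model, nibble j of the result is 0xF exactly when byte j of the tag vector has its top bit set, so the result depends only on the 16 loaded bytes and not on whether they arrive through two 64-bit loads or one 128-bit load.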