Description
The kernel memcpy used for ARCv3 is the default (generic) kernel implementation, which copies one byte at a time. This is inefficient, especially with the hardware prefetcher disabled (as a result of hardware bugs).
The ARC port provides several memcpy implementations, but none for ARCv3. For example, "memcpy-archs-unaligned.S" is not used for ARCv3 because it relies on zero-overhead loop instructions that this architecture does not have. However, it can easily be adapted to ARCv3.
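The main change needed for ARCv3 is in loop control. On ARCv2, `lpnz` tests the count before the first pass and skips the loop body when it is zero. `dbnz` instead decrements its register first and branches while the result is non-zero, so it behaves like a do-while loop and needs an explicit zero test before entry. The C sketch below is an approximate model of the two loop forms, not a description of the hardware; the function names are hypothetical, and an 8-bit counter (an assumption, chosen only so the wrap-around is cheap to observe) stands in for the real loop register.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate models only (hypothetical names). An 8-bit counter stands in
 * for the real loop register so that wrap-around is cheap to observe. */
typedef uint8_t counter_t;

/* ARCv2 style: lpnz checks the count before the first pass and skips the
 * body when it is zero; afterwards lp_count is decremented once per pass. */
static unsigned lpnz_trip_count(counter_t lp_count)
{
	unsigned trips = 0;

	while (lp_count != 0) {
		trips++;			/* loop body */
		lp_count--;
	}
	return trips;
}

/* ARCv3 style: dbnz decrements first and branches while the result is
 * non-zero, i.e. a do-while loop. "guarded" models a beq_s that skips
 * the loop when the preceding .f instruction produced a zero count. */
static unsigned dbnz_trip_count(counter_t r12, int guarded)
{
	unsigned trips = 0;

	if (guarded && r12 == 0)
		return 0;
	do {
		trips++;			/* loop body */
	} while (--r12 != 0);			/* dbnz r12, loop */
	return trips;
}

/* Both forms agree for every non-zero count; only the guard keeps dbnz
 * from running the full counter range when the count is zero. */
static void check_loop_models(void)
{
	for (unsigned n = 1; n <= UINT8_MAX; n++)
		assert(lpnz_trip_count((counter_t)n) ==
		       dbnz_trip_count((counter_t)n, 0));
	assert(lpnz_trip_count(0) == 0);
	assert(dbnz_trip_count(0, 1) == 0);
	assert(dbnz_trip_count(0, 0) == 256);	/* wrapped: 0 -> 255 -> ... -> 0 */
}
```

With the guard in place both forms run the body exactly `count` times; without it, a zero count wraps around and the body runs for the full range of the counter (256 passes in this model, far more with a real register).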
Here is my proposal for an ARCv3-compatible memcpy, which appears to perform much better than the generic one:
#include <linux/linkage.h>
#ifndef CONFIG_ARC_USE_UNALIGNED_MEM_ACCESS
#error "Unaligned access support needed"
#endif
#define MEMCPY_64_BIT	1	/* 1: 8-byte ldd/std, 0: 4-byte ld/st */
#define MEMCPY_PREFETCH	1	/* 1: software prefetch ahead of the loads */
#if MEMCPY_64_BIT
# define LOADX(DST,RX)		ldd.ab	DST, [RX, 8]
# define STOREX(SRC,RX)		std.ab	SRC, [RX, 8]
# define ZOLSHFT		5	/* 4 x 8 = 32 bytes per iteration */
# define ZOLAND			0x1F
#else
# define LOADX(DST,RX)		ld.ab	DST, [RX, 4]
# define STOREX(SRC,RX)		st.ab	SRC, [RX, 4]
# define ZOLSHFT		4	/* 4 x 4 = 16 bytes per iteration */
# define ZOLAND			0xF
#endif
ENTRY_CFI(memcpy)
	mov	r3, r0			; keep r0 intact: it is the return value
	lsr.f	r12, r2, ZOLSHFT	; r12 = number of full blocks, Z set if none
	beq_s	.Lcopy32_64bytes_done	; dbnz must not be entered with r12 == 0
.Lcopy32_64bytes_loop:
#if MEMCPY_PREFETCH
	prefetch [r1, 2 << ZOLSHFT]	; hint: two blocks ahead of the source
#endif
	LOADX	(r4, r1)		; load the whole block first...
	LOADX	(r6, r1)
	LOADX	(r8, r1)
	LOADX	(r10, r1)
	STOREX	(r4, r3)		; ...then store it
	STOREX	(r6, r3)
	STOREX	(r8, r3)
	STOREX	(r10, r3)
	dbnz	r12, .Lcopy32_64bytes_loop	; replaces the ARCv2 lpnz loop
.Lcopy32_64bytes_done:
	and.f	r12, r2, ZOLAND		; tail: up to 31 (or 15) bytes, Z set if none
	beq_s	.Lcopyremainingbytes_done
.Lcopyremainingbytes_loop:
#if MEMCPY_PREFETCH
	prefetch [r1, 2 << ZOLSHFT]
#endif
	ldb.ab	r4, [r1, 1]		; copy the tail one byte at a time
	stb.ab	r4, [r3, 1]
	dbnz	r12, .Lcopyremainingbytes_loop
.Lcopyremainingbytes_done:
	j	[blink]
END_CFI(memcpy)
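One way to check the copy logic independently of the hardware is to model the routine in C and compare it with the C library `memcpy` over many lengths and misalignments. The sketch below is such a model under stated assumptions: it follows the `MEMCPY_64_BIT == 1` path, uses `memcpy` into a `uint64_t` as a stand-in for an unaligned `ldd.ab`/`std.ab`, and leaves out the prefetch, which is only a hint. The names `arcv3_memcpy_model` and `model_matches_memcpy` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the proposed routine with MEMCPY_64_BIT == 1.
 * The prefetch hint is left out: it changes timing, not results. */
#define BLOCK_SHIFT 5			/* ZOLSHFT: 4 x 8 = 32 bytes per iteration */
#define BLOCK_MASK  0x1F		/* ZOLAND: up to 31 tail bytes */

/* Unaligned 8-byte load with post-increment, standing in for ldd.ab. */
static uint64_t load8_ab(const unsigned char **p)
{
	uint64_t v;

	memcpy(&v, *p, sizeof v);
	*p += sizeof v;
	return v;
}

/* Unaligned 8-byte store with post-increment, standing in for std.ab. */
static void store8_ab(unsigned char **p, uint64_t v)
{
	memcpy(*p, &v, sizeof v);
	*p += sizeof v;
}

void *arcv3_memcpy_model(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;			/* r3; r0 stays as the return value */
	const unsigned char *s = src;		/* r1 */
	size_t count = n >> BLOCK_SHIFT;	/* lsr.f r12, r2, ZOLSHFT */

	if (count != 0) {			/* beq_s .Lcopy32_64bytes_done */
		do {
			uint64_t r4 = load8_ab(&s), r6 = load8_ab(&s);
			uint64_t r8 = load8_ab(&s), r10 = load8_ab(&s);

			store8_ab(&d, r4);
			store8_ab(&d, r6);
			store8_ab(&d, r8);
			store8_ab(&d, r10);
		} while (--count != 0);		/* dbnz r12, .Lcopy32_64bytes_loop */
	}

	count = n & BLOCK_MASK;			/* and.f r12, r2, ZOLAND */
	if (count != 0) {			/* beq_s .Lcopyremainingbytes_done */
		do {
			*d++ = *s++;		/* ldb.ab / stb.ab */
		} while (--count != 0);		/* dbnz r12, .Lcopyremainingbytes_loop */
	}
	return dst;				/* j [blink] with r0 untouched */
}

/* Compare the model with the C library memcpy for every length up to
 * max_len and every source/destination misalignment from 0 to 7. */
int model_matches_memcpy(size_t max_len)
{
	enum { CAP = 512 };
	static unsigned char src[CAP + 8], got[CAP + 8], want[CAP + 8];

	assert(max_len <= CAP);
	for (size_t i = 0; i < sizeof src; i++)
		src[i] = (unsigned char)(i * 131u + 7u);

	for (size_t len = 0; len <= max_len; len++)
		for (size_t soff = 0; soff < 8; soff++)
			for (size_t doff = 0; doff < 8; doff++) {
				memset(got, 0xAA, sizeof got);
				memset(want, 0xAA, sizeof want);
				memcpy(want + doff, src + soff, len);
				if (arcv3_memcpy_model(got + doff, src + soff, len) != got + doff)
					return 0;
				/* Whole buffer compared, so overruns are caught too. */
				if (memcmp(got, want, sizeof got) != 0)
					return 0;
			}
	return 1;
}
```

A call such as `model_matches_memcpy(300)` exercises zero, one and many 32-byte blocks, every tail length from 0 to 31, and all source/destination offsets modulo 8, and returns 1 only if every copy matches and nothing outside the destination was written. It checks loop bounds, tail handling and the return value only; it cannot show how `ldd`/`std` and `prefetch` behave on actual ARCv3 hardware, or how fast the routine is.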
What do you think: is this implementation correct from your point of view as ARC experts? Or do you have better ideas on how to improve it?
Could you add it to the official ARC port? A similar improvement could also be applied to glibc/uClibc: if not upstream, then at least as a patch in the ARC Buildroot.