ARCv3 low memcpy performance because of no optimized memcpy variant present #177

@jzbydniewski

Description

The kernel memcpy used for ARCv3 is the default generic kernel implementation, which copies one byte at a time. This is inefficient, especially with the HW prefetcher disabled (as a result of HW bugs).
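
For context, a byte-at-a-time fallback has roughly the shape below. This is a minimal C sketch of that loop, not the kernel source verbatim, and the name `bytewise_copy` is hypothetical:

```c
#include <stddef.h>

/* Hypothetical name; models a generic byte-at-a-time memcpy fallback. */
void *bytewise_copy(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* One load and one store per byte: `count` iterations in total. */
	while (count--)
		*d++ = *s++;

	return dest;	/* like memcpy, return the original destination */
}
```

Copying N bytes this way takes N loop iterations, so each cache line of the source is touched by many separate loads, which is presumably where a disabled prefetcher hurts most.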

The ARC port provides some optimized memcpy implementations, but none for ARCv3. For example, "memcpy-archs-unaligned.S" is not used for ARCv3 because it requires zero-overhead loop instructions, which are not present on this architecture. It can, however, easily be modified to work on ARCv3.

Here is my proposal for an ARCv3-compatible memcpy, which seems to perform much better than the generic one:

#include <linux/linkage.h>

#ifndef CONFIG_ARC_USE_UNALIGNED_MEM_ACCESS
#error "Unaligned access support needed"
#endif

#define MEMCPY_64_BIT    1
#define MEMCPY_PREFETCH  1

#if MEMCPY_64_BIT
# define LOADX(DST,RX)		ldd.ab	DST, [RX, 8]
# define STOREX(SRC,RX)		std.ab	SRC, [RX, 8]
# define ZOLSHFT		5
# define ZOLAND			0x1F
#else
# define LOADX(DST,RX)		ld.ab	DST, [RX, 4]
# define STOREX(SRC,RX)		st.ab	SRC, [RX, 4]
# define ZOLSHFT		4
# define ZOLAND			0xF
#endif

ENTRY_CFI(memcpy)
  mov	r3, r0		; don't clobber ret val

  lsr.f	r12, r2, ZOLSHFT
  beq_s .Lcopy32_64bytes_done
.Lcopy32_64bytes_loop:
#if MEMCPY_PREFETCH
  prefetch [r1, 2 << ZOLSHFT]
#endif
  LOADX	(r4, r1)
  LOADX	(r6, r1)
  LOADX	(r8, r1)
  LOADX	(r10, r1)
  STOREX	(r4, r3)
  STOREX	(r6, r3)
  STOREX	(r8, r3)
  STOREX	(r10, r3)
  dbnz r12, .Lcopy32_64bytes_loop
.Lcopy32_64bytes_done:

  and.f	r12, r2, ZOLAND	; remaining tail: up to 31 bytes (15 if !MEMCPY_64_BIT)
  beq_s .Lcopyremainingbytes_done
.Lcopyremainingbytes_loop:
#if MEMCPY_PREFETCH
  prefetch [r1, 2 << ZOLSHFT]
#endif
  ldb.ab	r4, [r1, 1]
  stb.ab	r4, [r3, 1]
  dbnz r12, .Lcopyremainingbytes_loop
.Lcopyremainingbytes_done:

  j	[blink]
END_CFI(memcpy)
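
To make the control flow easier to review, here is a rough C model of the routine above for the MEMCPY_64_BIT case. It is only a sketch under stated assumptions: `model_memcpy`, `BLOCK_SHIFT` and `TAIL_MASK` are hypothetical names, the prefetch hints are omitted, and the 8-byte unaligned `ldd`/`std` accesses are modelled with fixed-size `memcpy` calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SHIFT 5                          /* ZOLSHFT: 32-byte blocks */
#define TAIL_MASK   ((1u << BLOCK_SHIFT) - 1)  /* ZOLAND: 0x1F */

/* Hypothetical C model of the proposed assembly; not the real routine. */
void *model_memcpy(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;            /* r3: working destination pointer */
	const unsigned char *s = src;       /* r1: source pointer */
	size_t blocks = n >> BLOCK_SHIFT;   /* lsr.f r12, r2, ZOLSHFT */
	size_t tail = n & TAIL_MASK;        /* and.f r12, r2, ZOLAND */

	/* Main loop: four 8-byte loads, then four 8-byte stores per pass. */
	while (blocks--) {
		uint64_t w[4];

		memcpy(w, s, sizeof w);     /* LOADX x4 with post-increment */
		memcpy(d, w, sizeof w);     /* STOREX x4 with post-increment */
		s += sizeof w;
		d += sizeof w;
	}

	/* Tail loop: the remaining 0..31 bytes, one byte at a time. */
	while (tail--)
		*d++ = *s++;

	return dest;                        /* r0 is left untouched */
}
```

In this model a 100-byte copy runs 3 block iterations plus 4 byte iterations, compared with 100 iterations for a pure byte loop.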

What do you think: is this implementation correct from your point of view as ARC experts? Or do you have better ideas on how to improve it?

Could you add it to the official ARC port? A similar improvement could also be applied to glibc/uClibc, if not upstream then at least as a patch in the ARC Buildroot.
