Commit 3fdd83c

Rationalize -mcpu for emulators, compilers and assemblers on ARM
Move the SVE example in from arm-assembly-cheat. atomic.cpp: add an aarch64 LSE ldadd placeholder, not compiling yet.
1 parent ce3d546 commit 3fdd83c

8 files changed

+168
-19
lines changed

README.adoc

Lines changed: 54 additions & 0 deletions
@@ -14006,6 +14006,60 @@ There are analogous LD3 and LD4 instruction.
 * assembly optimized libraries:
 ** https://github.com/projectNe10/Ne10
 
+==== ARM SVE
+
+Example: link:userland/arch/aarch64/sve.S[]
+
+Scalable Vector Extension.
+
+aarch64 only, newer than <<arm-neon>>.
+
+It is called Scalable because it does not specify the vector width! Therefore we don't have to worry about new vector width instructions every few years! Hurray!
+
+The instructions then allow implicitly tracking the loop index without knowing the actual vector length.
+
+Added to QEMU user mode in 3.0.0.
+
+TODO announcement date. Possibly 2017: https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf There is also a 2016 mention: https://community.arm.com/tools/hpc/b/hpc/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
+
+The Linux kernel shows `/proc/cpuinfo` compatibility as `sve`.
+
+Official spec: https://developer.arm.com/docs/100891/latest/sve-overview/introducing-sve
+
+===== SVE bibliography
+
+* https://www.rico.cat/files/ICS18-gem5-sve-tutorial.pdf step-by-step walkthrough of complete code execution examples, the best initial tutorial so far
+* https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf paper with a few nice concrete examples, illustrations and rationale
+* https://static.docs.arm.com/dui0965/c/DUI0965C_scalable_vector_extension_guide.pdf
+* https://developer.arm.com/products/software-development-tools/hpc/documentation/writing-inline-sve-assembly quick inlining guide
+
+====== SVE spec
+
+<<armarm8>> A1.7 "ARMv8 architecture extensions" says:
+
+____
+SVE is an optional extension to ARMv8.2. That is, SVE requires the implementation of ARMv8.2.
+____
+
+A1.7.8 "The Scalable Vector Extension (SVE)" then says that only changes to the existing registers are described in that manual, and that you should look instead at the "ARM Architecture Reference Manual Supplement, The Scalable Vector Extension (SVE), for ARMv8-A."
+
+We then download the zip from: https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a and it contains the PDF: `DDI0584A_d_SVE_supp_armv8A.pdf` which we use here.
+
+That document then describes the SVE instructions and registers.
+
+=== ARMv8 architecture extensions
+
+==== ARMv8.1 architecture extension
+
+<<armarm8-db>> A1.7.3 "The ARMv8.1 architecture extension"
+
+[[arm-lse]]
+===== ARM Large System Extensions (LSE)
+
+<<armarm8-db>> "ARMv8.1-LSE, ARMv8.1 Large System Extensions"
+
+* LDADD: link:userland/cpp/atomic.cpp[]
+
 === ARM assembly bibliography
 
 ==== ARM non-official bibliography

build-baremetal

Lines changed: 0 additions & 1 deletion
@@ -48,7 +48,6 @@ Build the baremetal examples with crosstool-NG.
 cc_flags = [
 '-I', self.env['root_dir'], LF,
 '-O{}'.format(self.env['optimization_level']), LF,
-'-mcpu={}'.format(self.env['mcpu']), LF,
 '-nostartfiles', LF,
 ]
 if self.env['arch'] == 'arm':

common.py

Lines changed: 35 additions & 2 deletions
@@ -658,20 +658,40 @@ def join(*paths):
 else:
 env['gem5_build_id'] = consts['default_build_id']
 env['is_arm'] = False
+# Our approach is as follows:
+#
+# * compilers: control the maximum arch version emitted explicitly with -march
+# +
+# This helps to prevent blowing up simulation unnecessarily.
+# +
+# It does not matter if we miss any perf features for QEMU which is functional,
+# but it could matter for gem5 perf simulations.
+# * assemblers: enable as many features as possible.
+# +
+# Well, if I'm explicitly writing down the instructions, I want
+# my emulator to blow up in peace!
+# * emulators: enable as many features as possible
+# +
+# This is the gem5 default behavior, for QEMU TODO not sure if default,
+# but we select it explicitly with -cpu max.
+# https://habkost.net/posts/2017/03/qemu-cpu-model-probing-story.html
+# +
+# We do this because QEMU does not add all possible Cortex Axx, there are
+# just too many, and gem5 does not allow selecting lower features in general.
 if env['arch'] == 'arm':
 env['armv'] = 7
-env['mcpu'] = 'cortex-a15'
 env['buildroot_toolchain_prefix'] = 'arm-buildroot-linux-gnueabihf'
 env['crosstool_ng_toolchain_prefix'] = 'arm-unknown-eabi'
 env['ubuntu_toolchain_prefix'] = 'arm-linux-gnueabihf'
 env['is_arm'] = True
+env['march'] = 'armv8-a'
 elif env['arch'] == 'aarch64':
 env['armv'] = 8
-env['mcpu'] = 'cortex-a57'
 env['buildroot_toolchain_prefix'] = 'aarch64-buildroot-linux-gnu'
 env['crosstool_ng_toolchain_prefix'] = 'aarch64-unknown-elf'
 env['ubuntu_toolchain_prefix'] = 'aarch64-linux-gnu'
 env['is_arm'] = True
+env['march'] = 'armv8-a+lse'
 elif env['arch'] == 'x86_64':
 env['crosstool_ng_toolchain_prefix'] = 'x86_64-unknown-elf'
 env['gem5_arch'] = 'X86'
@@ -1545,6 +1565,10 @@ def _build_one(
 cc_flags_after.extend(['-pthread', LF])
 if self.need_rebuild([in_path] + extra_objs + extra_deps, out_path):
 cc_flags.extend(my_path_properties['cc_flags'])
+if self.env['verbose']:
+cc_flags.extend([
+'-v', LF,
+])
 cc_flags_after.extend(my_path_properties['cc_flags_after'])
 if my_path_properties['cc_pedantic']:
 cc_flags.extend(['-pedantic', LF])
@@ -1557,6 +1581,15 @@ def _build_one(
 elif in_ext == self.env['cxx_ext']:
 cc = self.env['gxx_path']
 std = my_path_properties['cxx_std']
+if self.env['is_arm']:
+if in_ext == self.env['asm_ext']:
+cc_flags.extend([
+'-Xassembler', '-march=all', LF,
+])
+else:
+cc_flags.extend([
+'-march={}'.format(self.env['march']), LF,
+])
 if dirpath_relative_root_components_len > 0:
 if dirpath_relative_root_components[0] == 'userland':
 if dirpath_relative_root_components_len > 1:

path_properties.py

Lines changed: 4 additions & 2 deletions
@@ -344,7 +344,6 @@ def get(path):
 {
 'allowed_archs': {'arm'},
 'cc_flags': [
-'-Xassembler', '-mcpu=cortex-a72', LF,
 # To prevent:
 # > vfp.S: Error: selected processor does not support <FPU instruction> in ARM mode
 # https://stackoverflow.com/questions/41131432/cross-compiling-error-selected-processor-does-not-support-fmrx-r3-fpexc-in/52875732#52875732
@@ -383,7 +382,9 @@ def get(path):
 }
 ),
 'aarch64': (
-{'allowed_archs': {'aarch64'}},
+{
+'allowed_archs': {'aarch64'},
+},
 {
 'inline_asm': (
 {
@@ -399,6 +400,7 @@ def get(path):
 'signal_generated_by_os': True,
 'signal_received': signal.Signals.SIGILL,
 },
+'sve.S': {'gem5_unimplemented_instruction': True}
 }
 ),
 'x86_64': (

run

Lines changed: 7 additions & 8 deletions
@@ -576,6 +576,7 @@ Extra options to append at the end of the emulator command line.
 qemu_user_and_system_options +
 debug_args
 )
+cpu = 'max'
 else:
 extra_emulator_args.extend(extra_qemu_args)
 self.make_run_dirs()
@@ -594,9 +595,11 @@ Extra options to append at the end of the emulator command line.
 serial_monitor = ['-serial', serial, LF]
 if self.env['kvm']:
 extra_emulator_args.extend([
-'-cpu', 'host', LF,
 '-enable-kvm', LF,
 ])
+cpu = 'host'
+else:
+cpu = 'max'
 extra_emulator_args.extend([
 '-serial',
 'tcp::{},server,nowait'.format(self.env['extra_serial_port']), LF
@@ -706,19 +709,15 @@ Extra options to append at the end of the emulator command line.
 ])
 elif self.env['is_arm']:
 extra_emulator_args.extend(['-semihosting', LF])
-if self.env['arch'] == 'arm':
-cpu = 'cortex-a15'
-else:
-cpu = 'cortex-a57'
 append = ['-append', '{} {}'.format(root, kernel_cli), LF]
 cmd.extend(
-[
-'-cpu', cpu, LF,
-] +
 virtio_gpu_pci
 )
 if self.env['baremetal'] is None:
 cmd.extend(append)
+extra_emulator_args.extend([
+'-cpu', cpu, LF,
+])
 if self.env['tmux']:
 tmux_args = '--run-id {}'.format(self.env['run_id'])
 if self.env['tmux_program'] == 'shell':

userland/arch/aarch64/sve.S

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+/* https://github.com/cirosantilli/linux-kernel-module-cheat#arm-sve */
+
+#include <lkmc.h>
+
+.data
+x: .double 1.5, 2.5, 3.5, 4.5
+y: .double 5.0, 6.0, 7.0, 8.0
+y_expect: .double 8.0, 11.0, 14.0, 17.0
+a: .double 2.0
+n: .word 4
+
+LKMC_PROLOGUE
+adr x0, x
+adr x1, y
+adr x2, a
+adr x3, n
+bl daxpy
+LKMC_ASSERT_MEMCMP(y, y_expect, =0x20)
+LKMC_EPILOGUE
+
+/* Multiply by a scalar and add.
+ *
+ * Operation:
+ *
+ * Y += a * X
+ *
+ * C signature:
+ *
+ * void daxpy(double *x, double *y, double *a, int *n)
+ *
+ * The name "daxpy" comes from BLAS, which is documented alongside LAPACK:
+ * http://www.netlib.org/lapack/explore-html/de/da4/group__double__blas__level1_ga8f99d6a644d3396aa32db472e0cfc91c.html
+ *
+ * Adapted from: https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf
+ */
+daxpy:
+ldrsw x3, [x3]
+mov x4, 0
+whilelt p0.d, x4, x3
+ld1rd z0.d, p0/z, [x2]
+.loop:
+ld1d z1.d, p0/z, [x0, x4, lsl 3]
+ld1d z2.d, p0/z, [x1, x4, lsl 3]
+fmla z2.d, p0/m, z1.d, z0.d
+st1d z2.d, p0, [x1, x4, lsl 3]
+incd x4
+whilelt p0.d, x4, x3
+b.first .loop
+ret
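
The loop in sve.S never names a vector length: `whilelt` builds a predicate of active lanes and `incd` advances the index by however many doubles fit in one vector. As a reading aid, here is a minimal C++ sketch of that control flow. It assumes a hypothetical function `daxpy_model` whose `vl` parameter stands in for the lane count that real hardware would choose. It is a scalar model to read next to the assembly, not the author's code and not SVE intrinsics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of the predicated loop in sve.S.
// `vl` is an assumption: the number of 64-bit lanes per vector.
// On real SVE hardware this value is fixed by the implementation.
void daxpy_model(const std::vector<double> &x, std::vector<double> &y,
                 double a, std::size_t vl) {
    assert(vl > 0 && x.size() == y.size());
    const std::size_t n = x.size();
    // Like `whilelt` + `b.first`: keep going while any lane is active.
    // The step `i += vl` plays the role of `incd x4`.
    for (std::size_t i = 0; i < n; i += vl) {
        // The predicate p0 enables exactly the lanes with i + lane < n.
        const std::size_t active = std::min(vl, n - i);
        for (std::size_t lane = 0; lane < active; ++lane) {
            // Per-lane effect of `fmla z2.d, p0/m, z1.d, z0.d`.
            y[i + lane] += a * x[i + lane];
        }
    }
}
```

Calling it with the data from sve.S (`x = 1.5, 2.5, 3.5, 4.5`, `y = 5.0, 6.0, 7.0, 8.0`, `a = 2.0`) gives the same `y_expect` values for any positive `vl`, which is the point of a vector-length-agnostic loop.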

userland/arch/x86_64/cmpxchg.S

Lines changed: 1 addition & 2 deletions
@@ -1,4 +1,4 @@
-/* https://github.com/cirosantilli/linux-kernel-module-cheat#cmpxchg-instruction */
+/* https://github.com/cirosantilli/linux-kernel-module-cheat#x86-cmpxchg-instruction */
 
 #include <lkmc.h>
 
@@ -24,5 +24,4 @@ LKMC_PROLOGUE
 LKMC_ASSERT_EQ(%rax, $0)
 LKMC_ASSERT_EQ(%r13, $2)
 LKMC_ASSERT_EQ(%r14, $2)
-
 LKMC_EPILOGUE

userland/cpp/atomic.cpp

Lines changed: 18 additions & 4 deletions
@@ -1,5 +1,4 @@
 // https://github.com/cirosantilli/linux-kernel-module-cheat#cpp
-// https://github.com/cirosantilli/linux-kernel-module-cheat#x86-lock-prefix
 //
 // The non-atomic counters have undefined values which get printed:
 // they are extremely likely to be less than the correct value due to
@@ -15,7 +14,6 @@
 // On GCC 4.8 x86-64, using atomic offered a 5x performance improvement
 // over the same program with mutexes.
 
-
 #if __cplusplus >= 201103L
 #include <atomic>
 #include <cassert>
@@ -24,7 +22,7 @@
 #include <vector>
 std::atomic_ulong my_atomic_ulong(0);
 unsigned long my_non_atomic_ulong = 0;
-#if defined(__x86_64__)
+#if defined(__x86_64__) || defined(__aarch64__)
 unsigned long my_arch_atomic_ulong = 0;
 unsigned long my_arch_non_atomic_ulong = 0;
 #endif
@@ -41,13 +39,29 @@ void threadMain() {
 :
 :
 );
+// https://github.com/cirosantilli/linux-kernel-module-cheat#x86-lock-prefix
 __asm__ __volatile__ (
 "lock;"
 "incq %0;"
 : "+m" (my_arch_atomic_ulong)
 :
 :
 );
+#elif defined(__aarch64__)
+__asm__ __volatile__ (
+"add %0, %0, 1;"
+: "+r" (my_arch_non_atomic_ulong)
+:
+:
+);
+// https://github.com/cirosantilli/linux-kernel-module-cheat#arm-lse
+__asm__ __volatile__ (
+"ldadd %[inc], xzr, [%[addr]];"
+: "+m" (my_arch_atomic_ulong)
+: [inc] "r" (1UL),
+[addr] "r" (&my_arch_atomic_ulong)
+:
+);
 #endif
 }
 }
@@ -75,7 +89,7 @@ int main(int argc, char **argv) {
 // We can also use the atomics directly through `operator T` conversion.
 assert(my_atomic_ulong == my_atomic_ulong.load());
 std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
-#if defined(__x86_64__)
+#if defined(__x86_64__) || defined(__aarch64__)
 assert(my_arch_atomic_ulong == nthreads * niters);
 std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
 #endif
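
LDADD atomically adds a register to a memory location, which is the same read-modify-write that `std::atomic::fetch_add` expresses portably. The sketch below shows that portable form using a hypothetical helper, `count_with_fetch_add`, which is not part of atomic.cpp. Whether a compiler lowers it to LDADD depends on the target flags, for example an `-march` value that includes `+lse`. Treat that as an assumption to check in the generated assembly.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: several threads increment one shared counter
// with fetch_add, and the final value is returned.
unsigned long count_with_fetch_add(unsigned nthreads, unsigned long niters) {
    assert(nthreads > 0);
    std::atomic<unsigned long> counter(0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&counter, niters] {
            for (unsigned long i = 0; i < niters; ++i) {
                // Atomic read-modify-write; the returned old value is ignored,
                // much as the ldadd placeholder discards it into xzr.
                counter.fetch_add(1UL, std::memory_order_relaxed);
            }
        });
    }
    for (auto &th : threads) {
        th.join();
    }
    return counter.load();
}
```

Because every increment is atomic, the result equals `nthreads * niters`, the same invariant that atomic.cpp asserts for `my_arch_atomic_ulong`. Build with `-pthread`, as the repository does for threaded examples.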
