Commit 278ae07

Eliminate measured boot-time hot paths
Trace evidence (kernel_summary.txt top symbols) drove four targeted changes plus a methodology fix. Together they cut matched_kernel_blocks by 8.4% (3,384,139 -> 3,100,408) and shift PGO concentration over the layout-ordering threshold (top_32_ratio 0.3301 -> 0.3966, layout_ordering_recommended flips no -> yes). linux.axf size is unchanged at 1,303,072 bytes -- these are perf wins, not size wins.

patches/0011-tiny-arm-nommu-pfn-valid.patch

Make `select HAVE_ARCH_PFN_VALID` conditional on `MMU || !ARM_SINGLE_ARMV7M` so Cortex-M boards drop the heavyweight arch override (memblock_overlaps_region per call, plus the slow for_each_valid_pfn fallback that calls pfn_valid per PFN inside init_unavailable_range). On a single contiguous DRAM bank with virt == phys the generic FLATMEM bounds check is correct.

Trace before: pfn_valid 271,195 hits (8.0%) + init_unavailable_range 135,178 hits (4.0%) = ~12% of matched boot blocks, both gone after. Older ARM7/ARM9 NOMMU multi-bank boards keep the old check via the !ARM_SINGLE_ARMV7M clause.

configs/mps2-slim.dts

Hand-pruned replacement for the kernel-shipped mps2-an385.dtb (3,756 -> 1,550 bytes). Drops nodes with no driver or status = "disabled" on this build:

- armv7m-systick (disabled)
- sp804 dual-timer (disabled)
- serial@5000/6000 (disabled)
- sp805 watchdog (no driver)
- 4 unused fixed-clocks + 2 fixed-factor-clocks
- fpga@40020000/register-bit-led (no syscon, no LEDS_CLASS)
- smb/ethernet@0,0 smsc,lan9220 (no SMSC driver, CONFIG_NET=n)

Keeps NVIC, mps2-timer0+1 (the driver requires both for clocksource + clockevent), serial@4000, clock-50000000 + clock-sys, aliases, chosen, memory.

build_bootwrapper compiles this DTS with dtc (the kernel-built scripts/dtc/dtc when present, otherwise the host dtc) instead of copying mps2-an385.dtb; a dtc preflight check fails fast if neither is available.

Trace: fdt32_ld + fdt_offset_ptr + fdt_next_tag dropped from 217,177 hits (7.3%) to 84,292 (2.7%), a -54.5% reduction.

configs/pgo-workload.txt + scripts/{validate-qemu,qemu-profile}.expect

PGO trace was inflated by ~225K hits (8% of matched blocks) of UART/n_tty/uart_port_* code driven by verbose `cat /proc/<file>` output. Workload now redirects bulk output to /dev/null and emits a single "ok" marker. It is restricted to applets actually compiled into busybox (echo, uname, cat, cp, ln, ls, mkdir, mv, rm, test): the original used readlink/grep/printf/rmdir, which aren't built and only "passed" earlier because expect's regex matched the echoed command line, not real output.

The expect runners now disable terminal echo via `busybox stty -echo` right after wait_for_shell. Without it the line discipline feeds each "...echo ok" command back into expect's buffer, and the generic `ok` regex would match the echoed command line before the program ran. wait_for_shell itself was hardened with the same double-marker pattern (match both the echo and the actual stdout) so it cannot return while busybox is still forking.

build.sh comment block

Records the negative result for LZ4 and ZSTD initramfs compression: lib/decompress_unlz4.c hardcodes an 8 MiB output buffer (chunk size from the LZ4 legacy format), and lib/decompress_unzstd.c derives DStream workspace from the compressed file's window header, which under `zstd -19` (scripts/Makefile.lib cmd_zstd) becomes another 8 MiB request. Neither fits in a 16 MiB SSRAM bank's buddy allocator (no order-11 contiguous block exists at boot, and bumping ARCH_FORCE_MAX_ORDER does not coalesce one). Both panic during initramfs unpack; gzip stays. inflate_fast remains 2.8% of matched boot blocks -- the cost is real but the alternatives are unbootable as wired today.

Validated end-to-end on a freshly extracted linux-7.0 source tree: patch applies with no fuzz, build produces 1,303,072-byte linux.axf plus 1,550-byte mps2.dtb, validate-qemu.sh + collect-kernel-profile.sh both pass, and the final trace reproduces the headline numbers above.
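The headline percentages can be re-derived from the raw hit counts quoted in the message. The following is a minimal sketch, assuming awk is available on the build host; the `pct` helper is hypothetical and not part of this commit.

```shell
#!/bin/sh
# pct PART WHOLE: print PART as a percentage of WHOLE with one decimal place.
# Hypothetical helper for illustration only; it is not part of build.sh.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f\n", part / whole * 100 }'
}

before=3384139   # matched_kernel_blocks before this commit
after=3100408    # matched_kernel_blocks after this commit

reduction=$(pct "$((before - after))" "$before")
pfn_valid_share=$(pct 271195 "$before")
init_range_share=$(pct 135178 "$before")

echo "matched_kernel_blocks reduction: ${reduction}%"
echo "pfn_valid share before: ${pfn_valid_share}%"
echo "init_unavailable_range share before: ${init_range_share}%"
```

The three values correspond to the 8.4%, 8.0% and 4.0% figures in the message.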
1 parent 9021e1a commit 278ae07

6 files changed

Lines changed: 244 additions & 23 deletions

build.sh

Lines changed: 33 additions & 4 deletions
@@ -645,7 +645,7 @@ build_linux() {
     cd linux-${LINUX_VERSION}
 
     # Apply linux-tiny patches for reduced memory footprint and LTO support
-    for p in ../patches/0002-*.patch ../patches/0003-*.patch ../patches/0004-*.patch ../patches/0005-*.patch ../patches/0006-*.patch ../patches/0010-*.patch; do
+    for p in ../patches/0002-*.patch ../patches/0003-*.patch ../patches/0004-*.patch ../patches/0005-*.patch ../patches/0006-*.patch ../patches/0010-*.patch ../patches/0011-*.patch; do
         [ -f "${p}" ] || continue
         apply_patch_once "${p}"
     done
@@ -689,6 +689,21 @@ build_linux() {
     sed -i "/CONFIG_INITRAMFS_SOURCE=/d" .config
     echo "CONFIG_INITRAMFS_SOURCE=\"${ROOTFS} ${ROOTDIR}/configs/rootfs.dev\"" >>.config
     echo "CONFIG_INITRAMFS_COMPRESSION_GZIP=y" >>.config
+    # NOTE: LZ4 and ZSTD have been evaluated on this 16 MiB SSRAM target
+    # and both panic during initramfs unpack. lib/decompress_unlz4.c
+    # hardcodes an 8 MiB output buffer (LZ4_DEFAULT_UNCOMPRESSED_CHUNK_SIZE)
+    # which requires an order-11 contiguous block; the buddy allocator
+    # never forms one at boot on this RAM size even with
+    # ARCH_FORCE_MAX_ORDER=11. lib/decompress_unzstd.c sizes its DStream
+    # workspace from the compressed file's windowSize header, and the
+    # default `zstd -19` (scripts/Makefile.lib cmd_zstd) emits an
+    # 8 MiB window so the workspace also exceeds MAX_PAGE_ORDER. Any
+    # switch away from gzip on this target needs either kernel patches
+    # (chunk size in the LZ4 wrapper, or zstd command-line in
+    # scripts/Makefile.lib) or installing lzop on the build host. Until
+    # then, gzip stays and inflate_fast remains ~2.8% of matched boot
+    # blocks (95k TB hits) -- the cost is real but the alternatives are
+    # unbootable as wired today.
     if [ -n "${KERNEL_CONFIG_FRAGMENT}" ] && [ -f "${KERNEL_CONFIG_FRAGMENT}" ]; then
         cat "${KERNEL_CONFIG_FRAGMENT}" >>.config
     fi
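The order-11 figure in the comment above follows from simple page arithmetic. The sketch below assumes 4 KiB pages (an assumption about this target, not stated in the diff); `alloc_order` is a hypothetical helper that mimics how a buddy allocator rounds a request up to a power-of-two number of pages.

```shell
#!/bin/sh
# alloc_order BYTES [PAGE_SIZE]: smallest buddy order whose block holds BYTES.
# PAGE_SIZE defaults to 4096, which is an assumption about this target.
alloc_order() {
    bytes=$1
    page_size=${2:-4096}
    # Round the request up to whole pages.
    pages=$(( (bytes + page_size - 1) / page_size ))
    order=0
    # Grow the order until 2^order pages cover the request.
    while [ $(( 1 << order )) -lt "$pages" ]; do
        order=$(( order + 1 ))
    done
    echo "$order"
}

lz4_buf=$(( 8 * 1024 * 1024 ))   # 8 MiB output buffer cited in the comment
ram=$(( 16 * 1024 * 1024 ))      # 16 MiB SSRAM bank

echo "order for an 8 MiB buffer: $(alloc_order "$lz4_buf")"
echo "share of the bank: $(( lz4_buf * 100 / ram ))%"
```

Under these assumptions a single order-11 block is half of the 16 MiB bank, which is consistent with the comment's observation that none is free at boot.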
@@ -1699,15 +1714,29 @@ build_kernel_pgo_cycle() {
 build_bootwrapper() {
     echo "BUILD: building ARM CORTEX boot wrapper"
 
+    # Reuse the kernel-built dtc when available so fresh hosts do not need
+    # a separate device-tree-compiler package just for the bootwrapper DTB.
+    DTC_BIN="${ROOTDIR}/linux-${LINUX_VERSION}/scripts/dtc/dtc"
+    if [ -x "${DTC_BIN}" ]; then
+        :
+    elif command -v dtc >/dev/null 2>&1; then
+        DTC_BIN=dtc
+    else
+        echo "ERROR: no dtc available (expected ${DTC_BIN} or host 'dtc' in PATH)" >&2
+        exit 1
+    fi
+
     if [ ! -d bootwrapper ]; then
         run_logged "clone bootwrapper" git clone --depth 1 --single-branch https://github.com/ARM-software/bootwrapper.git -b cortex-m-linux
     fi
 
     cd bootwrapper
     cp ../linux-${LINUX_VERSION}/arch/arm/boot/Image .
-    # Linux 7.0 does not ship an AN386 DTS; the AN385 DTB is compatible
-    # because both FPGA images share the same peripheral and memory map.
-    cp ../linux-${LINUX_VERSION}/arch/arm/boot/dts/arm/mps2-an385.dtb mps2.dtb
+    # Hand-pruned DTS strips disabled peripherals (sp804/sp805/extra UARTs),
+    # the SMSC ethernet, FPGA LED MFD, and unused fixed-clocks; this halves
+    # the bootwrapper DTB and roughly halves the early-boot fdt_* hot path.
+    run_logged "compile slim mps2 dtb" \
+        "${DTC_BIN}" -q -I dts -O dtb -o mps2.dtb ../configs/mps2-slim.dts
     sed -i -e 's/mps2-an399.dtb/mps2.dtb/' -e 's/mps2-an385.dtb/mps2.dtb/' Makefile
     sed -i 's/0x60000000/0x21000000/' Makefile
     sed -i 's/. = PHYS_OFFSET;/. = 0x0;/' linux.lds.S
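The dtc preflight in build_bootwrapper can be exercised in isolation. The sketch below pulls the same decision into a function; `pick_dtc` and the temporary stand-in executable are hypothetical names used for illustration and do not appear in build.sh.

```shell
#!/bin/sh
# pick_dtc KERNEL_DTC: print the dtc to use, preferring the kernel-built one.
# Hypothetical refactoring of the inline check in build_bootwrapper.
pick_dtc() {
    kernel_dtc=$1
    if [ -x "$kernel_dtc" ]; then
        # Kernel tree already built its own dtc; use it.
        echo "$kernel_dtc"
    elif command -v dtc >/dev/null 2>&1; then
        # Fall back to a dtc found in PATH.
        echo dtc
    else
        echo "ERROR: no dtc available (expected $kernel_dtc or host 'dtc' in PATH)" >&2
        return 1
    fi
}

# Example: a stand-in executable plays the role of scripts/dtc/dtc.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' >"$demo_dir/dtc"
chmod +x "$demo_dir/dtc"
chosen=$(pick_dtc "$demo_dir/dtc")
echo "using: $chosen"
```

Returning a status rather than calling `exit` leaves the decision to abort with the caller, which makes the helper easier to test; the commit itself exits directly.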

configs/mps2-slim.dts

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+/dts-v1/;
+
+/ {
+	#address-cells = <0x01>;
+	#size-cells = <0x01>;
+	model = "ARM MPS2 Application Note 385/386 (slim)";
+	compatible = "arm,mps2";
+
+	chosen {
+		bootargs = "earlycon";
+		stdout-path = "serial0:9600n8";
+	};
+
+	aliases {
+		serial0 = "/soc/apb@40000000/serial@4000";
+	};
+
+	memory@21000000 {
+		device_type = "memory";
+		reg = <0x21000000 0x1000000>;
+	};
+
+	interrupt-controller@e000e100 {
+		compatible = "arm,armv7m-nvic";
+		interrupt-controller;
+		#interrupt-cells = <0x01>;
+		reg = <0xe000e100 0xc00>;
+		phandle = <0x01>;
+	};
+
+	clock-50000000 {
+		compatible = "fixed-clock";
+		#clock-cells = <0x00>;
+		clock-frequency = <0x2faf080>;
+		phandle = <0x03>;
+	};
+
+	clock-sys {
+		compatible = "fixed-factor-clock";
+		clocks = <0x03>;
+		#clock-cells = <0x00>;
+		clock-div = <0x02>;
+		clock-mult = <0x01>;
+		phandle = <0x02>;
+	};
+
+	soc {
+		#address-cells = <0x01>;
+		#size-cells = <0x01>;
+		compatible = "simple-bus";
+		interrupt-parent = <0x01>;
+		ranges;
+
+		apb@40000000 {
+			compatible = "simple-bus";
+			#address-cells = <0x01>;
+			#size-cells = <0x01>;
+			ranges = <0x00 0x40000000 0x10000>;
+
+			mps2-timer0@0 {
+				compatible = "arm,mps2-timer";
+				reg = <0x00 0x1000>;
+				interrupts = <0x08>;
+				clocks = <0x02>;
+				status = "okay";
+			};
+
+			mps2-timer1@1000 {
+				compatible = "arm,mps2-timer";
+				reg = <0x1000 0x1000>;
+				interrupts = <0x09>;
+				clocks = <0x02>;
+				status = "okay";
+			};
+
+			serial@4000 {
+				compatible = "arm,mps2-uart";
+				reg = <0x4000 0x1000>;
+				interrupts = <0x00 0x01 0x0c>;
+				clocks = <0x02>;
+				status = "okay";
+			};
+		};
+	};
+};
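Because the slim DTS keeps the decompiled hex cell values, a quick decode helps confirm that the node names and the numbers agree. This is a minimal sketch using only shell arithmetic; the variable names are illustrative and the values are copied from the DTS above.

```shell
#!/bin/sh
# Decode a few hex cells from configs/mps2-slim.dts with shell arithmetic.
clock_hz=$(( 0x2faf080 ))               # clock-50000000: clock-frequency
sys_hz=$(( clock_hz * 0x01 / 0x02 ))    # clock-sys: clock-mult / clock-div
mem_base=$(( 0x21000000 ))              # memory@21000000: reg base cell
mem_size=$(( 0x1000000 ))               # memory@21000000: reg size cell
uart_addr=$(( 0x40000000 + 0x4000 ))    # apb ranges base + serial@4000 offset

echo "clock-50000000: ${clock_hz} Hz"
echo "clock-sys:      ${sys_hz} Hz"
echo "memory size:    $(( mem_size / 1024 / 1024 )) MiB"
printf 'memory end:     0x%x\n' "$(( mem_base + mem_size ))"
printf 'serial0 base:   0x%x\n' "$uart_addr"
```

The decoded clock matches the node name (50,000,000 Hz) and the memory node describes the 16 MiB bank mentioned in the commit message.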

configs/pgo-workload.txt

Lines changed: 28 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,29 @@
11
# label|command|expected regex
2-
busybox-help|busybox|BusyBox v
3-
uname|busybox uname -s|Linux
4-
list-bin|busybox ls /bin|busybox
5-
proc-version|busybox cat /proc/version|Linux version
6-
proc-mounts|busybox cat /proc/mounts|/proc
7-
proc-self-stat|busybox cat /proc/self/stat| \(
8-
list-proc|busybox ls /proc|self
9-
read-self-exe|busybox readlink /proc/self/exe|busybox
10-
shell-path|busybox test -x /bin/sh && busybox echo /bin/sh-ok|/bin/sh-ok
11-
create-file|busybox sh -c 'busybox echo alpha >/tmp/pgo-file && busybox cat /tmp/pgo-file'|alpha
12-
rename-file|busybox sh -c 'busybox mv /tmp/pgo-file /tmp/pgo-file-renamed && busybox cat /tmp/pgo-file-renamed'|alpha
13-
hardlink-file|busybox sh -c 'busybox ln /tmp/pgo-file-renamed /tmp/pgo-file-link && busybox ls -l /tmp/pgo-file-link'|pgo-file-link
14-
truncate-file|busybox sh -c ': >/tmp/pgo-file-renamed && busybox test ! -s /tmp/pgo-file-renamed && busybox echo zero-ok'|zero-ok
15-
mkdir-rmdir|busybox sh -c 'busybox mkdir -p /tmp/pgo-dir && busybox ls /tmp | busybox grep pgo-dir && busybox rmdir /tmp/pgo-dir'|pgo-dir
16-
copy-file|busybox sh -c 'busybox cp /bin/busybox /tmp/pgo-copy && busybox ls -l /tmp/pgo-copy'|pgo-copy
17-
poll-path|busybox sh -c 'busybox printf \"poll-data\" >/tmp/pgo-poll && busybox cat /tmp/pgo-poll'|poll-data
18-
cleanup-tmp|busybox sh -c 'busybox rm -f /tmp/pgo-file-renamed /tmp/pgo-file-link /tmp/pgo-poll && busybox echo cleanup-ok'|cleanup-ok
2+
#
3+
# Each command does the work, redirects bulk output to /dev/null, and emits
4+
# a single short marker so the expect runner sees a constant short string.
5+
# Restricted to applets compiled into BusyBox per configs/busybox-1.37.0.config:
6+
# echo, uname, cat, cp, ln, ls, mkdir, mv, rm, test (and the hush builtins
7+
# `:`, test, echo). Avoids grep, readlink, rmdir, printf, true, head -- all
8+
# disabled in this build.
9+
#
10+
# This keeps the workload's UART/n_tty/uart_port hot path off the boot trace,
11+
# which previously inflated matched_kernel_blocks by ~8% with no insight into
12+
# kernel code paths that matter for layout or size decisions.
13+
busybox-help|: && busybox echo ok|ok
14+
uname|busybox uname -s >/dev/null && busybox echo ok|ok
15+
list-bin|busybox ls /bin >/dev/null && busybox echo ok|ok
16+
proc-version|busybox cat /proc/version >/dev/null && busybox echo ok|ok
17+
proc-mounts|busybox cat /proc/mounts >/dev/null && busybox echo ok|ok
18+
proc-self-stat|busybox cat /proc/self/stat >/dev/null && busybox echo ok|ok
19+
list-proc|busybox ls /proc >/dev/null && busybox echo ok|ok
20+
read-self-exe|busybox ls -l /proc/self/exe >/dev/null && busybox echo ok|ok
21+
shell-path|busybox test -x /bin/sh && busybox echo ok|ok
22+
create-file|busybox sh -c 'busybox echo alpha >/tmp/pgo-file && busybox cat /tmp/pgo-file >/dev/null' && busybox echo ok|ok
23+
rename-file|busybox sh -c 'busybox mv /tmp/pgo-file /tmp/pgo-file-renamed && busybox cat /tmp/pgo-file-renamed >/dev/null' && busybox echo ok|ok
24+
hardlink-file|busybox sh -c 'busybox ln /tmp/pgo-file-renamed /tmp/pgo-file-link && busybox ls /tmp/pgo-file-link >/dev/null' && busybox echo ok|ok
25+
truncate-file|busybox sh -c ': >/tmp/pgo-file-renamed && busybox test ! -s /tmp/pgo-file-renamed' && busybox echo ok|ok
26+
mkdir-rmdir|busybox sh -c 'busybox mkdir -p /tmp/pgo-dir && busybox ls /tmp/pgo-dir >/dev/null && busybox rm -rf /tmp/pgo-dir' && busybox echo ok|ok
27+
copy-file|busybox sh -c 'busybox cp /bin/busybox /tmp/pgo-copy && busybox ls /tmp/pgo-copy >/dev/null' && busybox echo ok|ok
28+
poll-path|busybox sh -c 'busybox echo poll-data >/tmp/pgo-poll && busybox cat /tmp/pgo-poll >/dev/null' && busybox echo ok|ok
29+
cleanup-tmp|busybox sh -c 'busybox rm -f /tmp/pgo-file-renamed /tmp/pgo-file-link /tmp/pgo-poll' && busybox echo ok|ok
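Since the earlier workload silently relied on applets that were never built, a small lint over the `label|command|expected regex` format could catch a regression before a QEMU run. The sketch below is hypothetical and not part of this commit: `lint_workload` and the allowlist variable are illustrative names, `sh` is added to the allowlist on the assumption that `busybox sh` is the hush shell, `grep -o` is assumed to be available on the host, and commands containing a literal `|` are not handled.

```shell
#!/bin/sh
# Applets the workload may invoke as "busybox <applet>". The list mirrors the
# workload header comment; "sh" is an assumption (hush invoked via busybox).
allowed="echo uname cat cp ln ls mkdir mv rm test sh"

# lint_workload FILE: report every "busybox <applet>" use outside the allowlist.
# Returns 1 if any violation was found, 0 otherwise.
lint_workload() {
    status=0
    while IFS='|' read -r label command regex; do
        # Skip blank lines and comment lines.
        case "$label" in ''|'#'*) continue ;; esac
        for applet in $(printf '%s\n' "$command" | grep -oE 'busybox [a-z]+' | cut -d' ' -f2); do
            case " $allowed " in
                *" $applet "*) ;;
                *) echo "$label: applet '$applet' not in allowlist"; status=1 ;;
            esac
        done
    done <"$1"
    return "$status"
}

# Example run against a two-step sample, one of which uses a disabled applet.
sample=$(mktemp)
cat >"$sample" <<'EOF'
# label|command|expected regex
list-bin|busybox ls /bin >/dev/null && busybox echo ok|ok
read-self-exe|busybox readlink /proc/self/exe|busybox
EOF
lint_workload "$sample" || echo "workload uses applets outside the allowlist"
rm -f "$sample"
```

A static check like this complements the `stty -echo` change: the runtime fix stops false positives from echoed input, while the lint flags steps that could never produce real output in the first place.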

patches/0011-tiny-arm-nommu-pfn-valid.patch

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+From: Jim Huang <jserv@ccns.ncku.edu.tw>
+Subject: [PATCH] tiny: arm: skip HAVE_ARCH_PFN_VALID on ARMv7-M
+
+The ARM-specific pfn_valid() in arch/arm/mm/init.c walks memblock via
+memblock_overlaps_region() on every call. It exists to handle the
+mismatch between mem_map and physical memory layout on classic ARM/A
+systems with split banks, holes, or aliased physical addresses.
+
+On Cortex-M (CPU_V7M) NOMMU boards built via ARM_SINGLE_ARMV7M the
+picture is much simpler:
+  - one contiguous DRAM bank (e.g. MPS2-AN386 SSRAM at 0x21000000),
+  - virt == phys, mem_map covers the bank linearly,
+  - PFN never truncates because the address space fits a u32.
+
+In that regime the generic FLATMEM pfn_valid (single bounds check
+against max_mapnr) is both correct and cheap, and disabling
+HAVE_ARCH_PFN_VALID also re-enables the tight integer-range
+for_each_valid_pfn() loop in init_unavailable_range() instead of the
+fallback that calls pfn_valid() per pfn.
+
+Older ARM7/ARM9 NOMMU boards can have multiple discontinuous memory
+banks where the FLATMEM bounds check would incorrectly accept PFNs in
+the holes; gating on `MMU || !ARM_SINGLE_ARMV7M` keeps those
+configurations on the heavyweight memblock-aware check.
+
+Boot-trace (TB-execution rollup) before:
+
+  pfn_valid               271,195 hits   8.0% of matched kernel blocks
+  init_unavailable_range  135,178 hits   4.0%
+
+Together ~12% of matched boot blocks were spent inside the heavyweight
+arch override -- pure overhead with a single contiguous bank.
+
+Patch makes the select conditional; MMU builds and non-ARMV7M NOMMU
+builds keep the old behavior unchanged.
+---
+ arch/arm/Kconfig | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
+--- a/arch/arm/Kconfig
++++ b/arch/arm/Kconfig
+@@ -90,7 +90,7 @@ config ARM
+ 	select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
+ 	select HAVE_ARCH_KSTACK_ERASE
+ 	select HAVE_ARCH_MMAP_RND_BITS if MMU
+-	select HAVE_ARCH_PFN_VALID
++	select HAVE_ARCH_PFN_VALID if MMU || !ARM_SINGLE_ARMV7M
+ 	select HAVE_ARCH_SECCOMP
+ 	select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT
+ 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
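The "single bounds check" the patch description relies on can be modelled with a few lines of arithmetic. This is a simplified sketch, not kernel code: it assumes 4 KiB pages (PAGE_SHIFT=12) and takes the bank base and size from the memory node in configs/mps2-slim.dts; `flat_pfn_valid` is a hypothetical name for the model.

```shell
#!/bin/sh
# Simplified model of a FLATMEM-style PFN bounds check for the single bank.
# PAGE_SHIFT=12 (4 KiB pages) is an assumption about this target.
PAGE_SHIFT=12
bank_base=$(( 0x21000000 ))
bank_size=$(( 0x1000000 ))
pfn_offset=$(( bank_base >> PAGE_SHIFT ))   # first PFN covered by mem_map
max_mapnr=$(( bank_size >> PAGE_SHIFT ))    # number of pages in the bank

# flat_pfn_valid PFN: succeed when PFN lies inside the contiguous bank.
flat_pfn_valid() {
    [ "$1" -ge "$pfn_offset" ] && [ $(( $1 - pfn_offset )) -lt "$max_mapnr" ]
}

for addr in 0x20fff000 0x21000000 0x21fff000 0x22000000; do
    pfn=$(( addr >> PAGE_SHIFT ))
    if flat_pfn_valid "$pfn"; then
        echo "$addr -> pfn $pfn: valid"
    else
        echo "$addr -> pfn $pfn: invalid"
    fi
done
```

With one contiguous bank there are no holes for the range check to accept by mistake, which is the case the `!ARM_SINGLE_ARMV7M` clause keeps separate from multi-bank boards.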

scripts/qemu-profile.expect

Lines changed: 22 additions & 1 deletion
@@ -34,8 +34,12 @@ proc wait_for_shell {} {
     for {set attempt 0} {$attempt < 8} {incr attempt} {
         set marker "__PROFILE_READY__$attempt"
         send -- "busybox echo ${marker}\r"
+        # Match the marker TWICE: once in the line-discipline echo of the
+        # input, then in busybox's actual stdout. The single-match form
+        # races against shell init -- it returns when the tty buffers the
+        # echoed command line, before busybox has even forked.
         expect {
-            -re $marker { return }
+            -re "${marker}\[\\r\\n\]+${marker}" { return }
             timeout {
                 send -- "\r"
             }
@@ -153,6 +157,23 @@ expect {
 
 wait_for_shell
 
+# See validate-qemu.expect for the rationale -- disable terminal echo
+# so workload commands like "... && busybox echo ok" only have their
+# actual output reach expect's buffer (not the echoed command line),
+# letting trimmed workloads use simple markers without false positives.
+send -- "busybox stty -echo; busybox echo __ECHO_DISABLED__\r"
+expect {
+    -re "__ECHO_DISABLED__\[\\r\\n\]+__ECHO_DISABLED__" {}
+    timeout {
+        send_user "ERROR: failed to disable terminal echo\n"
+        exit 1
+    }
+    eof {
+        send_user "ERROR: QEMU exited before terminal echo could be disabled\n"
+        exit 1
+    }
+}
+
 set workload_steps [load_workload $workload_path]
 set step_index 0
 foreach step $workload_steps {

scripts/validate-qemu.expect

Lines changed: 25 additions & 1 deletion
@@ -26,8 +26,12 @@ proc wait_for_shell {} {
     for {set attempt 0} {$attempt < 6} {incr attempt} {
         set marker "__SHELL_READY__$attempt"
         send -- "busybox echo ${marker}\r"
+        # Match the marker TWICE: once in the line-discipline echo of the
+        # input, then in busybox's actual stdout. Without the second match
+        # we may return while busybox has not yet finished forking, so
+        # subsequent commands race against shell init.
         expect {
-            -re $marker {
+            -re "${marker}\[\\r\\n\]+${marker}" {
                 set shell_ready_ms [clock milliseconds]
                 return
             }
@@ -148,6 +152,26 @@ expect {
 
 wait_for_shell
 
+# Disable terminal echo before running the workload. Without this the
+# tty line-discipline echoes each "busybox echo ok" command back to
+# expect's buffer, and a generic regex such as "ok" matches the echoed
+# command line before the program's actual output. The double-marker
+# expect below waits for both the echoed copy and the actual echo of
+# __ECHO_DISABLED__, so by the time we return, stty -echo has taken
+# effect and subsequent run_guest_check sees only program output.
+send -- "busybox stty -echo; busybox echo __ECHO_DISABLED__\r"
+expect {
+    -re "__ECHO_DISABLED__\[\\r\\n\]+__ECHO_DISABLED__" {}
+    timeout {
+        send_user "ERROR: failed to disable terminal echo\n"
+        exit 1
+    }
+    eof {
+        send_user "ERROR: QEMU exited before terminal echo could be disabled\n"
+        exit 1
+    }
+}
+
 set workload_steps [load_workload $workload_path]
 set step_index 0
 foreach step $workload_steps {
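The race that the double-marker pattern closes can be reproduced without QEMU by simulating what expect's buffer holds at two points in time. The sketch below is a shell stand-in for the Tcl regex, not the runners' actual code: `has_double_marker` is a hypothetical helper, and `@` is used as a placeholder for CR/LF on the assumption that the marker never contains that character.

```shell
#!/bin/sh
# Simulate expect's input buffer to compare single- and double-marker matching.
marker="__SHELL_READY__0"
cr=$(printf '\r')
nl='
'

# has_double_marker BUFFER: succeed when the marker appears twice in a row,
# separated only by CR/LF characters (mapped to '@' so grep sees one line).
has_double_marker() {
    printf '%s' "$1" | tr '\r\n' '@@' | grep -Eq "${marker}@+${marker}"
}

# Buffer right after the tty echoes the typed command; busybox has not run yet.
echo_only="busybox echo ${marker}${cr}${nl}"
# Buffer once busybox has actually printed the marker.
complete="${echo_only}${marker}${cr}${nl}"

for name in echo_only complete; do
    eval "buf=\$$name"
    case "$buf" in
        *"$marker"*) single=match ;;
        *) single=no-match ;;
    esac
    if has_double_marker "$buf"; then double=match; else double=no-match; fi
    echo "$name: single=$single double=$double"
done
```

The single-marker test already succeeds on the echo-only buffer, which is the premature return described in the comments; the double-marker test succeeds only once real output has arrived.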
