Add selective TLBI and per-vCPU accumulator#24

Merged
jserv merged 1 commit into main from tlb on May 11, 2026
Conversation

@jserv
Contributor

@jserv jserv commented May 11, 2026

Replace the blanket TLBI VMALLE1IS that ran after every page-table-modifying syscall with a per-VA TLBI VAE1IS path bounded by 16 pages, upgrading to broadcast for larger ranges. Common cases (RELRO mprotect, small munmap, MAP_FIXED PROT_NONE invalidation) now keep unrelated TLB entries alive across the syscall return.
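The bound-then-upgrade decision can be sketched as follows. This is a minimal model, not code from the tree: the constant and function names are illustrative; only the X8 values 0/1/3 and the 16-page threshold come from the PR.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLBI_SELECTIVE_MAX_PAGES 16 /* beyond this, fall back to broadcast */

/* X8 wire values described in the PR (2 is the execve marker, not a flush). */
enum tlbi_kind { TLBI_NONE = 0, TLBI_BROADCAST = 1, TLBI_SELECTIVE = 3 };

/* Map an invalidation request onto the X8/X9/X10 protocol: ranges of up
 * to 16 pages take the per-VA path, larger ranges upgrade to VMALLE1IS. */
static enum tlbi_kind tlbi_classify(uint64_t start_va, uint64_t npages,
                                    uint64_t *x9, uint64_t *x10)
{
    if (npages == 0)
        return TLBI_NONE;
    if (npages > TLBI_SELECTIVE_MAX_PAGES)
        return TLBI_BROADCAST;
    *x9 = start_va & ~((1ULL << PAGE_SHIFT) - 1); /* page-aligned start VA */
    *x10 = npages;
    return TLBI_SELECTIVE;
}
```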

Stage requests on a per-vCPU TLS slot (cpu_tlbi_req in core/guest.h) rather than a guest-global accumulator. A global slot let one vCPU's syscall epilogue drain another vCPU's pending request before the second vCPU eret'd back to EL0, leaving stale translations live until the broadcast TLBI from the first vCPU caught up. With per-vCPU TLS each thread strictly owns its own request and no concurrent vCPU can read, clear, or partially observe it. The slot is C11 _Thread_local, so fork-child and CLONE_THREAD workers start with TLBI_NONE for free.
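A minimal sketch of the per-thread slot and the isolation it buys. The struct fields and helper names here are hypothetical; only the `cpu_tlbi_req` name and the C11 `_Thread_local` storage class come from the PR.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shape of the per-vCPU accumulator. */
struct tlbi_req {
    int kind;          /* 0 = TLBI_NONE, 1 = broadcast, 3 = selective */
    uint64_t start_va; /* page-aligned start of the pending range */
    uint64_t npages;
};

/* C11 _Thread_local: each thread's slot is zero-initialized, so a fresh
 * CLONE_THREAD worker starts at TLBI_NONE (0) with no explicit reset. */
static _Thread_local struct tlbi_req cpu_tlbi_req;

static void *worker_initial_kind(void *out)
{
    *(int *)out = cpu_tlbi_req.kind; /* fresh thread: 0, not main's value */
    return NULL;
}

/* Stage a request on the main thread, then observe a new thread's slot:
 * no other thread can read, clear, or partially observe main's request. */
static int demo_isolation(void)
{
    cpu_tlbi_req.kind = 1; /* main thread stages a broadcast */
    int seen = -1;
    pthread_t t;
    if (pthread_create(&t, NULL, worker_initial_kind, &seen) != 0)
        return -1;
    pthread_join(t, NULL);
    return seen; /* 0: the worker strictly owns its own slot */
}
```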

Extend the X8 wire protocol after HVC #5: 0 skips the flush, 1 keeps the broadcast meaning, 2 stays reserved for the execve drop-frame marker the shim handles separately, and 3 selects the new selective path with X9 carrying the page-aligned start VA and X10 the page count. The shim's tlbi_selective branch issues TLBI VAE1IS in a loop with a defensive cbz x10 guard against a stray zero-count request, and tails with DSB ISH + IC IALLU + DSB + ISB so callers like file-backed mmap of executable pages still see the same I-cache invalidation as the broadcast path.
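The shim's loop runs in assembly; the sketch below models only the per-page operand computation in C, assuming the architectural TLBI VAE1IS operand layout (VA[55:12] in Xt bits [43:0], ASID field left zero). The function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Model of the tlbi_selective loop: compute the sequence of Xt values
 * the loop would feed to TLBI VAE1IS. The real shim also emits the
 * DSB ISH + IC IALLU + DSB + ISB tail after the loop. */
static size_t tlbi_selective_operands(uint64_t start_va, uint64_t npages,
                                      uint64_t *out, size_t cap)
{
    if (npages == 0) /* cbz x10 guard: a stray zero-count request is a no-op */
        return 0;
    size_t n = 0;
    for (uint64_t va = start_va; npages-- && n < cap;
         va += 1ULL << PAGE_SHIFT)
        out[n++] = (va >> PAGE_SHIFT) & ((1ULL << 44) - 1); /* VA[55:12] */
    return n;
}
```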

Switch the W^X HVC #9 fault handler in shim.S to single-page TLBI VAE1IS using FAR_EL1. Per ARM ARM B2.2.5.6, TLBI VAE1IS for any VA invalidates every cached entry containing that VA, so the per-page TLBI also retires any 2 MiB block entry the prior split_l2_block left behind. guest_split_block therefore no longer requests a separate TLBI: every caller follows it with guest_invalidate_ptes or guest_update_perms on the actually-changing range, and that subsequent per-page TLBI is sufficient.
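A sketch of the operand derivation, under the same assumed VAE1IS operand layout; the function name is hypothetical. Because the instruction invalidates every cached entry containing the VA, this one operand also covers a stale 2 MiB block entry spanning the faulting page.

```c
#include <stdint.h>

#define PAGE_MASK 0xFFFULL

/* Derive the single-page TLBI VAE1IS operand from the fault address in
 * FAR_EL1: mask down to the containing page, then pack VA[55:12] into
 * the low 44 bits of the operand register. */
static uint64_t wx_tlbi_operand(uint64_t far_el1)
{
    uint64_t page_va = far_el1 & ~PAGE_MASK;     /* page containing fault */
    return (page_va >> 12) & ((1ULL << 44) - 1); /* VA[55:12] -> Xt[43:0] */
}
```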

guest_update_perms now tracks the smallest sub-range whose L3 descriptor actually changed and only requests TLBI for that sub-range, eliminating the broadcast-on-no-op false positive previously emitted by adjacent same-perm mprotect storms (the common shape of dynamic-linker RELRO).
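The sub-range narrowing can be sketched as follows. Types and names are illustrative (real L3 descriptors are 64-bit; an int stands in here): only pages whose descriptor actually changes widen the pending range, so an all-no-op mprotect requests nothing.

```c
#include <stdint.h>

/* Accumulated bounds of the pages whose descriptor actually changed. */
struct range {
    uint64_t lo, hi; /* page VAs; valid only when any != 0 */
    int any;
};

/* Visit one page during a perms update: widen the pending TLBI sub-range
 * only when the old and new descriptors differ. */
static void note_page(struct range *r, uint64_t va, int old_desc,
                      int new_desc)
{
    if (old_desc == new_desc)
        return; /* no descriptor change: contributes no TLBI */
    if (!r->any || va < r->lo)
        r->lo = va;
    if (!r->any || va > r->hi)
        r->hi = va;
    r->any = 1;
}
```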

Clear the per-vCPU slot at the end of guest_bootstrap_create_vcpu: guest_build_page_tables and the boot-time guest_invalidate_ptes calls (stack guard, null page) accumulate TLBI requests on the main thread's TLS, but the shim's _start does its own TLBI VMALLE1IS before enabling the MMU, so the first guest syscall must not redundantly broadcast on top.


Summary by cubic

Replaces always-broadcast TLB flushes with selective per-VA invalidation and a per‑vCPU accumulator to reduce flush cost and avoid cross‑vCPU races.

  • New Features
    • Per‑vCPU _Thread_local cpu_tlbi_req accumulates pending TLB work; the syscall epilogue maps it onto X8/X9/X10 and then clears it. Bootstrap and execve also clear the slot.
    • Selective TLBI: use VAE1IS for up to 16 pages, otherwise VMALLE1IS. Extends the HVC #5 protocol: X8=0 (none), X8=1 (broadcast), X8=2 (execve), X8=3 (range via X9 start VA and X10 page count). Shim issues a per‑page TLBI loop and tails with DSB ISH + IC IALLU + DSB + ISB to preserve I‑cache behavior.
    • W^X handler now flushes only the faulting page using FAR_EL1 (single TLBI VAE1IS). guest_split_block no longer requests its own TLBI; subsequent per‑page invalidation retires the old block entry. guest_update_perms tracks the minimal changed sub‑range and only requests TLBI for that range.

Written for commit e13ade1. Summary will update on new commits.


@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 8 files

@jserv jserv merged commit 4bea0ad into main May 11, 2026
5 checks passed
@jserv jserv deleted the tlb branch May 11, 2026 02:34
