Add selective TLBI and per-vCPU accumulator#24

Merged
jserv merged 1 commit into main from tlb on May 11, 2026
Conversation

@jserv
Contributor

@jserv jserv commented May 11, 2026

Replace the blanket TLBI VMALLE1IS that ran after every page-table-modifying syscall with a per-VA TLBI VAE1IS path bounded by 16 pages, upgrading to broadcast for larger ranges. Common cases (RELRO mprotect, small munmap, MAP_FIXED PROT_NONE invalidation) now keep unrelated TLB entries alive across the syscall return.
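The bound-then-upgrade decision can be sketched as follows. This is a minimal model, not code from the tree: the constant and function names are illustrative; only the X8 values 0/1/3 and the 16-page threshold come from the PR.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define TLBI_SELECTIVE_MAX_PAGES 16 /* beyond this, fall back to broadcast */

/* X8 wire values described in the PR (2 is the execve marker, not a flush). */
enum tlbi_kind { TLBI_NONE = 0, TLBI_BROADCAST = 1, TLBI_SELECTIVE = 3 };

/* Map an invalidation request onto the X8/X9/X10 protocol: ranges of up
 * to 16 pages take the per-VA path, larger ranges upgrade to VMALLE1IS. */
static enum tlbi_kind tlbi_classify(uint64_t start_va, uint64_t npages,
                                    uint64_t *x9, uint64_t *x10)
{
    if (npages == 0)
        return TLBI_NONE;
    if (npages > TLBI_SELECTIVE_MAX_PAGES)
        return TLBI_BROADCAST;
    *x9 = start_va & ~((1ULL << PAGE_SHIFT) - 1); /* page-aligned start VA */
    *x10 = npages;
    return TLBI_SELECTIVE;
}
```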

Stage requests on a per-vCPU TLS slot (cpu_tlbi_req in core/guest.h) rather than a guest-global accumulator. A global slot let one vCPU's syscall epilogue drain another vCPU's pending request before the second vCPU eret'd back to EL0, leaving stale translations live until the broadcast TLBI from the first vCPU caught up. With per-vCPU TLS each thread strictly owns its own request and no concurrent vCPU can read, clear, or partially observe it. The slot is C11 _Thread_local, so fork-child and CLONE_THREAD workers start with TLBI_NONE for free.
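A minimal sketch of the per-thread slot and the isolation it buys. The struct fields and helper names here are hypothetical; only the `cpu_tlbi_req` name and the C11 `_Thread_local` storage class come from the PR.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical shape of the per-vCPU accumulator. */
struct tlbi_req {
    int kind;          /* 0 = TLBI_NONE, 1 = broadcast, 3 = selective */
    uint64_t start_va; /* page-aligned start of the pending range */
    uint64_t npages;
};

/* C11 _Thread_local: each thread's slot is zero-initialized, so a fresh
 * CLONE_THREAD worker starts at TLBI_NONE (0) with no explicit reset. */
static _Thread_local struct tlbi_req cpu_tlbi_req;

static void *worker_initial_kind(void *out)
{
    *(int *)out = cpu_tlbi_req.kind; /* fresh thread: 0, not main's value */
    return NULL;
}

/* Stage a request on the main thread, then observe a new thread's slot:
 * no other thread can read, clear, or partially observe main's request. */
static int demo_isolation(void)
{
    cpu_tlbi_req.kind = 1; /* main thread stages a broadcast */
    int seen = -1;
    pthread_t t;
    if (pthread_create(&t, NULL, worker_initial_kind, &seen) != 0)
        return -1;
    pthread_join(t, NULL);
    return seen; /* 0: the worker strictly owns its own slot */
}
```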

Extend the X8 wire protocol after HVC #5: 0 skips the flush, 1 keeps the broadcast meaning, 2 stays reserved for the execve drop-frame marker the shim handles separately, and 3 selects the new selective path with X9 carrying the page-aligned start VA and X10 the page count. The shim's tlbi_selective branch issues TLBI VAE1IS in a loop with a defensive cbz x10 guard against a stray zero-count request, and tails with DSB ISH + IC IALLU + DSB + ISB so callers like file-backed mmap of executable pages still see the same I-cache invalidation as the broadcast path.
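The shim's loop runs in assembly; the sketch below models only the per-page operand computation in C, assuming the architectural TLBI VAE1IS operand layout (VA[55:12] in Xt bits [43:0], ASID field left zero). The function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Model of the tlbi_selective loop: compute the sequence of Xt values
 * the loop would feed to TLBI VAE1IS. The real shim also emits the
 * DSB ISH + IC IALLU + DSB + ISB tail after the loop. */
static size_t tlbi_selective_operands(uint64_t start_va, uint64_t npages,
                                      uint64_t *out, size_t cap)
{
    if (npages == 0) /* cbz x10 guard: a stray zero-count request is a no-op */
        return 0;
    size_t n = 0;
    for (uint64_t va = start_va; npages-- && n < cap;
         va += 1ULL << PAGE_SHIFT)
        out[n++] = (va >> PAGE_SHIFT) & ((1ULL << 44) - 1); /* VA[55:12] */
    return n;
}
```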

Switch the W^X HVC #9 fault handler in shim.S to single-page TLBI VAE1IS using FAR_EL1. Per ARM ARM B2.2.5.6, TLBI VAE1IS for any VA invalidates every cached entry containing that VA, so the per-page TLBI also retires any 2 MiB block entry the prior split_l2_block left behind. guest_split_block therefore no longer requests a separate TLBI: every caller follows it with guest_invalidate_ptes or guest_update_perms on the actually-changing range, and that subsequent per-page TLBI is sufficient.
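A sketch of the operand derivation, under the same assumed VAE1IS operand layout; the function name is hypothetical. Because the instruction invalidates every cached entry containing the VA, this one operand also covers a stale 2 MiB block entry spanning the faulting page.

```c
#include <stdint.h>

#define PAGE_MASK 0xFFFULL

/* Derive the single-page TLBI VAE1IS operand from the fault address in
 * FAR_EL1: mask down to the containing page, then pack VA[55:12] into
 * the low 44 bits of the operand register. */
static uint64_t wx_tlbi_operand(uint64_t far_el1)
{
    uint64_t page_va = far_el1 & ~PAGE_MASK;     /* page containing fault */
    return (page_va >> 12) & ((1ULL << 44) - 1); /* VA[55:12] -> Xt[43:0] */
}
```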

guest_update_perms now tracks the smallest sub-range whose L3 descriptor actually changed and only requests TLBI for that sub-range, eliminating the broadcast-on-no-op false positive previously emitted by adjacent same-perm mprotect storms (the common shape of dynamic-linker RELRO).
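The sub-range narrowing can be sketched as follows. Types and names are illustrative (real L3 descriptors are 64-bit; an int stands in here): only pages whose descriptor actually changes widen the pending range, so an all-no-op mprotect requests nothing.

```c
#include <stdint.h>

/* Accumulated bounds of the pages whose descriptor actually changed. */
struct range {
    uint64_t lo, hi; /* page VAs; valid only when any != 0 */
    int any;
};

/* Visit one page during a perms update: widen the pending TLBI sub-range
 * only when the old and new descriptors differ. */
static void note_page(struct range *r, uint64_t va, int old_desc,
                      int new_desc)
{
    if (old_desc == new_desc)
        return; /* no descriptor change: contributes no TLBI */
    if (!r->any || va < r->lo)
        r->lo = va;
    if (!r->any || va > r->hi)
        r->hi = va;
    r->any = 1;
}
```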

Clear the per-vCPU slot at the end of guest_bootstrap_create_vcpu: guest_build_page_tables and the boot-time guest_invalidate_ptes calls (stack guard, null page) accumulate TLBI requests on the main thread's TLS, but the shim's _start does its own TLBI VMALLE1IS before enabling the MMU, so the first guest syscall must not redundantly broadcast on top.


Summary by cubic

Replaces always-broadcast TLB flushes with selective per-VA invalidation and a per‑vCPU accumulator to reduce flush cost and avoid cross‑vCPU races.

  • New Features
    • Per‑vCPU _Thread_local cpu_tlbi_req accumulates pending TLB work; the syscall epilogue maps it onto X8/X9/X10 and then clears it. Bootstrap and execve also clear the slot.
    • Selective TLBI: use VAE1IS for up to 16 pages, otherwise VMALLE1IS. Extends the HVC #5 protocol: X8=0 (none), X8=1 (broadcast), X8=2 (execve), X8=3 (range via X9 start VA and X10 page count). Shim issues a per‑page TLBI loop and tails with DSB ISH + IC IALLU + DSB + ISB to preserve I‑cache behavior.
    • W^X handler now flushes only the faulting page using FAR_EL1 (single TLBI VAE1IS). guest_split_block no longer requests its own TLBI; subsequent per‑page invalidation retires the old block entry. guest_update_perms tracks the minimal changed sub‑range and only requests TLBI for that range.

Written for commit e13ade1. Summary will update on new commits.


@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 8 files

@jserv jserv merged commit 4bea0ad into main May 11, 2026
5 checks passed
@jserv jserv deleted the tlb branch May 11, 2026 02:34
