
perf(aot): inline/cold-outline three per-step hot leaves (~6% less CPU work) #7

Open: pyrex41 wants to merge 7 commits into main from perf/deep-dive-2026-06


@pyrex41 pyrex41 commented Jun 21, 2026


What

A fresh --kernel-tests leaf profile (post split-TLS / intern-cache round) flagged three small functions sitting as un-inlined call frames in the per-step hot loop. Each had a tiny hot body but a cold error path (format! / ShenError::cancelled string-building) that bloated it and blocked LLVM from folding it into its AOT call sites. Split each into a tiny #[inline] hot path + a #[cold] #[inline(never)] error constructor:

| leaf | role | self-samples (before → after) |
| --- | --- | --- |
| is_truthy | the AOT if predicate | 55 → 7 |
| charge_step | per-step budget/deadline check | 45 → 0 (inlined into eval_in) |
| make_aot_closure / global_value / fn_value | AOT lambda/value/fn | re-probed intern HashMap → now intern_static (pointer cache) |

The last three took &str and re-probed the intern HashMap on every AOT lambda/value/fn evaluation; the AOT call-target path (apply_named/apply_direct) already used the pointer-cached intern_static, so this just extends the same fast path.
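The hot/cold split described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `Value`, `ShenError`, and `not_boolean` are hypothetical stand-ins, and the real `is_truthy` in shen-rust may have a different signature and error type.

```rust
// Hypothetical stand-ins for illustration; the real shen-rust types differ.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Int(i64),
}

#[derive(Debug, PartialEq)]
struct ShenError(String);

/// Hot path: a single match with no string formatting in the body,
/// so the optimizer is free to fold it into each call site.
#[inline]
fn is_truthy(v: &Value) -> Result<bool, ShenError> {
    match v {
        Value::Bool(b) => Ok(*b),
        other => Err(not_boolean(other)),
    }
}

/// Cold path: the `format!` machinery lives here, out of line,
/// and is only reached when the predicate is not a boolean.
#[cold]
#[inline(never)]
fn not_boolean(v: &Value) -> ShenError {
    ShenError(format!("if: expected a boolean, got {:?}", v))
}
```

The design point is that the caller only ever pays for a call instruction on the error branch; the formatting code no longer counts against the inlining budget of the hot function.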

Measurement

~6.2% less CPU work (paired user-CPU min-of-13, B < A in ~9/11 runs). The machine was loaded (a video call), so wall-clock minima sat at ~2× the clean floor and were unusable; user-CPU time, which doesn't inflate when other processes steal the core, served as the contention-robust proxy. For a clean wall-clock confirmation, re-run scripts/cross-port-bench.sh on a quiet machine.

The work doesn't vanish: it is CSE'd into eval_in and the call sites. What goes away is the per-step call/return overhead and the cold-blob bloat.

Correctness

  • 134/0 across tree-walk, SHEN_RUST_VM=1, SHEN_RUST_GC=1, and --served.
  • clippy + fmt clean.
  • charge_step behaviour identical — the sticky step-budget-exhaustion semantics live unchanged in the outlined charge_step_limited; the fast path early-returns only when both budget and deadline are unset (the default).
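As a rough sketch of the charge_step shape described in the last bullet: the fast path returns immediately when neither limit is set, and everything else lives in an outlined function. The `Limits` struct, the `StepError` enum, and the exact sticky behaviour shown here are assumptions made for illustration; only the names charge_step and charge_step_limited come from the PR.

```rust
use std::time::Instant;

// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum StepError {
    BudgetExhausted,
    DeadlineExceeded,
}

// Hypothetical limits holder; `None` means "unset" (assumed to be the default).
struct Limits {
    steps_left: Option<u64>,
    deadline: Option<Instant>,
}

impl Limits {
    /// Fast path: two `None` checks, then return. Small enough to inline.
    #[inline]
    fn charge_step(&mut self) -> Result<(), StepError> {
        if self.steps_left.is_none() && self.deadline.is_none() {
            return Ok(());
        }
        self.charge_step_limited()
    }

    /// Outlined slow path: budget accounting and the deadline check.
    #[inline(never)]
    fn charge_step_limited(&mut self) -> Result<(), StepError> {
        if let Some(left) = self.steps_left.as_mut() {
            if *left == 0 {
                // Sticky: the counter stays at zero, so every later call fails too.
                return Err(StepError::BudgetExhausted);
            }
            *left -= 1;
        }
        if let Some(deadline) = self.deadline {
            if Instant::now() >= deadline {
                return Err(StepError::DeadlineExceeded);
            }
        }
        Ok(())
    }
}
```

With a budget of two steps, the first two calls succeed and every call after that reports exhaustion, while an unlimited `Limits` never leaves the fast path.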

Also

  • scripts/cross-port-bench-4way.sh — extends the headline harness to LuaJIT and PUC Lua (rust ~2× faster than LuaJIT, ~4.5× faster than PUC Lua on this suite).
  • BENCHMARKS.md records the 4-way field and the LuaJIT FNEW/UCLO trace-abort finding (filed upstream as shen-lua#27).
  • PERFORMANCE.md adds this as round 5 of the gap-closing log.

🤖 Generated with Claude Code

Reuben Brooks and others added 6 commits June 20, 2026 00:07
New crate crates/shenffi (staticlib + cdylib) embedding shen-rust behind a
small C ABI so it links into Swift/iOS apps. The default shen-rust build has
no JIT, so nothing relies on runtime codegen (App Store-safe).

Surface:
- shen_boot / shen_boot_embedded (FS-free; kernel via include_str!) /
  shen_boot_shaken (any Ratatoskr-shaken kernel+program slice) /
  shen_eval / shen_string_free / shen_free.
- shen-cas embedded: a Ratatoskr-shaken computer algebra system (298 KB
  kernel slice + 221 KB CAS KL) with shen_cas_boot / shen_cas_reduce, e.g.
  "D[Sin[x],x]" -> "[Cos x]".

Also: Swift wrapper (swift/ShenRust.swift), C header (include/shenffi.h),
XCFramework build script, README. Verified the Swift->Rust->Shen round-trip
on macOS and cross-compilation for aarch64-apple-ios (device + simulator).

Workspace member added; Cargo.lock updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Lets a native macOS app embed the same CAS. Unlike the iOS simulator,
MLX/Metal runs on Apple-silicon macOS, so the on-device model is
exercisable there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A native, cross-platform (Mac/Win/Linux) desktop calculator built on iced
0.14 that talks straight to the embedded CAS — no FFI, no Swift, no MLX.
This is the Syntax-mode MVP; English mode (a small local model via candle
mapping NL → the CAS tool grammar) is the planned next layer.

- shenffi: expose a safe Rust `CasEngine { boot, reduce }` over the existing
  private boot_shaken_inner/cas_reduce helpers, so Rust hosts can embed the
  CAS as an rlib without the C-ABI raw pointers.
- crates/shencalc-iced: the iced app. The deeply-recursive tree-walked
  reducer runs on a dedicated 64 MB-stack worker thread (the default 8 MB
  overflows on boot, matching ShenCAS.swift); the UI talks to it over
  channels and stays responsive. A `--selftest` flag reduces a fixed battery
  headlessly (no display) for CI.
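A minimal sketch of the worker arrangement described above, assuming only the standard library: a dedicated thread with a 64 MB stack that receives expressions over one channel and sends results back over another. `spawn_worker` and `reduce_stub` are hypothetical names; the stub stands in for the real CAS reduce call, whose API is not shown here.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical placeholder for the deeply recursive reducer.
fn reduce_stub(expr: &str) -> String {
    format!("reduced({})", expr)
}

/// Spawns a worker with a 64 MB stack and returns the request sender
/// and the result receiver the UI side would hold.
fn spawn_worker() -> (mpsc::Sender<String>, mpsc::Receiver<String>) {
    let (req_tx, req_rx) = mpsc::channel::<String>();
    let (res_tx, res_rx) = mpsc::channel::<String>();

    thread::Builder::new()
        .name("cas-worker".into())
        .stack_size(64 * 1024 * 1024)
        .spawn(move || {
            // The loop ends when the request sender is dropped.
            for expr in req_rx {
                if res_tx.send(reduce_stub(&expr)).is_err() {
                    break; // the receiving side is gone; stop working
                }
            }
        })
        .expect("failed to spawn worker thread");

    (req_tx, res_rx)
}
```

Because the UI thread only sends and receives messages, the recursion depth of the reducer never touches the UI thread's own stack.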

Verified: builds on iced 0.14; `--selftest` reduces D/Integrate/Factor/
Solve/Expand/arithmetic correctly; the GUI window launches cleanly.

Known gap: shows the raw CAS form ([Cos x]) — the human pretty-printer
(MathPretty.swift) is Swift-only and still needs a Rust port.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The iced app was showing the raw CAS form ([Cos x]); now it renders the
same human-readable math as the iOS/macOS apps (cos(x), 3·x², (1/3)·x³,
{2, -2}).

- pretty.rs: a faithful Rust port of MathPretty.swift — recursive-descent
  over the bracket S-expression with precedence-aware parenthesisation,
  superscript exponents, fraction coefficients, and a Head(arg, …) fallback
  for unrecognised forms. 14-case unit test locks it to the Swift output.
- worker applies pretty::render after reduce (same reduce-then-prettify
  split the Swift apps use at display time).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…U work)

A fresh --kernel-tests leaf profile (post split-TLS/intern-cache round) found
three small functions sitting as un-inlined call frames in the per-step hot
loop, each because a *cold* error path (format! / ShenError::cancelled string
building) bloated an otherwise tiny hot body and blocked LLVM from folding it
into its AOT call sites:

- is_truthy (the AOT `if` predicate): 55 -> 7 self-samples
- charge_step (per-step budget/deadline check): 45 -> 0 (inlined into eval_in)
- make_aot_closure / global_value / fn_value: re-probed the intern HashMap on
  every AOT lambda/value/fn evaluation; routed through the existing
  pointer-cached intern_static (the AOT call-target path already used it).

Each split into a tiny #[inline] hot path + a #[cold] #[inline(never)] error
constructor. Behaviour identical (sticky step-budget exhaustion preserved in
the outlined charge_step_limited). The work CSE's into eval_in / call sites;
the per-step call/return overhead and cold-blob bloat are gone.

Measured ~6.2% less CPU work (paired user-CPU min-of-13, B<A in ~9/11 runs;
wall-clock was unusable on a loaded machine so user-CPU time was the
contention-robust proxy). 134/0 across tree-walk / VM / GC / served;
clippy + fmt clean.

Also adds scripts/cross-port-bench-4way.sh (rust vs cl vs luajit vs PUC lua)
and records the round in PERFORMANCE.md / BENCHMARKS.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…t cross-port claims

Addresses the gpt-5.5 review of PR #7:

- HIGH (bench script): cross-port-bench-4way.sh redirected /usr/bin/time's
  stderr to /dev/null *inside* the timed group, before the outer `2>&1 | awk`
  could read it -> every timing came back empty. Restructured to silence the
  program's own stdout/stderr inside an inner `sh -c` so only time's report
  reaches awk. Verified it now emits real numbers. Dropped the dead bench()/
  run_* helpers the review flagged.

- Fixing the script surfaced a bad doc claim: BENCHMARKS.md asserted "rust ~2x
  faster than LuaJIT", which compared rust's *internal eval timer* against a
  *contended* LuaJIT wall-time (apples-to-oranges). Under the consistent
  harness they are roughly tied (~2.5s). Corrected the section: firm anchors
  are shen-cl fastest / PUC Lua slowest; rust-vs-LuaJIT is unresolved and must
  be re-run quiet. Kept the (load-independent) FNEW/UCLO trace-abort finding,
  reframed accurately. Noted the Lua driver's 0/0 counter readout is a
  driver bug, not skipped work (suite self-reports 100% pass).

- LOW (intern_static): the pointer cache keyed on address alone; a future
  static-str caller passing a prefix slice of another literal would collide.
  Now keyed on (addr, len). One extra compare; whole-literal callers unaffected.

134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pyrex41 commented Jun 21, 2026


Review pass (cursor-agent, GPT-5.5-high) — addressed in 2383282

Ran an independent read-only review with gpt-5.5-high. Verdict was REQUEST-CHANGES on one real bug; it confirmed the runtime change itself is correct (it even ran cargo test -p shen-rust --test budget_cancel → 5/5, validating the sticky step-budget path).

Fixed:

  • HIGH — cross-port-bench-4way.sh captured empty timings. /usr/bin/time's stderr was redirected to /dev/null inside the timed group before awk could read it. Restructured to silence the program's own output inside an inner sh -c; verified it now emits real numbers. Dropped the dead bench()/run_* helpers it flagged.
  • Correctness fallout from that fix: the working script contradicted a claim I'd written in BENCHMARKS.md ("rust ~2× faster than LuaJIT") — that number compared rust's internal eval timer to a contended LuaJIT wall-time. Under the consistent harness rust and LuaJIT are ~tied (~2.5 s). Corrected the section (firm: cl fastest, PUC Lua slowest; rust-vs-LuaJIT unresolved → re-run quiet). The load-independent FNEW/UCLO trace-abort finding stays, reframed accurately.
  • LOW — intern_static keyed on address only. A future &'static str caller passing a prefix slice of another literal would collide. Now keyed on (addr, len).
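To illustrate why the (addr, len) key matters, here is a minimal sketch of a pointer-keyed intern cache. The `Interner` and `Symbol` types and the use of a `HashMap` for the pointer cache are assumptions for illustration; the real intern_static in shen-rust may be structured quite differently.

```rust
use std::collections::HashMap;

// Hypothetical symbol handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Symbol(u32);

// Hypothetical interner with a slow by-name table and a pointer cache in front.
#[derive(Default)]
struct Interner {
    by_name: HashMap<String, Symbol>,
    by_ptr: HashMap<(usize, usize), Symbol>,
}

impl Interner {
    /// Slow path: hashes the string contents.
    fn intern(&mut self, name: &str) -> Symbol {
        if let Some(&sym) = self.by_name.get(name) {
            return sym;
        }
        let sym = Symbol(self.by_name.len() as u32);
        self.by_name.insert(name.to_owned(), sym);
        sym
    }

    /// Fast path for literals: keyed on (address, length), so a prefix slice
    /// that shares the literal's start address gets its own cache entry.
    fn intern_static(&mut self, name: &'static str) -> Symbol {
        let key = (name.as_ptr() as usize, name.len());
        if let Some(&sym) = self.by_ptr.get(&key) {
            return sym;
        }
        let sym = self.intern(name);
        self.by_ptr.insert(key, sym);
        sym
    }
}
```

With an address-only key, `&"foobar"[..3]` would hit the cache entry for `"foobar"` and return the wrong symbol; including the length keeps the two apart at the cost of one extra comparison.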

Confirmed correct, no change needed: charge_step fast-path preserves sticky exhaustion + deadline semantics; the &str → &'static str signature changes are satisfied by all AOT-emitted literals (a non-static caller would fail to compile, not silently misbehave); codegen-units = 1 is not required for cache correctness.

Re-verified: 134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.

Note on the headline ~6% number: that was measured with a different, correct harness (program stdout-only redirect + user-CPU time as the contention-robust proxy), not the buggy script — so it stands. A clean-machine wall-clock re-run is still the right confirmation.

Move the shen-cas engine (CasEngine, the shen_cas_* C ABI, and the
tree-shaken cas-*.kl slice) out of shenffi into the shen-calc repo's new
cas-engine crate, and delete the duplicate crates/shencalc-iced (the
canonical iced app lives in shen-calc). shenffi is now a program-agnostic
embedding surface over the interpreter.

Dropping shencalc-iced from the workspace members removes the entire
iced/wgpu/wayland dependency tree from Cargo.lock.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>