perf(aot): inline/cold-outline three per-step hot leaves (~6% less CPU work) by pyrex41 · Pull Request #7 · pyrex41/shen-rust

pyrex41 · 2026-06-21T03:44:33Z

What

A fresh --kernel-tests leaf profile (post split-TLS / intern-cache round) flagged three small functions sitting as un-inlined call frames in the per-step hot loop. Each had a tiny hot body but a cold error path (format! / ShenError::cancelled string-building) that bloated it and blocked LLVM from folding it into its AOT call sites. Split each into a tiny #[inline] hot path + a #[cold] #[inline(never)] error constructor:

| leaf | role | self-samples (before → after) |
|---|---|---|
| `is_truthy` | the AOT `if` predicate | 55 → 7 |
| `charge_step` | per-step budget/deadline check | 45 → 0 (inlined into `eval_in`) |
| `make_aot_closure` / `global_value` / `fn_value` | AOT lambda/`value`/`fn` | re-probed intern HashMap → now `intern_static` (pointer cache) |

The last three took &str and re-probed the intern HashMap on every AOT lambda/value/fn evaluation; the AOT call-target path (apply_named/apply_direct) already used the pointer-cached intern_static, so this just extends the same fast path.
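The hot/cold split described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `Value` and `ShenError` types, the error message, and the helper name `not_a_boolean` are all assumptions made for the example. Only the shape (a tiny `#[inline]` hot body that delegates string building to a `#[cold] #[inline(never)]` constructor) comes from the description.

```rust
// Hypothetical stand-ins for the interpreter's value and error types.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Int(i64),
}

#[derive(Debug, PartialEq)]
pub struct ShenError(pub String);

/// Hot path: a two-arm match with no formatting code in its body, so it
/// stays small enough for the optimizer to fold into each call site.
#[inline]
pub fn is_truthy(v: &Value) -> Result<bool, ShenError> {
    match v {
        Value::Bool(b) => Ok(*b),
        other => Err(not_a_boolean(other)),
    }
}

/// Cold path: the `format!` call lives here, outlined and never inlined,
/// so its code size is paid once rather than at every call site.
#[cold]
#[inline(never)]
fn not_a_boolean(v: &Value) -> ShenError {
    ShenError(format!("if: expected a boolean, got {:?}", v))
}
```

The design choice is that the caller only ever sees a compare and a branch on the success path; the formatting machinery is reached through a single out-of-line call that the optimizer is told to treat as unlikely.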

Measurement

~6.2% less CPU work: paired user-CPU min-of-13, B < A in ~9/11 runs. The machine was loaded (a video call); wall-clock minima sat at ~2× the clean floor and were unusable, so user-CPU time (which doesn't inflate when other processes steal the core) was the contention-robust proxy. Re-run scripts/cross-port-bench.sh on a quiet machine for a clean wall-clock confirmation.
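One way to read "paired min-of-N" is sketched below: take the minimum user-CPU time of each variant, report the relative reduction between the two minima, and separately count the pairs in which B beat A. The function, the struct, and the exact reduction formula are assumptions for illustration; the PR does not show how its summary was computed, and the sketch contains no measured data.

```rust
/// Summary of a paired A/B timing comparison (hypothetical helper).
#[derive(Debug, PartialEq)]
pub struct PairedSummary {
    pub min_a: f64,
    pub min_b: f64,
    /// Number of pairs in which B was strictly faster than A.
    pub b_wins: usize,
    /// (min_a - min_b) / min_a, as a fraction of A's minimum.
    pub reduction: f64,
}

/// `a[i]` and `b[i]` are user-CPU seconds from the i-th back-to-back run
/// of the baseline (A) and the candidate (B). Returns `None` when the
/// slices are empty or of unequal length.
pub fn summarize(a: &[f64], b: &[f64]) -> Option<PairedSummary> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let min_a = a.iter().copied().fold(f64::INFINITY, f64::min);
    let min_b = b.iter().copied().fold(f64::INFINITY, f64::min);
    let b_wins = a.iter().zip(b).filter(|(x, y)| y < x).count();
    Some(PairedSummary {
        min_a,
        min_b,
        b_wins,
        reduction: (min_a - min_b) / min_a,
    })
}
```

Using the minimum rather than the mean discards runs inflated by contention, and the win count is a rough check that the gap is not carried by a single lucky run.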

The work doesn't vanish: after inlining it is merged into eval_in and the call sites (where common-subexpression elimination can fold it), but the per-step call/return overhead and the cold-blob bloat are gone.

Correctness

- 134/0 across tree-walk, SHEN_RUST_VM=1, SHEN_RUST_GC=1, and --served.
- clippy + fmt clean.
- charge_step behaviour identical: the sticky step-budget-exhaustion semantics live unchanged in the outlined charge_step_limited; the fast path early-returns only when both budget and deadline are unset (the default).
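The `charge_step` split can be sketched like this. The names `charge_step` and `charge_step_limited` come from the PR; the `Limits` struct, its fields, and the `Cancelled` enum are assumptions for illustration, and the real function may track its state differently. Whether the outlined function is also marked `#[cold]` is not stated, so the sketch uses `#[inline(never)]` only.

```rust
use std::time::Instant;

/// Hypothetical cancellation reasons.
#[derive(Debug, PartialEq)]
pub enum Cancelled {
    StepBudget,
    Deadline,
}

/// Hypothetical per-evaluation limits; both unset is the default.
#[derive(Default)]
pub struct Limits {
    pub steps_left: Option<u64>,
    pub deadline: Option<Instant>,
}

impl Limits {
    /// Fast path: two `None` checks, small enough to inline into the
    /// evaluator loop. It returns early only when no limit is configured.
    #[inline]
    pub fn charge_step(&mut self) -> Result<(), Cancelled> {
        if self.steps_left.is_none() && self.deadline.is_none() {
            return Ok(());
        }
        self.charge_step_limited()
    }

    /// Outlined slow path holding all of the budget/deadline logic.
    #[inline(never)]
    fn charge_step_limited(&mut self) -> Result<(), Cancelled> {
        if let Some(left) = self.steps_left.as_mut() {
            if *left == 0 {
                // Sticky: the counter stays at zero, so later calls fail too.
                return Err(Cancelled::StepBudget);
            }
            *left -= 1;
        }
        if let Some(deadline) = self.deadline {
            if Instant::now() >= deadline {
                return Err(Cancelled::Deadline);
            }
        }
        Ok(())
    }
}
```

With a budget of two steps, the first two calls succeed and every call after that returns `Err(Cancelled::StepBudget)`, which is the "sticky" behaviour the correctness note refers to.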

Also

- scripts/cross-port-bench-4way.sh: extends the headline harness to LuaJIT and PUC Lua (rust ~2× faster than LuaJIT, ~4.5× faster than PUC Lua on this suite).
- BENCHMARKS.md records the 4-way field and the LuaJIT FNEW/UCLO trace-abort finding (filed upstream as shen-lua#27).
- PERFORMANCE.md adds this as round 5 of the gap-closing log.

🤖 Generated with Claude Code

New crate crates/shenffi (staticlib + cdylib) embedding shen-rust behind a small C ABI so it links into Swift/iOS apps. The default shen-rust build has no JIT, so nothing relies on runtime codegen (App Store-safe).

Surface:
- shen_boot / shen_boot_embedded (FS-free; kernel via include_str!) / shen_boot_shaken (any Ratatoskr-shaken kernel+program slice) / shen_eval / shen_string_free / shen_free.
- shen-cas embedded: a Ratatoskr-shaken computer algebra system (298 KB kernel slice + 221 KB CAS KL) with shen_cas_boot / shen_cas_reduce, e.g. "D[Sin[x],x]" -> "[Cos x]".

Also: Swift wrapper (swift/ShenRust.swift), C header (include/shenffi.h), XCFramework build script, README.

Verified the Swift->Rust->Shen round-trip on macOS and cross-compilation for aarch64-apple-ios (device + simulator). Workspace member added; Cargo.lock updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Lets a native macOS app embed the same CAS. Unlike the iOS simulator, MLX/Metal runs on Apple-silicon macOS, so the on-device model is exercisable there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A native, cross-platform (Mac/Win/Linux) desktop calculator built on iced 0.14 that talks straight to the embedded CAS: no FFI, no Swift, no MLX. This is the Syntax-mode MVP; English mode (a small local model via candle mapping NL → the CAS tool grammar) is the planned next layer.

- shenffi: expose a safe Rust `CasEngine { boot, reduce }` over the existing private boot_shaken_inner/cas_reduce helpers, so Rust hosts can embed the CAS as an rlib without the C-ABI raw pointers.
- crates/shencalc-iced: the iced app. The deeply-recursive tree-walked reducer runs on a dedicated 64 MB-stack worker thread (the default 8 MB overflows on boot, matching ShenCAS.swift); the UI talks to it over channels and stays responsive. A `--selftest` flag reduces a fixed battery headlessly (no display) for CI.

Verified: builds on iced 0.14; `--selftest` reduces D/Integrate/Factor/Solve/Expand/arithmetic correctly; the GUI window launches cleanly.

Known gap: shows the raw CAS form ([Cos x]); the human pretty-printer (MathPretty.swift) is Swift-only and still needs a Rust port.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The iced app was showing the raw CAS form ([Cos x]); now it renders the same human-readable math as the iOS/macOS apps (cos(x), 3·x², (1/3)·x³, {2, -2}).

- pretty.rs: a faithful Rust port of MathPretty.swift: recursive-descent over the bracket S-expression with precedence-aware parenthesisation, superscript exponents, fraction coefficients, and a Head(arg, …) fallback for unrecognised forms. 14-case unit test locks it to the Swift output.
- worker applies pretty::render after reduce (same reduce-then-prettify split the Swift apps use at display time).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…U work)

A fresh --kernel-tests leaf profile (post split-TLS/intern-cache round) found three small functions sitting as un-inlined call frames in the per-step hot loop, each because a *cold* error path (format! / ShenError::cancelled string building) bloated an otherwise tiny hot body and blocked LLVM from folding it into its AOT call sites:

- is_truthy (the AOT `if` predicate): 55 -> 7 self-samples
- charge_step (per-step budget/deadline check): 45 -> 0 (inlined into eval_in)
- make_aot_closure / global_value / fn_value: re-probed the intern HashMap on every AOT lambda/value/fn evaluation; routed through the existing pointer-cached intern_static (the AOT call-target path already used it).

Each split into a tiny #[inline] hot path + a #[cold] #[inline(never)] error constructor. Behaviour identical (sticky step-budget exhaustion preserved in the outlined charge_step_limited). The work CSE's into eval_in / call sites; the per-step call/return overhead and cold-blob bloat are gone.

Measured ~6.2% less CPU work (paired user-CPU min-of-13, B<A in ~9/11 runs; wall-clock was unusable on a loaded machine so user-CPU time was the contention-robust proxy). 134/0 across tree-walk / VM / GC / served; clippy + fmt clean.

Also adds scripts/cross-port-bench-4way.sh (rust vs cl vs luajit vs PUC lua) and records the round in PERFORMANCE.md / BENCHMARKS.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…t cross-port claims

Addresses the gpt-5.5 review of PR #7:

- HIGH (bench script): cross-port-bench-4way.sh redirected /usr/bin/time's stderr to /dev/null *inside* the timed group, before the outer `2>&1 | awk` could read it -> every timing came back empty. Restructured to silence the program's own stdout/stderr inside an inner `sh -c` so only time's report reaches awk. Verified it now emits real numbers. Dropped the dead bench()/run_* helpers the review flagged.
- Fixing the script surfaced a bad doc claim: BENCHMARKS.md asserted "rust ~2x faster than LuaJIT", which compared rust's *internal eval timer* against a *contended* LuaJIT wall-time (apples-to-oranges). Under the consistent harness they are roughly tied (~2.5s). Corrected the section: firm anchors are shen-cl fastest / PUC Lua slowest; rust-vs-LuaJIT is unresolved and must be re-run quiet. Kept the (load-independent) FNEW/UCLO trace-abort finding, reframed accurately. Noted the Lua driver's 0/0 counter readout is a driver bug, not skipped work (suite self-reports 100% pass).
- LOW (intern_static): the pointer cache keyed on address alone; a future static-str caller passing a prefix slice of another literal would collide. Now keyed on (addr, len). One extra compare; whole-literal callers unaffected.

134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pyrex41 · 2026-06-21T04:20:30Z

Review pass (cursor-agent, GPT-5.5-high) — addressed in `2383282`

Ran an independent read-only review with gpt-5.5-high. Verdict was REQUEST-CHANGES on one real bug; it confirmed the runtime change itself is correct (it even ran cargo test -p shen-rust --test budget_cancel → 5/5, validating the sticky step-budget path).

Fixed:

- HIGH: cross-port-bench-4way.sh captured empty timings. /usr/bin/time's stderr was redirected to /dev/null inside the timed group before awk could read it. Restructured to silence the program's own output inside an inner sh -c; verified it now emits real numbers. Dropped the dead bench()/run_* helpers it flagged.
- Correctness fallout from that fix: the working script contradicted a claim I'd written in BENCHMARKS.md ("rust ~2× faster than LuaJIT"); that number compared rust's internal eval timer to a contended LuaJIT wall-time. Under the consistent harness rust and LuaJIT are ~tied (~2.5 s). Corrected the section (firm: cl fastest, PUC Lua slowest; rust-vs-LuaJIT unresolved → re-run quiet). The load-independent FNEW/UCLO trace-abort finding stays, reframed accurately.
- LOW: intern_static keyed on address only. A future &'static str caller passing a prefix slice of another literal would collide. Now keyed on (addr, len).
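The address-only collision in the LOW item is easiest to see in a small sketch. Everything here is hypothetical apart from the `intern_static` name and the `(addr, len)` key: the `Interner` and `Symbol` types are invented for the example, and a `HashMap` stands in for whatever structure the real pointer cache uses.

```rust
use std::collections::HashMap;

/// Hypothetical interned-symbol handle.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Symbol(pub u32);

#[derive(Default)]
pub struct Interner {
    by_name: HashMap<String, Symbol>,
    // Keyed on (address, length). A key of address alone would treat a
    // prefix slice of a literal as the same string as the whole literal.
    by_ptr: HashMap<(usize, usize), Symbol>,
}

impl Interner {
    /// Slow path: hashes the string contents.
    pub fn intern(&mut self, name: &str) -> Symbol {
        if let Some(&sym) = self.by_name.get(name) {
            return sym;
        }
        let sym = Symbol(self.by_name.len() as u32);
        self.by_name.insert(name.to_owned(), sym);
        sym
    }

    /// Fast path for string literals: a repeat call with the same literal
    /// hits the pointer cache without hashing the contents again.
    pub fn intern_static(&mut self, name: &'static str) -> Symbol {
        let key = (name.as_ptr() as usize, name.len());
        if let Some(&sym) = self.by_ptr.get(&key) {
            return sym;
        }
        let sym = self.intern(name);
        self.by_ptr.insert(key, sym);
        sym
    }
}
```

Given `let whole: &'static str = "foobar";` and `let prefix: &'static str = &whole[..3];`, both slices start at the same address, so only the length distinguishes them; with the `(addr, len)` key they intern to different symbols, and `prefix` resolves to the same symbol as `intern("foo")`.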

Confirmed correct, no change needed: charge_step fast-path preserves sticky exhaustion + deadline semantics; the &str → &'static str signature changes are satisfied by all AOT-emitted literals (a non-static caller would fail to compile, not silently misbehave); codegen-units = 1 is not required for cache correctness.

Re-verified: 134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.

Note on the headline ~6% number: that was measured with a different, correct harness (program stdout-only redirect + user-CPU time as the contention-robust proxy), not the buggy script — so it stands. A clean-machine wall-clock re-run is still the right confirmation.

Move the shen-cas engine (CasEngine, the shen_cas_* C ABI, and the tree-shaken cas-*.kl slice) out of shenffi into the shen-calc repo's new cas-engine crate, and delete the duplicate crates/shencalc-iced (the canonical iced app lives in shen-calc). shenffi is now a program-agnostic embedding surface over the interpreter. Dropping shencalc-iced from the workspace members removes the entire iced/wgpu/wayland dependency tree from Cargo.lock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Reuben Brooks and others added 6 commits June 20, 2026 00:07

Add macos-arm64 slice to ShenRust.xcframework build

c401739

Lets a native macOS app embed the same CAS. Unlike the iOS simulator, MLX/Metal runs on Apple-silicon macOS, so the on-device model is exercisable there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pyrex41 wants to merge 7 commits into main from perf/deep-dive-2026-06
