perf(aot): inline/cold-outline three per-step hot leaves (~6% less CPU work) #7
pyrex41 wants to merge 7 commits into
New crate crates/shenffi (staticlib + cdylib) embedding shen-rust behind a small C ABI so it links into Swift/iOS apps. The default shen-rust build has no JIT, so nothing relies on runtime codegen (App Store-safe).
Surface:
- shen_boot / shen_boot_embedded (FS-free; kernel via include_str!) / shen_boot_shaken (any Ratatoskr-shaken kernel+program slice) / shen_eval / shen_string_free / shen_free.
- shen-cas embedded: a Ratatoskr-shaken computer algebra system (298 KB kernel slice + 221 KB CAS KL) with shen_cas_boot / shen_cas_reduce, e.g. "D[Sin[x],x]" -> "[Cos x]".
Also: Swift wrapper (swift/ShenRust.swift), C header (include/shenffi.h), XCFramework build script, README.
Verified the Swift->Rust->Shen round-trip on macOS and cross-compilation for aarch64-apple-ios (device + simulator). Workspace member added; Cargo.lock updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lets a native macOS app embed the same CAS. Unlike the iOS simulator, MLX/Metal runs on Apple-silicon macOS, so the on-device model is exercisable there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A native, cross-platform (Mac/Win/Linux) desktop calculator built on iced
0.14 that talks straight to the embedded CAS — no FFI, no Swift, no MLX.
This is the Syntax-mode MVP; English mode (a small local model via candle
mapping NL → the CAS tool grammar) is the planned next layer.
- shenffi: expose a safe Rust `CasEngine { boot, reduce }` over the existing
private boot_shaken_inner/cas_reduce helpers, so Rust hosts can embed the
CAS as an rlib without the C-ABI raw pointers.
- crates/shencalc-iced: the iced app. The deeply-recursive tree-walked
reducer runs on a dedicated 64 MB-stack worker thread (the default 8 MB
overflows on boot, matching ShenCAS.swift); the UI talks to it over
channels and stays responsive. A `--selftest` flag reduces a fixed battery
headlessly (no display) for CI.
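The worker arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the shencalc-iced code: `spawn_worker`, `reduce_stub`, and `nest` are hypothetical names, and the stub merely recurses to stand in for the deeply recursive reducer. Only `std::thread::Builder::stack_size` and `std::sync::mpsc` are real APIs here.

```rust
use std::sync::mpsc;
use std::thread;

// 64 MB, the stack size the commit message cites for the worker.
const WORKER_STACK_BYTES: usize = 64 * 1024 * 1024;

// Hypothetical stand-in for deep, non-tail recursion in a tree-walking reducer.
fn nest(n: u32) -> u32 {
    if n == 0 { 0 } else { 1 + nest(n - 1) }
}

// Hypothetical reducer stub: recurses, then echoes the input with the depth reached.
fn reduce_stub(expr: &str) -> String {
    let depth = nest(50_000);
    format!("{expr} (recursed {depth} levels)")
}

// Spawns the worker and returns (request sender, result receiver, join handle).
fn spawn_worker() -> (mpsc::Sender<String>, mpsc::Receiver<String>, thread::JoinHandle<()>) {
    let (req_tx, req_rx) = mpsc::channel::<String>();
    let (res_tx, res_rx) = mpsc::channel::<String>();
    let handle = thread::Builder::new()
        .name("cas-worker".into())
        .stack_size(WORKER_STACK_BYTES)
        .spawn(move || {
            // Runs until the UI side drops its sender.
            for expr in req_rx {
                if res_tx.send(reduce_stub(&expr)).is_err() {
                    break; // UI side went away
                }
            }
        })
        .expect("failed to spawn worker thread");
    (req_tx, res_rx, handle)
}

fn main() {
    let (tx, rx, handle) = spawn_worker();
    tx.send("D[Sin[x],x]".to_string()).unwrap();
    println!("{}", rx.recv().unwrap());
    drop(tx); // closing the request channel lets the worker exit
    handle.join().unwrap();
}
```

A real UI would poll the result channel (or bridge it into its event loop) rather than block on `recv`, which is what keeps the interface responsive while a reduction runs.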
Verified: builds on iced 0.14; `--selftest` reduces D/Integrate/Factor/
Solve/Expand/arithmetic correctly; the GUI window launches cleanly.
Known gap: shows the raw CAS form ([Cos x]) — the human pretty-printer
(MathPretty.swift) is Swift-only and still needs a Rust port.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The iced app was showing the raw CAS form ([Cos x]); now it renders the
same human-readable math as the iOS/macOS apps (cos(x), 3·x², (1/3)·x³,
{2, -2}).
- pretty.rs: a faithful Rust port of MathPretty.swift — recursive-descent
over the bracket S-expression with precedence-aware parenthesisation,
superscript exponents, fraction coefficients, and a Head(arg, …) fallback
for unrecognised forms. 14-case unit test locks it to the Swift output.
- worker applies pretty::render after reduce (same reduce-then-prettify
split the Swift apps use at display time).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…U work)
A fresh --kernel-tests leaf profile (post split-TLS/intern-cache round) found three small functions sitting as un-inlined call frames in the per-step hot loop, each because a *cold* error path (format! / ShenError::cancelled string building) bloated an otherwise tiny hot body and blocked LLVM from folding it into its AOT call sites:
- is_truthy (the AOT `if` predicate): 55 -> 7 self-samples
- charge_step (per-step budget/deadline check): 45 -> 0 (inlined into eval_in)
- make_aot_closure / global_value / fn_value: re-probed the intern HashMap on every AOT lambda/value/fn evaluation; routed through the existing pointer-cached intern_static (the AOT call-target path already used it).
Each split into a tiny #[inline] hot path + a #[cold] #[inline(never)] error constructor. Behaviour identical (sticky step-budget exhaustion preserved in the outlined charge_step_limited). The work CSE's into eval_in / call sites; the per-step call/return overhead and cold-blob bloat are gone.
Measured ~6.2% less CPU work (paired user-CPU min-of-13, B<A in ~9/11 runs; wall-clock was unusable on a loaded machine so user-CPU time was the contention-robust proxy).
134/0 across tree-walk / VM / GC / served; clippy + fmt clean.
Also adds scripts/cross-port-bench-4way.sh (rust vs cl vs luajit vs PUC lua) and records the round in PERFORMANCE.md / BENCHMARKS.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
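For readers unfamiliar with the pattern, the hot/cold split described in this commit might look roughly like the sketch below. It borrows the names `is_truthy`, `charge_step`, and `charge_step_limited` from the commit message, but the `Value`, `EvalError`, and `Budget` types and the `not_a_boolean` helper are hypothetical, and the budget here is steps-only (the real check also covers a deadline). It is an illustration of the technique, not the shen-rust source.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Int(i64),
}

#[derive(Debug, PartialEq)]
struct EvalError(String);

// Cold path: the format! machinery lives here, out of line, so it does not
// bloat the callers of the hot function.
#[cold]
#[inline(never)]
fn not_a_boolean(v: &Value) -> EvalError {
    EvalError(format!("expected a boolean, got {:?}", v))
}

// Hot path: a single match that is small enough to fold into call sites.
#[inline]
fn is_truthy(v: &Value) -> Result<bool, EvalError> {
    match v {
        Value::Bool(b) => Ok(*b),
        other => Err(not_a_boolean(other)),
    }
}

// Hypothetical budget: None means "no limit set" (the default).
struct Budget {
    remaining: Option<u64>,
}

// Outlined slow path: only reached when a limit is configured.
// Exhaustion is sticky: once the count hits 0 every later call fails.
#[cold]
#[inline(never)]
fn charge_step_limited(b: &mut Budget) -> Result<(), EvalError> {
    match &mut b.remaining {
        Some(0) => Err(EvalError("step budget exhausted".to_string())),
        Some(n) => {
            *n -= 1;
            Ok(())
        }
        None => Ok(()),
    }
}

// Hot path: one test and an early return in the default configuration.
#[inline]
fn charge_step(b: &mut Budget) -> Result<(), EvalError> {
    if b.remaining.is_none() {
        return Ok(());
    }
    charge_step_limited(b)
}

fn main() {
    assert_eq!(is_truthy(&Value::Bool(true)), Ok(true));
    assert!(is_truthy(&Value::Int(3)).is_err());

    let mut limited = Budget { remaining: Some(1) };
    assert!(charge_step(&mut limited).is_ok());
    assert!(charge_step(&mut limited).is_err());
    assert!(charge_step(&mut limited).is_err()); // still exhausted
}
```

`#[inline]` and `#[cold]` are hints rather than guarantees, so whether a given function actually folds into its callers still has to be confirmed with a profile, as the commit does.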
…t cross-port claims
Addresses the gpt-5.5 review of PR #7:
- HIGH (bench script): cross-port-bench-4way.sh redirected /usr/bin/time's stderr to /dev/null *inside* the timed group, before the outer `2>&1 | awk` could read it -> every timing came back empty. Restructured to silence the program's own stdout/stderr inside an inner `sh -c` so only time's report reaches awk. Verified it now emits real numbers. Dropped the dead bench()/run_* helpers the review flagged.
- Fixing the script surfaced a bad doc claim: BENCHMARKS.md asserted "rust ~2x faster than LuaJIT", which compared rust's *internal eval timer* against a *contended* LuaJIT wall-time (apples-to-oranges). Under the consistent harness they are roughly tied (~2.5s). Corrected the section: firm anchors are shen-cl fastest / PUC Lua slowest; rust-vs-LuaJIT is unresolved and must be re-run quiet. Kept the (load-independent) FNEW/UCLO trace-abort finding, reframed accurately. Noted the Lua driver's 0/0 counter readout is a driver bug, not skipped work (suite self-reports 100% pass).
- LOW (intern_static): the pointer cache keyed on address alone; a future static-str caller passing a prefix slice of another literal would collide. Now keyed on (addr, len). One extra compare; whole-literal callers unaffected.
134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
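The LOW item is easier to see with a small sketch. The `Interner` type below is hypothetical (only the name `intern_static` comes from the commit message, and the real cache is not necessarily a `HashMap`); it shows why keying a pointer cache on the address alone would conflate a literal with a prefix slice of it, and how adding the length separates them.

```rust
use std::collections::HashMap;

// Hypothetical interner: `by_name` is the slow string-keyed table,
// `by_ptr` is the pointer cache in front of it.
struct Interner {
    by_name: HashMap<String, u32>,
    by_ptr: HashMap<(usize, usize), u32>, // keyed on (addr, len)
    slow_probes: usize,                   // counts trips through the slow path
}

impl Interner {
    fn new() -> Self {
        Interner { by_name: HashMap::new(), by_ptr: HashMap::new(), slow_probes: 0 }
    }

    // Slow path: hashes the string contents on every call.
    fn intern(&mut self, name: &str) -> u32 {
        self.slow_probes += 1;
        if let Some(&id) = self.by_name.get(name) {
            return id;
        }
        let id = self.by_name.len() as u32;
        self.by_name.insert(name.to_owned(), id);
        id
    }

    // Fast path for 'static strings: the (address, length) pair identifies
    // the exact slice, so a prefix of a longer literal gets its own entry.
    fn intern_static(&mut self, name: &'static str) -> u32 {
        let key = (name.as_ptr() as usize, name.len());
        if let Some(&id) = self.by_ptr.get(&key) {
            return id;
        }
        let id = self.intern(name);
        self.by_ptr.insert(key, id);
        id
    }
}

static FULL: &str = "value-of";

fn main() {
    let prefix: &'static str = &FULL[..5]; // "value", same start address as FULL
    assert_eq!(FULL.as_ptr(), prefix.as_ptr());

    let mut interner = Interner::new();
    let a = interner.intern_static(FULL);
    let a_again = interner.intern_static(FULL); // served from the pointer cache
    let b = interner.intern_static(prefix);

    assert_eq!(a, a_again);
    assert_ne!(a, b); // an address-only key would have returned `a` here
    assert_eq!(interner.slow_probes, 2);
}
```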
Review pass (cursor-agent, GPT-5.5-high): addressed in 2383282
Ran an independent read-only review with
Fixed:
Confirmed correct, no change needed:
Re-verified: 134/0 (tree-walk + VM), fmt + clippy clean, intern unit tests pass.
Move the shen-cas engine (CasEngine, the shen_cas_* C ABI, and the tree-shaken cas-*.kl slice) out of shenffi into the shen-calc repo's new cas-engine crate, and delete the duplicate crates/shencalc-iced (the canonical iced app lives in shen-calc). shenffi is now a program-agnostic embedding surface over the interpreter. Dropping shencalc-iced from the workspace members removes the entire iced/wgpu/wayland dependency tree from Cargo.lock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What

A fresh `--kernel-tests` leaf profile (post split-TLS / intern-cache round) flagged three small functions sitting as un-inlined call frames in the per-step hot loop. Each had a tiny hot body but a cold error path (`format!` / `ShenError::cancelled` string-building) that bloated it and blocked LLVM from folding it into its AOT call sites. Split each into a tiny `#[inline]` hot path + a `#[cold] #[inline(never)]` error constructor:

| Function | Role | Change |
| --- | --- | --- |
| `is_truthy` | AOT `if` predicate | 55 -> 7 self-samples |
| `charge_step` | per-step budget/deadline check | 45 -> 0 self-samples (inlined into `eval_in`) |
| `make_aot_closure` / `global_value` / `fn_value` | AOT lambda / `value` / `fn` evaluation | routed through `intern_static` (pointer cache) |

The last three took `&str` and re-probed the intern HashMap on every AOT lambda/`value`/`fn` evaluation; the AOT call-target path (`apply_named` / `apply_direct`) already used the pointer-cached `intern_static`, so this just extends the same fast path.

Measurement
~6.2% less CPU work: paired user-CPU min-of-13, B < A in ~9/11 runs. The machine was loaded (a video call); wall-clock minima sat at ~2× the clean floor and were unusable, so user-CPU time (which doesn't inflate when other processes steal the core) was the contention-robust proxy. Re-run `scripts/cross-port-bench.sh` on a quiet machine for a clean wall-clock confirmation.

The work doesn't vanish (it CSE's into `eval_in` and the call sites), but the per-step call/return overhead and the cold-blob bloat are gone.

Correctness
- 134/0 across tree-walk, `SHEN_RUST_VM=1`, `SHEN_RUST_GC=1`, and `--served`.
- `clippy` + `fmt` clean.
- `charge_step` behaviour identical: the sticky step-budget-exhaustion semantics live unchanged in the outlined `charge_step_limited`; the fast path early-returns only when both budget and deadline are unset (the default).

Also
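- `scripts/cross-port-bench-4way.sh`: extends the headline harness to LuaJIT and PUC Lua. The initial run put rust ~2× ahead of LuaJIT and ~4.5× ahead of PUC Lua on this suite; the review follow-up commit found the LuaJIT figure was an apples-to-oranges comparison and that the two are roughly tied (~2.5s) under a consistent harness, so rust-vs-LuaJIT should be treated as unresolved.
- `BENCHMARKS.md` records the 4-way field and the LuaJIT `FNEW`/`UCLO` trace-abort finding (filed upstream as shen-lua#27).
- `PERFORMANCE.md` adds this as round 5 of the gap-closing log.

🤖 Generated with Claude Code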