This page consolidates the public experiment bundle that supports the current accepted host surface.
| Surface | IFEval | Official LogiQA | External Dev | External Blind |
|---|---|---|---|---|
| Frozen authoritative parent | 0.780037 | 0.303738 | 0.304598 | 0.425072 |
| Current no-trunk ablation | 0.780037 | 0.303738 | 0.304598 | 0.425072 |
| Earlier three-family promoted surface | 0.780037 | 0.308411 | 0.304598 | 0.425072 |
| Single-family official BC surface | 0.780037 | 0.348910 | 0.308908 | 0.425072 |
| Single-family official DBB-BC surface | 0.780037 | 0.353583 | 0.308908 | 0.425072 |
| Current accepted host surface | 0.780037 | 0.392523 | 0.308908 | 0.425072 |
Source:
Interpretation:
- the current accepted host surface improves official LogiQA by +0.088785 over the frozen parent
- the no-trunk ablation collapses exactly back to the parent boundary
- replayable same-parent slices improve on the parent but do not subsume the current surface
Public second-task validation uses the same authoritative IFEval path already carried by the official dual-bench reports.
Result:
- frozen parent IFEval = 0.780037
- no-trunk ablation IFEval = 0.780037
- earlier three-family surface IFEval = 0.780037
- single-family BC surface IFEval = 0.780037
- single-family DBB-BC surface IFEval = 0.780037
- current accepted host surface IFEval = 0.780037
Interpretation:
- the published same-parent progression does not buy the LogiQA gain by sacrificing the public instruction-following task
- this is valid cross-task non-regression evidence
- this is not yet evidence of positive transfer or broad multi-task generalization
Source:
The repo now also publishes a true parent-derived second-task capability line on a separate small industrial-state decision benchmark.
Result on the published 8-sample industrial_state_decision_logiqa_v2 benchmark:
- frozen parent = 0.25
- exact no-trunk ablation = 0.25
- same-parent trained linear baseline = 0.25
- current accepted host surface = 0.375
- parent-derived independent industrial second line = 0.625
Pairwise independent-line replay:
- independent vs parent: improved = 3, harmed = 0
- independent vs no-trunk: improved = 3, harmed = 0
- independent vs trained linear: improved = 3, harmed = 0
- independent vs current surface: improved = 2, harmed = 0
Interpretation:
- this is the clearest current public evidence that the same frozen-parent trunk pipeline can instantiate a new second-task capability line rather than only improve the current LogiQA surface
- exact no-trunk and trained linear controls remain at the frozen-parent level on this benchmark
- the current accepted host surface transfers somewhat, but the dedicated parent-derived industrial line is stronger
- because the benchmark is still small, this should be read as second-task capability-production evidence rather than proof of broad industrial generalization
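For readers replaying these pairwise counts, improved and harmed follow the usual paired convention: improved counts samples the candidate line answers correctly while the control answers incorrectly, and harmed counts the reverse. A minimal counting sketch, assuming per-sample correctness flags are available; the function name and example arrays are illustrative, not the repo's actual replay script:

```python
def pairwise_replay(candidate_correct, control_correct):
    """Paired win/loss counts between a candidate line and a control.

    Both arguments are equal-length lists of booleans, one entry per
    benchmark sample, marking whether that artifact answered correctly.
    """
    improved = sum(c and not b for c, b in zip(candidate_correct, control_correct))
    harmed = sum(b and not c for c, b in zip(candidate_correct, control_correct))
    return improved, harmed

# Illustrative 8-sample replay: three samples flip from wrong to right, none regress.
independent = [True, True, True, True, True, False, False, False]     # 5/8 = 0.625
parent      = [True, True, False, False, False, False, False, False]  # 2/8 = 0.25
print(pairwise_replay(independent, parent))  # (3, 0)
```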
Source:
The repo also freezes a public protocol for the industrial second line:
- train source: industrial_wrapper_protected_slice_v1 (18 samples)
- external dev: industrial_state_decision_logiqa_lite_v1 (10 samples)
- blind slice: industrial_state_decision_logiqa_v2 (8 samples)
- non-regression monitor: official IFEval
Candidate under the frozen protocol:
- checkpoint: independent industrial second-line subset 5+6+8+10
- dev = 0.2
- blind = 0.5
- official IFEval = 0.781885
Frozen external-union summary after protocol freeze:
- external union samples = 18
- candidate = 0.333333
- frozen parent = 0.166667
- exact no-trunk = 0.166667
- trained linear baseline = 0.166667
- transferred current surface = 0.277778
External-union pairwise replay:
- candidate vs parent: improved = 3, harmed = 0
- candidate vs no-trunk: improved = 3, harmed = 0
- candidate vs trained linear: improved = 3, harmed = 0
- candidate vs current surface: improved = 1, harmed = 0
Controls under the same protocol:
- frozen parent: dev = 0.1, blind = 0.25
- exact no-trunk: dev = 0.1, blind = 0.25
- trained linear baseline: dev = 0.1, blind = 0.25
- current accepted host surface: dev = 0.2, blind = 0.375
Gate result:
- dev above parent: pass
- dev above no-trunk: pass
- dev above trained linear: pass
- dev at least current: pass
- blind above parent: pass
- blind above no-trunk: pass
- blind above trained linear: pass
- blind above current: pass
- IFEval non-regression vs parent: pass
- overall protocol gate: pass
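The gate can be replayed directly from the published dev, blind, and IFEval numbers. The sketch below is illustrative (names and structure are not the repo's actual gating code); the only non-strict comparisons are the dev check against the current surface and the IFEval non-regression monitor:

```python
# Illustrative replay of the frozen protocol gate from the published scores.
candidate = {"dev": 0.2, "blind": 0.5, "ifeval": 0.781885}
controls = {
    "parent":         {"dev": 0.1, "blind": 0.25, "ifeval": 0.780037},
    "no_trunk":       {"dev": 0.1, "blind": 0.25},
    "trained_linear": {"dev": 0.1, "blind": 0.25},
    "current":        {"dev": 0.2, "blind": 0.375},
}

checks = {
    # dev must beat parent / no-trunk / trained linear, and at least match current
    "dev above parent": candidate["dev"] > controls["parent"]["dev"],
    "dev above no-trunk": candidate["dev"] > controls["no_trunk"]["dev"],
    "dev above trained linear": candidate["dev"] > controls["trained_linear"]["dev"],
    "dev at least current": candidate["dev"] >= controls["current"]["dev"],
    # blind must strictly beat every control, including the current surface
    "blind above parent": candidate["blind"] > controls["parent"]["blind"],
    "blind above no-trunk": candidate["blind"] > controls["no_trunk"]["blind"],
    "blind above trained linear": candidate["blind"] > controls["trained_linear"]["blind"],
    "blind above current": candidate["blind"] > controls["current"]["blind"],
    # non-regression monitor on the official IFEval path
    "IFEval non-regression vs parent": candidate["ifeval"] >= controls["parent"]["ifeval"],
}
checks["overall protocol gate"] = all(checks.values())

for name, ok in checks.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```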
Interpretation:
- this is the current strongest public evidence that the same frozen-parent trunk pipeline can produce a new second-task capability line under a fixed auditable protocol
- the protocol is stronger than a single held-out score because it fixes train source, external dev, blind slice, and a non-regression monitor
- the frozen external union thickens the evidence beyond a single blind slice: after protocol freeze, the candidate stays above parent, no-trunk, trained linear, and the transferred current surface across all 18 published external samples
- the benchmark is still small, so this should be presented as bounded capability-production evidence rather than proof of broad industrial generalization
Source:
The repo now also publishes a small second-task probe that wraps current-win reasoning shapes in industrial and embedded review language while preserving a LogiQA-style multiple-choice reasoning surface.
Result on the published 10-sample probe:
- frozen parent = 0.1
- exact no-trunk ablation = 0.1
- same-parent trained linear baseline = 0.1
- current accepted host surface = 0.2
Pairwise current-versus-control replay:
- current vs parent: improved = 1, harmed = 0
- current vs no-trunk: improved = 1, harmed = 0
- current vs trained linear: improved = 1, harmed = 0
Interpretation:
- this is a weak but positive task-transfer signal under a light domain shift toward deployment-shaped language
- the gain still localizes to the promoted trunk blocks because exact no-trunk falls back to the frozen-parent level
- the same-parent trained linear baseline also stays at the frozen-parent level
- this is still an exploratory transfer probe, not a claim of broad industrial generalization
Source:
The repo also publishes a curated protected slice built from public current-win reasoning shapes and wrapped in industrial or embedded review language.
Result on the published 18-sample protected slice:
- frozen parent = 0.166667
- exact no-trunk ablation = 0.166667
- same-parent trained linear baseline = 0.166667
- current accepted host surface = 0.333333
Pairwise current-versus-control replay:
- current vs parent: improved = 3, harmed = 0
- current vs no-trunk: improved = 3, harmed = 0
- current vs trained linear: improved = 3, harmed = 0
Interpretation:
- this is cleaner localized transfer evidence than the broader lightly domain-shifted wrapper probe
- exact no-trunk and trained linear staying at the frozen-parent level keep the causal story intact
- because the slice is curated, it should be read as protected-slice evidence, not as a broad naturally sampled second-task benchmark
Source:
An additional exploratory probe was run on GSM8K under the same frozen-parent boundary.
Result:
- official GSM8K: parent 0.019712, current 0.019712, no-trunk 0.019712
- external dev GSM8K: parent 0.016064, current 0.016064, no-trunk 0.016064
- external blind GSM8K: parent 0.024096, current 0.024096, no-trunk 0.024096
Interpretation:
- no positive second-task gain is currently visible on GSM8K
- the present public object is still best described as a bounded LogiQA specialist surface
- the next real multi-task step is not more wording, but a new second-task family and promotion path
Source:
Current public exact ablation:
- remove the promoted trunk blocks from the current checkpoint
- replay the same authoritative and external pipelines
Result:
- official LogiQA returns from 0.392523 to 0.303738
- external_dev returns from 0.308908 to 0.304598
- external_blind remains 0.425072
Interpretation:
- the public gain localizes to the promoted trunk blocks rather than generic checkpoint drift
Source:
Current public paired replay versus the no-trunk / parent boundary:
- official LogiQA delta = +0.088785
- official improved only = 66
- official harmed only = 9
- official bootstrap 95% CI = [0.063863, 0.115265]
- official exact McNemar p = 7.658840702050266e-12
- external dev delta = +0.004310
- external blind delta = 0.000000
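The improved-only / harmed-only counts feed an exact McNemar test over the discordant pairs, and the delta interval comes from a paired bootstrap over samples. A minimal replay sketch, assuming per-sample correctness arrays are available; the function name, resampling scheme, and seed are illustrative and may differ from the repo's published scripts:

```python
import numpy as np
from scipy.stats import binomtest

def paired_replay_stats(cand_correct, base_correct, n_boot=10_000, seed=0):
    """Accuracy delta, discordant counts, bootstrap 95% CI, and exact McNemar p."""
    cand = np.asarray(cand_correct, dtype=bool)
    base = np.asarray(base_correct, dtype=bool)
    improved = int(np.sum(cand & ~base))   # candidate right, baseline wrong
    harmed = int(np.sum(~cand & base))     # baseline right, candidate wrong
    delta = cand.mean() - base.mean()

    # Exact McNemar: two-sided binomial test restricted to the discordant pairs.
    p = binomtest(min(improved, harmed), improved + harmed, 0.5).pvalue

    # Paired percentile bootstrap over samples for the accuracy delta.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(cand), size=(n_boot, len(cand)))
    deltas = cand[idx].mean(axis=1) - base[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(deltas, [2.5, 97.5])
    return delta, improved, harmed, (ci_low, ci_high), p
```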
Interpretation:
- the official gain is statistically strong
- external behavior does not show an obvious published collapse
Source:
The public repo now also ships fifteen extra LogiQA-local controls:
| Control | Official LogiQA | External Dev | External Blind | Interpretation |
|---|---|---|---|---|
| current surface, published-eval route-disabled proxy | 0.244548 | 0.261494 | 0.322767 | disabling route selectivity on the published evaluation surface causes a large collapse |
| current surface, no topology gate | 0.280374 | 0.295977 | 0.371758 | removing parent topology conditions substantially harms both official and external behavior |
| current surface, depth = 1 | 0.323988 | 0.304598 | 0.423631 | one recurrent step keeps only part of the gain |
| frozen parent + BC low-rank adapter control | 0.303738 | 0.304598 | 0.425072 | architecture-near adapter control does not improve over the frozen parent |
| frozen parent + DBB-BC low-rank adapter control | 0.303738 | 0.304598 | 0.425072 | architecture-near adapter control does not improve over the frozen parent |
| frozen parent + BC target-only diagnostic control | 0.308411 | 0.304598 | 0.425072 | target-derived blocks alone give only a small gain and do not match support-derived family performance |
| frozen parent + DBB-BC target-only diagnostic control | 0.308411 | 0.304598 | 0.425072 | target-derived blocks alone give only a small gain and do not match support-derived family performance |
| frozen parent + trained linear-readout baseline | 0.300623 | 0.303161 | 0.425072 | a classic trained same-parent linear control does not reproduce the promoted-surface gain |
| frozen parent + trained BitFit option-bias baseline | 0.266355 | 0.297414 | 0.412104 | a same-parent BitFit-style bias-only comparator is also weaker than the frozen parent |
| frozen parent + trained LoRA-style hash-delta baseline | 0.303738 | 0.304598 | 0.425072 | a same-parent LoRA-style factorized hash-delta comparator remains at the frozen-parent level |
| frozen parent + trained low-rank adapter baseline | 0.303738 | 0.304598 | 0.425072 | a same-parent trained low-rank adapter-style PEFT comparator also stays at the frozen-parent level |
| frozen parent + budget-matched LoRA-style baseline | 0.303738 | 0.304598 | 0.425072 | a trainable-budget-matched same-parent LoRA-style comparator also stays at the frozen-parent level |
| frozen parent + budget-matched low-rank adapter baseline | 0.303738 | 0.304598 | 0.425072 | a trainable-budget-matched same-parent low-rank adapter comparator also stays at the frozen-parent level |
| frozen parent + retrieval-only baseline | 0.291277 | 0.287356 | 0.432277 | a classical retrieval-only comparator is weaker than the frozen parent on official and external dev |
| frozen parent + lexical-only baseline | 0.289720 | 0.260057 | 0.273775 | a classical lexical-only comparator is substantially weaker than the frozen parent |
Interpretation:
- route selectivity is necessary rather than optional
- parent topology constraints are necessary, not cosmetic
- recurrence depth matters for the current surface
- support-derived families are materially stronger than target-only diagnostic blocks
- retrieval-only, lexical-only, trained linear, trained BitFit, trained LoRA-style hash-delta, trained low-rank adapter, and trainable-budget-matched PEFT comparators all fail to reach the accepted host surface
- simply attaching same-parent low-rank adapter-style family controls does not reproduce the accepted host-surface gain
Source:
The public repo now exposes a three-family causal sequence built from public BC, DBB-BC, and AD families.
Forward addition from the frozen parent:
- BC: official 0.308411
- BC + DBB-BC: official 0.313084
- BC + DBB-BC + AD: official 0.317757
Reverse removal from the current surface:
- current - AD: official 0.387850
- current - AD - DBB-BC: official 0.383178
- current - AD - DBB-BC - BC: official 0.378505
Order-swap result on the published evals:
- DBB-BC -> BC -> AD: official 0.317757
- AD -> BC -> DBB-BC: official 0.317757
Interpretation:
- the three public families produce a monotone gain when added from the frozen parent
- removing the same three families from the current surface causes a monotone loss
- for this small published three-family subset, order swap did not change the published official or external scores
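The forward-addition and reverse-removal results above follow a simple sweep over the three public families. The sketch below shows the shape of that sweep; apply_families, remove_families, and run_official_eval are hypothetical helpers standing in for the repo's actual checkpoint-assembly and evaluation tooling:

```python
FAMILIES = ["BC", "DBB-BC", "AD"]

def forward_addition(parent_ckpt, apply_families, run_official_eval):
    """Add families one at a time on top of the frozen parent and score each step."""
    scores, active = {}, []
    for fam in FAMILIES:
        active.append(fam)
        scores[" + ".join(active)] = run_official_eval(apply_families(parent_ckpt, active))
    return scores

def reverse_removal(current_ckpt, remove_families, run_official_eval):
    """Remove the same families one at a time from the current surface and score each step."""
    scores, removed = {}, []
    for fam in reversed(FAMILIES):
        removed.append(fam)
        scores["current - " + " - ".join(removed)] = run_official_eval(
            remove_families(current_ckpt, removed))
    return scores
```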
Source:
Representative public locality evidence:
- official_bc_zeroinit_routes, narrow probe: targets fixed 3 / 3, off-target fires 0, collateral flips 0
- official_bc_zeroinit_routes, protected slice: targets fixed 3 / 3, off-target fires 0, collateral flips 0
- official_dbb_bc_support_b1_additive, narrow probe: targets fixed 3 / 3, off-target fires 0, collateral flips 0
- official_dbb_bc_support_b1_additive, protected slice: targets fixed 3 / 3, off-target fires 0, collateral flips 0
Wider public locality context:
- official_ad_support_b1_additive, narrow probe: targets fixed 3 / 3, off-target fires 0, collateral flips 0
- official_ad_support_b1_additive, protected slice: targets fixed 3 / 3, off-target fires 6, collateral flips 0
Interpretation:
- the tightest representative evidence supports the family-local repair story
- the wider probe set also shows that not every family is equally narrow
Source:
Current public negative boundary:
- clean holdout2 = 0.400000
- blind_remainder = 0.448133
- additional raw release-boundary audits are available under ../results/audit
Interpretation:
- the public line is bounded and specialized
- it should not be sold as broad unseen-family reasoning generalization
Sources:
This repo now publishes:
- exact no-trunk ablation
- route-disabled, topology-removal, and depth-one controls
- target-only diagnostic controls
- retrieval-only, lexical-only, and trained linear classic comparators
- trained same-parent BitFit option-bias comparator
- trained same-parent LoRA-style factorized hash-delta comparator
- trained same-parent low-rank adapter-style comparator
- trained same-parent trainable-budget-matched LoRA-style comparator
- trained same-parent trainable-budget-matched low-rank adapter comparator
- architecture-near same-parent low-rank adapter controls
The current public matched-budget protocol uses the promoted trunk numeric payload as the target budget. Under that protocol:
- the trainable-budget target is 3123 published trunk numeric payload values
- the matched LoRA-style control uses rank = 1, hash_dim = 3123, trainable parameter count = 3124
- the matched low-rank adapter control uses rank = 1, hash_dim = 3123, trainable parameter count = 3124
- both matched controls replay exactly at the frozen-parent metric level
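One way to read the published counts: a rank-1 factorization over a 3123-dimensional hashed feature space with a scalar output head carries 3123 down-projection values plus 1 up-projection value, which lands exactly on the reported 3124 trainable parameters. This decomposition is an inference from the numbers above, not a published spec:

```python
# Hypothetical parameter accounting for the rank = 1, hash_dim = 3123 matched controls.
rank, hash_dim, out_dim = 1, 3123, 1
trainable = hash_dim * rank + rank * out_dim   # down-projection + up-projection
budget_target = 3123                           # published trunk numeric payload values
print(trainable, trainable >= budget_target)   # 3124 True
```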
So the current public claim is:
- evidence for a constrained auditable capability-upgrade workflow
It is not:
- a blanket dominance claim over all nearby PEFT or local-editing methods
That limitation is intentional and visible.
Published open-input evidence now has two separate tracks:
- host-side structured open-input loop
- board-side ESP32-C3 narrow open-input micro-loop
Current board-side micro-loop result:
- samples = 36
- exact match = 1.0
- nonempty rate = 1.0
- stability rate = 1.0
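A sketch of how these three rates can be computed from per-sample board transcripts. The record layout and the stability definition (identical output across repeated board runs of the same prompt) are assumptions here; the authoritative definitions live in the linked result JSON files:

```python
def micro_loop_metrics(records):
    """records: list of dicts with 'board_runs' (output strings from repeated
    runs of one prompt) and 'host_reference' (expected string) per sample."""
    n = len(records)
    exact = sum(r["board_runs"][0] == r["host_reference"] for r in records) / n
    nonempty = sum(bool(r["board_runs"][0].strip()) for r in records) / n
    stable = sum(len(set(r["board_runs"])) == 1 for r in records) / n
    return {"samples": n, "exact_match": exact,
            "nonempty_rate": nonempty, "stability_rate": stable}
```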
Source:
- ../results/open_input_demo/host_open_input_demo.json
- ../results/open_input_demo/mcu_open_input_demo.json
The public repo now also includes a narrow board-side comparator bundle on the same ESP32-C3 fixed 642-sample path:
| Board artifact | Accuracy | Correct | Host full match |
|---|---|---|---|
| current host surface | 0.387850 | 249 / 642 | 642 / 642 |
| frozen parent | 0.302181 | 194 / 642 | 642 / 642 |
| parent trained linear baseline | 0.299065 | 192 / 642 | 642 / 642 |
Interpretation:
- the current host surface keeps a clear board-side margin over the frozen parent
- a simple trained same-parent linear baseline also fails on the real board path
- all three artifacts preserve exact host-full decision alignment on the compiled batches
Scope boundary:
- this board comparator bundle is intentionally narrow
- stronger route-heavy PEFT-style baselines are published on host-side evals, but they do not currently fit the public ESP32-C3 artifact budget
Source: