
Phase 4 — Evaluation and write-up

Month 4. Compare the trained decomposition against every reasonable baseline, produce the critical causal test, and write the paper.

Deliverable: a paper draft (ideally ready for arXiv / workshop submission), a public code release, and a short summary of what was and was not learned.

Nav: ← Phase 3 · Implementation · Proposal · Glossary


Step 1 — Baselines, precisely specified

For every comparison I need an apples-to-apples counterpart trained and evaluated on the same activations, with a defensible hyperparameter choice. No baseline gets to be "the checkpoint the original paper used" evaluated on different data.

Baseline A — Per-layer SAE.

  • On Pythia-410m, use public Pythia SAE checkpoints where available. If the available checkpoints do not cover the layers I need, train minimal SAEs on the exact activations I used, matching width to the crosscoder baseline.
  • Metrics: reconstruction MSE on $v_\ell$ at each layer, L0 sparsity, Pareto curve across L1 penalties.
  • For the comparison to my method (which predicts $\Delta_\ell$), I also need the per-layer SAE's implied prediction of $\Delta_\ell$: fit SAEs at layers $\ell$ and $\ell+1$, compute reconstructed updates $\hat v_{\ell+1} - \hat v_\ell$, evaluate. This is an indirect comparison — the SAE wasn't trained for this — and the bias runs against the SAE: $\hat v_{\ell+1} - \hat v_\ell = \Delta_\ell - (\epsilon_{\ell+1} - \epsilon_\ell)$ where $\epsilon_\ell = v_\ell - \hat v_\ell$ is the per-layer SAE reconstruction residual, so the differencing adds two independent error terms onto the quantity the comparison scores. Report the raw $v_\ell$ reconstruction quality alongside the $\Delta_\ell$ baseline and be upfront about the bias direction in the write-up.
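
A minimal sketch of this indirect scoring, assuming each SAE is exposed as a callable that maps activations to reconstructions (the `sae_l` / `sae_lp1` handles and tensor shapes are illustrative, not tied to any particular SAE library):

```python
import torch

def sae_implied_delta_mse(sae_l, sae_lp1, v_l, v_lp1):
    """Score per-layer SAEs on the update Delta_l = v_{l+1} - v_l they were not trained for.

    sae_l, sae_lp1: callables mapping activations [n, d_model] -> reconstructions [n, d_model].
    v_l, v_lp1: residual-stream states at layers l and l+1 on the same examples.
    """
    v_l_hat = sae_l(v_l)              # hat v_l
    v_lp1_hat = sae_lp1(v_lp1)        # hat v_{l+1}
    delta = v_lp1 - v_l               # true update Delta_l
    delta_hat = v_lp1_hat - v_l_hat   # implied update = Delta_l - (eps_{l+1} - eps_l)
    return {
        "delta_mse": (delta_hat - delta).pow(2).mean().item(),
        # raw per-layer reconstruction quality, reported alongside so the bias direction is visible
        "v_l_mse": (v_l_hat - v_l).pow(2).mean().item(),
        "v_lp1_mse": (v_lp1_hat - v_lp1).pow(2).mean().item(),
    }
```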

Baseline B — Crosscoder / MLSAE.

  • If public crosscoder weights for Pythia-410m exist, use them. Otherwise, replicate the published recipe (Lindsey et al. 2024; Lawson et al. 2024) on the same activations with the same dictionary width.
  • Metrics: direct reconstruction of $\Delta_\ell$. The two methods are duals of each other on the activation/decoder axis (see the glossary entry for crosscoder and for MLSAE) and consequently have different scoring formulas — do not collapse them.
    • Crosscoder (one encoder, many decoders). Shared activation $a^f$ per feature per example; per-layer decoders $d_\ell^f$. Per-layer reconstruction $\hat v_\ell = \sum_f a^f d_\ell^f$, implied update $\hat\Delta_\ell = \sum_f a^f (d_{\ell+1}^f - d_\ell^f)$ — shared activation times decoder difference.
    • MLSAE (shared dictionary, per-layer activations). Per-layer activations $a_\ell^f$ (the property the glossary highlights — "individual latents are often active on only one layer for a given token" would not be a meaningful claim if activations were shared across layers); shared decoder $d^f$ from the shared dictionary. Per-layer reconstruction $\hat v_\ell = \sum_f a_\ell^f d^f$, implied update $\hat\Delta_\ell = \sum_f (a_{\ell+1}^f - a_\ell^f) d^f$ — activation difference times shared decoder.
    • Crosscoder variants with both per-layer activations and per-layer decoders (e.g. weak-causal crosscoder variants in Lindsey et al. 2024). Score with the fully general form $\hat\Delta_\ell = \sum_f (a_{\ell+1}^f d_{\ell+1}^f - a_\ell^f d_\ell^f)$ and record the variant explicitly in the paper's methods section, since the formula above for the standard ("acausal") crosscoder no longer applies.
  • The error structure of all three forms is the same as Baseline A's per-layer SAE: $\hat\Delta_\ell = \Delta_\ell - (\epsilon_{\ell+1} - \epsilon_\ell)$ where $\epsilon_\ell = v_\ell - \hat v_\ell$, so the differencing introduces two error terms onto the quantity scored. Report raw $v_\ell$ reconstruction quality alongside the $\Delta_\ell$ baseline and be upfront about the bias direction in the write-up.
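
The three scoring formulas differ only in which factor carries the layer index, which is easy to get wrong in code. A minimal sketch, assuming activations and decoders have already been materialised as dense arrays (all names and shapes are illustrative):

```python
import numpy as np

def implied_delta_crosscoder(a, d_l, d_lp1):
    # a: [n, F] shared activations; d_l, d_lp1: [F, D] per-layer decoders
    return a @ (d_lp1 - d_l)                 # shared activation times decoder difference

def implied_delta_mlsae(a_l, a_lp1, d):
    # a_l, a_lp1: [n, F] per-layer activations; d: [F, D] shared decoder
    return (a_lp1 - a_l) @ d                 # activation difference times shared decoder

def implied_delta_general(a_l, a_lp1, d_l, d_lp1):
    # fully general form for variants with per-layer activations AND per-layer decoders
    return a_lp1 @ d_lp1 - a_l @ d_l

def delta_mse(delta_hat, v_l, v_lp1):
    # score against the true update Delta_l = v_{l+1} - v_l
    return float(np.mean((delta_hat - (v_lp1 - v_l)) ** 2))
```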

Baseline C — Tuned Lens.

  • Public Tuned Lens implementation. Report its per-layer predicted output distribution and its KL divergence from the final-layer distribution as a function of depth.
  • This is a different kind of baseline: it does not produce features, but it quantifies how fast information "settles" with depth, which is a natural complement to the effective-depth results from Phase 2.
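
A minimal sketch of the depth profile, assuming the per-layer Tuned Lens logits have already been extracted (this does not use the Tuned Lens package's own evaluation harness, and the direction of the KL is a reporting choice to fix once in the methods section):

```python
import torch
import torch.nn.functional as F

def kl_from_final_by_depth(lens_logits, final_logits):
    """Mean KL(final || lens_l) per layer, in nats.

    lens_logits: [L, n, vocab] Tuned Lens logits at each layer (assumed precomputed).
    final_logits: [n, vocab] the model's final-layer logits on the same examples.
    """
    log_p = F.log_softmax(final_logits, dim=-1)       # reference distribution
    p = log_p.exp()
    profile = []
    for l in range(lens_logits.shape[0]):
        log_q = F.log_softmax(lens_logits[l], dim=-1)
        kl = torch.sum(p * (log_p - log_q), dim=-1)   # per-example KL at layer l
        profile.append(kl.mean().item())
    return profile                                    # one value per layer; should fall with depth
```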

Baseline D — Cross-Layer Transcoder (CLT).

  • The primary conceptual comparison. Proposal §4.2 positions the hypothesis as a shared-kernel reparameterization of CLT's $O(L^2)$ decoder matrices, and the H2-holds outcome is explicitly "my decomposition reconstructs comparably to CLT" — which requires a CLT reconstruction number to compare against, not just Crosscoder/MLSAE.
  • If public CLT weights for Pythia-410m exist at evaluation time, use them. Otherwise replicate the Ameisen et al. 2025 recipe at a scoped-down size on the same activations: match the dictionary width used by my method, and stop training once the Pareto frontier is saturated at the sparsity levels I will report.
  • Run two CLT variants: (a) native CLT writing into MLP outputs, and (b) a residual-stream-write variant whose decoder contributions are summed into $\Delta_\ell$ rather than into the MLP-block output. Per proposal §4.2, native CLT targets MLP outputs while my decomposition targets $\Delta_\ell$ (which folds in attention), so reporting both variants is what makes the comparison scope-honest — the "residual-stream-write variants" language in §4.2 points to exactly this.
  • Metrics: MLP-output reconstruction MSE for (a), $\Delta_\ell$ reconstruction MSE for (b), L0 sparsity, and a Pareto curve across L1 penalties for each variant. Report the parameter count of each variant alongside the metrics so the shared-kernel compression ratio relative to CLT's $O(L^2)$ budget is visible.
  • If compute prevents training even the scoped-down CLT before the Month-4 deadline, record this as an explicit limitation in the paper rather than silently dropping the comparison; frame Baseline B (Crosscoder / MLSAE) as the closest feasible proxy, since Crosscoder shares CLT's per-layer-decoder structure without the depth-sharing prior and lives in the same family for this comparison.
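
To make the compression-ratio column concrete, a rough parameter-count sketch, under the assumption that the replicated CLT carries one [M, D] decoder block per (source layer, target layer) pair with target at or above source (the $O(L^2)$ budget §4.2 refers to), while the shared-kernel decomposition carries K+1 lag matrices plus per-layer biases. The M and K values below are placeholders, not chosen hyperparameters:

```python
def clt_decoder_params(L, M, D):
    # Assumed CLT layout: one [M, D] decoder block per (source layer, target layer) pair
    # with target >= source, i.e. L*(L+1)/2 blocks -- the O(L^2) budget.
    return (L * (L + 1) // 2) * M * D

def shared_kernel_params(L, M, D, K):
    # Branch A decoder side: K+1 shared lag matrices W_0..W_K of shape [D, M],
    # plus a per-layer bias b_l in R^D. Decoder-side parameters only.
    return (K + 1) * M * D + L * D

# Illustrative only: Pythia-410m has L = 24 layers and D = 1024; M and K are placeholders.
if __name__ == "__main__":
    L, M, D, K = 24, 16384, 1024, 4
    print(clt_decoder_params(L, M, D) / shared_kernel_params(L, M, D, K))
```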

My method. The decomposition from Phase 3, at the sparsity points chosen on its Pareto curve.

Headline plot. Pareto curve of L0 vs. $\Delta_\ell$ reconstruction MSE, one curve per method, averaged across layers (and shown per layer in the appendix).
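
A sketch of how I intend to assemble the headline figure, assuming each method's sweep has been collected as (L0, per-layer MSE) points (the dictionary layout is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pareto(results, out_path="pareto_delta_mse.pdf"):
    """results: {method_name: [(l0, per_layer_mse), ...]} with one point per L1 penalty,
    where per_layer_mse is an array of Delta_l reconstruction MSEs, one entry per layer."""
    fig, ax = plt.subplots(figsize=(5, 4))
    for method, points in results.items():
        points = sorted(points, key=lambda p: p[0])           # order by L0
        l0 = [p[0] for p in points]
        mse = [float(np.mean(p[1])) for p in points]          # average across layers
        ax.plot(l0, mse, marker="o", label=method)
    ax.set_xlabel("L0 (mean active features per example)")
    ax.set_ylabel(r"$\Delta_\ell$ reconstruction MSE (layer-averaged)")
    ax.legend()
    fig.savefig(out_path, bbox_inches="tight")
```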

Step 2 — Stability and robustness metrics

For each method, report:

  • Seed stability. Train 3 seeds and compute feature matching between the top-100 features across seeds (max cosine similarity on decoder directions; see the sketch after this list). My method's number must be in the write-up even if it is worse than the baselines; hiding it is disqualifying.
  • Dataset stability. Same, but across two disjoint data halves.
  • Cross-class generalisation. Features trained on geography relations: how well do they reconstruct date-relation updates? (This mirrors Phase 2 Step 4.)
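
A minimal sketch of the seed-stability number, assuming decoder matrices are available as [features, d_model] arrays whose rows are already ordered by whatever activity criterion defines "top-100" (that ordering choice is an assumption here, not fixed above):

```python
import numpy as np

def seed_stability(dec_a, dec_b, top_k=100):
    """Mean max cosine similarity between seed A's top-k decoder directions and seed B's dictionary.

    dec_a, dec_b: [n_features, d_model] decoder matrices from two seeds;
    rows of dec_a are assumed pre-sorted so the first top_k are the top-k features.
    """
    a = dec_a[:top_k]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                   # [top_k, n_features] cosine similarities
    best = sims.max(axis=1)          # best match in seed B for each of seed A's top-k features
    return float(best.mean()), best  # headline number plus the per-feature distribution
```

The dataset-stability number is the same computation run on the dictionaries trained on the two disjoint data halves.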

Step 3 — The causal existence proof (the critical result)

The headline scientific claim of the project is not "my method reconstructs better." It is: "my method recovers features that other methods do not, and those features have testable causal effects on model behaviour." The rest of the comparisons are context; this is the thing that must work.

Procedure.

  1. Find candidate features. Enumerate every feature $f$ in my decomposition. Under Branch A / C (convolutive, possibly segmented) this is each (source-layer $\ell'$, latent-index) pair from the per-layer source dictionaries — each such pair is a distinct feature because the source $s_{\ell'}$ and the source $s_{\ell''}$ for $\ell' \ne \ell''$ live in separate dictionaries even when they share the same latent index. The total Branch A/C candidate count is $L \cdot M$ for Branch A (uniform $M$ enforced by Phase 3 Step 2) and $\sum_{\text{seg}} L_{\text{seg}} \cdot M^{(\text{seg})}$ for Branch C (per-segment $M^{(\text{seg})}$ allowed by Phase 3 Step 2; collapses to $L \cdot M$ when all segments share width). Under Branch B (state-space) a feature is a (source layer $\ell^*$, input direction $m \in \lbrace 1, \ldots, M_{\ell^*}\rbrace$) pair — the $m$-th coordinate of the layer-$\ell^*$ input $u_{\ell^*} \in \mathbb{R}^{M_{\ell^*}}$ — paralleling Branch A's (source layer $\ell'$, latent index $m$). A feature's contribution reaches $\Delta_\ell$ through two routes: directly at the home layer via the feedthrough $D_{\ell^*} u_{\ell^*}$, and indirectly at later layers via $B_{\ell^*}$ into $z_{\ell^*+1}$ and $A$-propagation onward through $C_\ell z_\ell$. Both routes are zeroed automatically by setting $u_{\ell^*}[m] = 0$ and running the SSM forward (see Step 3 item 4). State-only directions — directions in $z_\ell$ not expressible as a single input coordinate — are captured implicitly: every $z_\ell$ is a function of past inputs via $A$ and $B$, so zeroing the input that drove a given state feature removes the state-side contribution too. The total Branch B candidate count is $\sum_{\ell^* = 0}^{L-1} M_{\ell^*}$ — equal to $L \cdot M$ when widths are uniform across layers (the default under SAE-init recipes that match a single public SAE width), and equal to $\sum_{\ell^*} M_{\ell^*}$ otherwise. Reporting the unified novelty fraction. When all candidate counts collapse to $L \cdot M$ (uniform-width Branches A, B, and C), the paper can report a single unified novelty fraction across branches; when widths vary (Branch B with per-layer $M_{\ell^*}$ or Branch C with per-segment $M^{(\text{seg})}$), normalise by the actual candidate count and flag the per-branch denominator in the headline table. Then match each feature against every per-layer SAE, crosscoder, and MLSAE feature (Baselines A and B): (a) cosine similarity of decoder directions, and (b) Pearson correlation of the per-example activation patterns on the evaluation set. Decoder-match pre-registration. A feature in my method has multiple decoder directions — for Branch A/C, $W_0[:, m], W_1[:, m], \ldots, W_K[:, m]$ (one per lag, so one per target layer $\ell \in \lbrace \ell', \ldots, \min(\ell'+K, L-1)\rbrace$); for Branch B, the home-layer feedthrough column $D_{\ell^*}[:, m]$ plus, at each $\ell \gt \ell^*$, the composed column $C_\ell \cdot \left(\prod_{\ell^*\lt k\lt\ell} A_k\right) \cdot B_{\ell^*}[:, m]$ — read $\prod$ here under the decreasing-index left-to-right convention established in Phase 3 Step 1's Branch B paragraph, so the explicit composition is $C_\ell A_{\ell-1} A_{\ell-2} \cdots A_{\ell^*+1} B_{\ell^*}[:, m]$ (empty product = identity at $\ell = \ell^*+1$). Baselines have one decoder direction per (feature, layer). 
For each (my feature, target layer $\ell$) pair where the contribution is nonzero, compute cosine similarity against every (baseline feature, same target layer $\ell$) decoder; then take the max across all target layers as the feature's criterion-(a) match score. This is the strictest version: a feature is a duplicate if any of its target-layer decoders aligns with any baseline's decoder at the same target layer. Pre-register this "max-across-target-layers" rule before touching evaluation data — do not fall back to "home layer only" or "whichever layer gives the lowest match" at evaluation time. Define a "novel" feature as one whose maximum match against every non-CLT baseline feature falls below a threshold on both criteria. The thresholds are one-sided hypothesis tests of "this feature is a duplicate" against the null "this feature matches baselines at the level a random feature would," and they are pre-registered before the evaluation data is touched. One null per criterion:

    • Decoder-cosine null. The null must mimic the real-feature scoring procedure exactly, otherwise the threshold is calibrated against a different search space than the candidates it is applied to. For the candidate feature being tested, let $\mathcal{T} = \lbrace \ell^{\text{(min)}}, \ldots, \ell^{\text{(max)}}\rbrace$ be the set of target layers where the candidate has a nonzero decoder direction (Branch A/C: $\lbrace \ell', \ldots, \min(\ell'+K, L-1)\rbrace$; Branch B: $\lbrace \ell^*, \ldots, L-1\rbrace$). Draw ~1,000 synthetic-feature null draws. Each draw consists of $|\mathcal{T}|$ independent unit vectors uniformly distributed on the sphere in $\mathbb{R}^D$, one per target layer in $\mathcal{T}$. For each draw, compute the max-across-target-layers score by exactly the rule item 1's pre-registration uses on real features: for each $\ell \in \mathcal{T}$, take the max cosine of the draw's $\ell$-th random vector against every non-CLT baseline decoder at the same target layer $\ell$; then take the max across target layers in $\mathcal{T}$. (Cosine similarity is scale-invariant, so the random vectors do not need norm-matching — only direction-matching, which the unit-sphere uniform draws give automatically.) Set the decoder-cosine threshold at the 95th percentile of the resulting per-draw scores — a real feature whose max-cosine is above the threshold matches some baseline more strongly than a structure-matched random feature would (rejecting novelty); a real feature whose max-cosine falls at or below the threshold is consistent with the structure-matched random-direction null (novel). The polarity matters: genuinely novel features should have max-cosines indistinguishable from random, not unusually low. Because the search-space size depends on $|\mathcal{T}|$, candidates with different home layers get slightly different thresholds; record the per-candidate $|\mathcal{T}|$ and threshold in the evaluation table so the comparison is auditable. Direction of the bias. $|\mathcal{T}|$ is monotonic in how many target layers a feature reaches: under Branch A/C, $|\mathcal{T}| = \min(K+1, L-\ell')$, so features rooted at $\ell' \le L-K-1$ get the full $K+1$ targets and features rooted near the top get fewer; under Branch B, $|\mathcal{T}| = L - \ell^*$, so $\ell^* = 0$ gets $|\mathcal{T}| = L$ and $\ell^* = L-1$ gets $|\mathcal{T}| = 1$. The 95th-percentile threshold rises monotonically with $|\mathcal{T}|$ (more independent random draws get maxed → larger expected max), so early-source features face a more lenient novelty bar than late-source features. The threshold scaling correctly compensates for type-I error — under the null, both early and late features get called novel at the calibrated 95% rate — but it does not compensate for the test's power to detect non-novelty: a real non-novel feature whose max-cosine reflects a single genuine baseline match must clear a higher threshold at early layers than at late layers, so non-novel features at early layers are more likely to escape detection and be falsely called novel. The observed novelty rate is therefore inflated at early home layers relative to late, and the per-feature novelty rate is not directly comparable across home layers. Report novelty fraction stratified by home layer in the appendix and flag this bias in the limitations section, so a reader does not interpret a higher absolute novelty rate at early source layers as evidence the method is finding more new features there.
    • Activation-pattern null. Draw ~1,000 samples by taking one of my method's feature activation patterns (a vector of $N$ activation values, one per evaluation example) and shuffling it across the $N$ examples (preserves the marginal activation distribution over examples, destroys the example-level correspondence that would let a baseline's activation pattern line up with this one). For each draw, compute the max Pearson correlation against every non-CLT baseline's activation pattern. Set the activation-pattern threshold at the 95th percentile of the resulting max-correlation distribution — a real feature whose max-correlation is above the threshold matches some baseline more strongly than the null would allow (rejecting novelty); below or at threshold → novel under this criterion.

    A feature is novel iff its max-match is at or below threshold on both criteria. Log both thresholds and the resulting candidate count; do not retune either threshold to control the candidate count, and do not fall back to the legacy 0.5/0.5 pair if the null-derived thresholds produce zero candidates — that outcome is itself the evaluation result. Decoder-direction match catches same-direction features, activation-pattern match catches same-when-it-fires features, and requiring both to be low is the stricter, standard-practice definition. (A calibration sketch for the decoder-cosine null appears after this procedure list.)

    Why CLT is excluded from this matching. CLT (Baseline D) is a feature-producing method and would be a natural entry in the matching set, but the proposal §4.2 positions my decomposition as a shared-kernel reparameterization of CLT by design: under H2 both methods place sources at every source layer, both read from the residual stream, and they should find mostly the same features. CLT overlap is the predicted outcome of the hypothesis, not a novelty failure. Counting CLT matches against novelty here would set an impossible bar for features that by construction should agree. The CLT comparison lives in Step 1 Baseline D — parameter parsimony at comparable reconstruction — and in the discussion section of the paper, not in this novelty test.

  2. Filter by consistency. A novel feature must also be seed- and dataset-stable (Step 2 stability metrics). A feature that only appears in one seed is noise.

  3. Choose specific candidates. From the stable novel features, find one that activates most strongly on a specific, semantically coherent set of factual-recall examples — e.g., "geography-of-European-capitals." Read the top-activating examples and write down a human-legible description. This is still interpretation-by-inspection, but it gives me a prediction: if this feature represents $X$, then ablating it should break recall of $X$ and not unrelated things.

  4. Ablate. Run the model with the feature's estimated contribution subtracted from the residual stream (the mechanism is subtraction of an estimated contribution, not literal zero-replacement of the real $\Delta_\ell$ — the decomposition is approximate, so zero-replacing would also zero the reconstruction error). The exact recipe depends on the Phase 3 branch taken:

    • Branch A (stationary convolutive) or Branch C (segmented): at inference, re-solve the Phase 3 Step 3 LASSO source update on this example with the frozen trained decoder (Branch A: kernel $\lbrace W_k\rbrace$ shared across depths plus per-layer biases $\lbrace b_\ell\rbrace$; Branch C: per-segment kernel $\lbrace W_k^{(\text{seg})}\rbrace_{\text{seg}}$ plus the same per-layer biases $\lbrace b_\ell\rbrace$ as Branch A — biases stay per-layer in Branch C, only the kernel is per-segment, per Phase 3 Step 1's Branch C inheritance from Branch A) — not the SAE encoder alone. Per Phase 3 Step 3, this LASSO is a single coupled solve over the entire stacked source vector $(s_0, s_1, \ldots, s_{L-1})^{(i)}$ per example — it does not solve for $s_{\ell'}$ alone. The home-layer source $s_{\ell'}$ that contains the candidate's coordinate $m$ is one block of the joint solution; the rest of the stack is solved alongside (and frozen at the same fixed point the decoder was trained against). The LASSO is the path the decoder was trained against; the SAE encoder is cheaper but drifts away from the LASSO fixed point during alternating minimisation, and that drift propagates into the feature contribution estimate. Run the convolutive decoder forward twice — once with the unedited source (feature firing naturally) and once with the candidate's coordinate zeroed — to obtain two reconstructed update streams labelled by which version contains the feature: $\lbrace \hat\Delta_\ell^{\text{natural}}\rbrace$ (feature intact) and $\lbrace \hat\Delta_\ell^{\text{ablated}}\rbrace$ (feature removed). Define the estimated per-layer feature contribution with a fixed "with-feature minus without-feature" sign convention: $\delta_\ell \equiv \hat\Delta_\ell^{\text{natural}} - \hat\Delta_\ell^{\text{ablated}}$ (i.e. "what the feature added to the baseline reconstruction"). During the model's actual forward pass, subtract $\delta_\ell$ from the real $\Delta_\ell$ at every layer $\ell \in \lbrace \ell', \ldots, \min(\ell'+K, L-1)\rbrace$ where the feature contributes, so the model sees the estimated feature contribution removed from the residual stream while the rest of $\Delta_\ell$ — including the decomposition's reconstruction error — remains untouched.
    • Branch B (state-space): at inference, compute the input sequence $\lbrace u_\ell\rbrace$ for this example by re-solving the Branch B input update (the coupled LASSO from Phase 3 Step 1) with the frozen trained parameters $\lbrace A_\ell, B_\ell, C_\ell, D_\ell, b_\ell, z_0\rbrace$ — same rationale as the Branch A case (SAE-encoder-only would drift from the trained fixed point). Run the SSM forward twice — once with the unedited input sequence (feature firing naturally) and once with $u_{\ell^*}[m]$ zeroed at its home layer $\ell^*$ — to obtain the labelled streams $\lbrace \hat\Delta_\ell^{\text{natural}}\rbrace$ (feature intact) and $\lbrace \hat\Delta_\ell^{\text{ablated}}\rbrace$ (feature removed). The SSM propagates the ablation automatically: $u_{\ell^*}[m]$ enters $\hat\Delta_{\ell^*}^{\text{natural}}$ directly via the feedthrough $D_{\ell^*}$, and enters $\hat\Delta_\ell^{\text{natural}}$ for $\ell \gt \ell^*$ through the state chain $B_{\ell^*} u_{\ell^*} \to z_{\ell^*+1} \to \ldots \to z_\ell \to C_\ell z_\ell$. Define $\delta_\ell$ with the same sign convention as Branch A: $\delta_\ell \equiv \hat\Delta_\ell^{\text{natural}} - \hat\Delta_\ell^{\text{ablated}}$, nonzero for $\ell \ge \ell^*$. Subtract $\delta_\ell$ from the real $\Delta_\ell$ at every layer $\ell \ge \ell^*$ during the model's forward pass, same splice pattern as the Branch A recipe.

    For the 100 examples where the feature activates most strongly and 100 control examples, measure the top-1 probability of the correct answer before and after ablation. Control-set pre-registration. Control examples are sampled uniformly from examples where the LASSO-derived activation at the home layer (Branch A/C: $s_{\ell'}[m]$; Branch B: $u_{\ell^*}[m]$) is exactly zero. If fewer than 100 zero-activation examples are available — an L1 threshold too soft to zero most non-firing examples — fall back to the 100 examples with activation magnitude below the 5th percentile of non-zero activations. Log which rule was used in the paper's methods/appendix section; do not resample controls to hit a target behavioural effect. What this test does and doesn't do. On the primary (exact-zero) branch, the ablation recipe gives $\delta_\ell = 0$ by construction (zeroing an already-zero coordinate is a no-op), so the control effect is trivially zero — this is a mechanism-sanity check that the subtract-$\delta_\ell$ splice preserves predictions when the feature is truly silent, not a specificity claim on its own. On the 5th-percentile fallback branch, coordinates are non-zero but small, so ablation removes small contributions and the threshold does real work. The real specificity claim — "ablating the candidate damages performance more than ablating other features of matched magnitude" — is carried by the random-feature test in item 5, not by this control test.

  5. Compare to a matched random-feature ablation. Draw $N_{\text{rand}} = 10$ random features from the full decomposition dictionary, each satisfying the magnitude-matching constraint: its mean activation magnitude on the candidate's top-100 examples matches the candidate's own mean activation magnitude on those same examples (pre-register: mean of $|s_{\ell'}[m]|$ or $|u_{\ell^*}[m]|$ across the 100 examples, with a tolerance of ±20%). $N_{\text{rand}}$ is named distinctly from the kernel length $K$ used in items 1 and 4 to avoid the symbol collision; the random-feature count and the kernel length are unrelated quantities. Feasibility fallback. For a strongly-specific candidate, the candidate's top-100 may be narrow enough that no random feature in the dictionary fires there with matched magnitude. If fewer than $N_{\text{rand}}$ random features meet the ±20% tolerance after searching the full dictionary, widen to ±50% and record the relaxation explicitly in the paper. If still fewer than $N_{\text{rand}}$, report the random-feature test as "not applicable" for this candidate and flag that the magnitude-matched null is unreachable — do not silently pass the test on fewer draws and do not keep widening the tolerance. N/A handling in the reporting standard. When the random-feature test is N/A, skip this criterion and require the other three (large effect, controls preserved, symmetric activation) to still hold; mark the candidate's positive-result determination as "conditional on the random-feature test being unreachable" and note this explicitly in the paper. The reasoning: the random-feature test rules out "any magnitude-matched feature has the same effect," so if no matched features exist, there is nothing to rule out — the candidate's specificity is too extreme for this null rather than failing it. Do not promote N/A to an automatic pass of the criterion, and do not demote it to an automatic fail. Ablate each of the $N_{\text{rand}}$ random features on the same 100 high-activation examples from item 4 (the ones where the candidate fires most strongly, not a fresh top-100 for the random feature), and report the distribution of top-1 probability drops across the $N_{\text{rand}}$ ablations — mean and 95th percentile. The random-feature comparison in the reporting standard uses the mean across the $N_{\text{rand}}$ draws, not a single draw, because a single draw is too noisy (one unlucky random feature that happens to matter for the top-100 examples could cross the threshold even when the candidate is specific). Matching on the candidate's own top-100 (rather than on full-dataset averages) is the right null: a random feature that only fires elsewhere would trivially have no effect on these examples and wouldn't be a meaningful control.

  6. Compare to activation. Symmetrically, "activating" the feature on examples where it is normally silent — does it shift model predictions in the predicted direction? Silent example set. Reuse the 100 control examples from item 4 (those with zero LASSO activation at the home layer) as the silent example set; do not redraw a separate silent pool. Source/input computation. Compute the silent-baseline source vector (Branch A/C) or input sequence (Branch B) for each silent example by re-solving the LASSO with the frozen trained decoder, as in item 4 — do not use the SAE encoder alone, for the same drift-avoidance reason. On the primary (exact-zero) branch of item 4's control set, the candidate's coordinate will come out at zero (that's what makes the example silent). On the 5th-percentile fallback branch (where the LASSO doesn't zero most non-firing coords), the candidate's coordinate is small but non-zero — in that case, explicitly zero the candidate's coordinate in the silent-baseline source/input before computing $\hat\Delta_\ell^{\text{silent}}$, so that the activation test measures "what injecting $\bar a$ from zero adds" rather than "the difference between $\bar a$-activation and a tiny non-zero baseline." $\delta_\ell$ is then genuinely the contribution of a from-zero injection. Activation procedure. Compute the candidate's mean activation magnitude $\bar a$ on its top-100 firing examples (same quantity used for magnitude-matching in item 5). On each silent example, set the feature's coordinate ($s_{\ell'}[m]$ for Branch A/C, $u_{\ell^*}[m]$ for Branch B) to $\bar a \cdot \text{sgn}(\bar a_{\text{signed}})$ where $\bar a_{\text{signed}}$ is the candidate's mean signed activation on its top-100 (distinct from $\bar a$, which is the mean of $|\cdot|$). For ReLU SAE encoders (the default under Pythia SAE checkpoints and most Gemma Scope releases) all activations are non-negative, so $\text{sgn}(\bar a_{\text{signed}}) = +1$ and the injection is simply $+\bar a$. For non-ReLU SAEs (e.g. top-$k$ SAEs, where latents can be either sign), $\bar a_{\text{signed}}$ can be negative and the injection flips sign accordingly — this captures the direction of the feature's "typical firing" rather than its magnitude alone. Pre-register the mean-signed-activation rule before the activation test is run; do not switch to per-example sign matching (which would be a different test). Run the convolutive decoder or SSM forward twice — once with the silent (unedited) source/input and once with the activated source/input — to obtain $\lbrace \hat\Delta_\ell^{\text{silent}}\rbrace$ (feature absent) and $\lbrace \hat\Delta_\ell^{\text{activated}}\rbrace$ (feature injected). Using the same "with-feature minus without-feature" sign convention as item 4, the feature contribution is $\delta_\ell \equiv \hat\Delta_\ell^{\text{activated}} - \hat\Delta_\ell^{\text{silent}}$ — "what the injected feature adds," still a positive contribution. Add $\delta_\ell$ to the real $\Delta_\ell$ at every layer where the feature contributes. The ablation splice in item 4 subtracts a feature contribution to remove it; this activation splice adds a feature contribution to inject it. Same definition of $\delta_\ell$ (with-feature minus without-feature), opposite splice sign depending on whether we're removing or injecting.
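
The easiest place to quietly break the pre-registration above is the decoder-cosine null in item 1: the null draws must be scored by exactly the same max-across-target-layers rule as real candidates, with a per-candidate threshold that depends on $|\mathcal{T}|$. A minimal calibration sketch, assuming the non-CLT baseline decoders have been grouped by target layer and unit-normalised (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the calibration is reproducible

def max_cosine_vs_baselines(dirs_by_layer, baseline_decoders_by_layer):
    """dirs_by_layer: {target_layer: [d_model] direction} for one candidate or one null draw.
    baseline_decoders_by_layer: {target_layer: [n_baseline_features, d_model]} with unit-normalised
    rows, non-CLT baselines only. Use np.abs on the cosines if matching up to sign is the chosen convention."""
    best = -1.0
    for layer, v in dirs_by_layer.items():
        v = v / np.linalg.norm(v)
        best = max(best, float(np.max(baseline_decoders_by_layer[layer] @ v)))  # max over features at this layer
    return best                                                                 # then max over target layers

def decoder_cosine_threshold(target_layers, baseline_decoders_by_layer, d_model,
                             n_draws=1000, pct=95.0):
    """Per-candidate null threshold: |T| random unit vectors per draw, one per target layer,
    scored by the same max-across-target-layers rule as real candidates."""
    scores = []
    for _ in range(n_draws):
        draw = {}
        for layer in target_layers:
            g = rng.standard_normal(d_model)
            draw[layer] = g / np.linalg.norm(g)       # uniform direction on the sphere
        scores.append(max_cosine_vs_baselines(draw, baseline_decoders_by_layer))
    return float(np.percentile(scores, pct))          # a real feature above this is a duplicate, not novel
```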

Reporting standard. Report the ablation effect, the control-examples effect, the random-feature effect, and the activation effect together. Pre-registered numerical thresholds for a positive result (all four must hold; a check sketch follows below):

  • Large effect on relevant examples. Mean top-1 probability of the correct answer drops by at least $0.2$ absolute on the candidate's top-100 examples, with a paired $t$-test $p \lt 0.01$ against the unablated baseline (Bonferroni-corrected across the number of candidate features ablated).
  • Controls preserved (mechanism-sanity check). Mean top-1 probability drops by at most $0.05$ absolute on the 100 control examples (those with zero LASSO activation, per item 4's control-set pre-registration). On the primary exact-zero branch this is trivially satisfied — the check exists to catch numerical/implementation bugs in the ablation splice, not to demonstrate specificity. Specificity is established by the random-feature ablation below, not by this bullet.
  • Small effect of random ablation. Mean top-1 probability drop across the $N_{\text{rand}} = 10$ magnitude-matched random-feature ablations (per item 5) is at most $0.05$ absolute on the same candidate top-100; the 95th-percentile random-feature drop is also reported alongside for transparency but the threshold is on the mean. Caveat on the 95th-percentile column. With $N_{\text{rand}} = 10$ the reported "95th percentile" is essentially the maximum draw (interpolated between the 9th and 10th order statistics) rather than a stable quantile estimate. Treat it as a worst-case-of-10 sanity number, not a reliable tail statistic; if it disagrees sharply with the mean, that is a noise-in-the-null signal, not evidence of a heavy-tailed effect distribution. Increase $N_{\text{rand}}$ before reporting tail statistics if the disagreement matters for the conclusion.
  • Consistent symmetric activation. Activating the feature (per item 6) on the 100 silent examples (item 4's control set) increases the log-probability of the semantic token(s) associated with the feature's description by at least $1.0$ nat (natural-log units) on at least 70% of those 100 examples (i.e. at least 70 examples). The semantic token set is pre-specified from item 3's human-legible description before the activation test is run — e.g. for "European capitals," the token set is { " Paris", " London", " Berlin", " Madrid", " Rome", ...} (with leading spaces, because the natural-language context "The capital is …" puts those tokens in the with-space position under Pythia's GPT-NeoX tokeniser). Tokenisation rule (pre-registered). Restrict the semantic token set to strings that exist as single tokens in the model's tokeniser, with leading space. If a concept has no single-token realisations under the tokeniser (e.g. "San Francisco" is multi-token in GPT-NeoX) or fewer than five single-token realisations, pick a different concept during item 3 — do not fall back to summing multi-token log-probabilities, which would introduce a per-candidate normalisation choice the pre-registration doesn't fix. The metric is the max log-probability over the token set. Using log-prob of a pre-specified token set rather than "top-1 shifts to the expected answer" avoids the discretisation problem (the concept may show up as a near-miss that doesn't cross the top-1 boundary) and the vagueness of "predicted direction."

Any result failing one or more of these thresholds is a null result and must be reported as such. Do not relax the thresholds post hoc to salvage a marginal outcome — the "If no novel-and-causal feature is found" paragraph below is the honest landing place.
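
A minimal sketch of how the four thresholds would be checked over the collected measurements (array names are illustrative; the N/A branch follows item 5's rule of skipping the random-feature criterion and flagging the determination as conditional):

```python
import numpy as np
from scipy.stats import ttest_rel

def positive_result(p_before_top100, p_after_top100,   # top-1 prob of the correct answer, candidate's top-100
                    p_before_ctrl, p_after_ctrl,       # same quantities on the 100 control examples
                    random_drops,                      # per-draw mean drops from the N_rand matched ablations, or None if N/A
                    activation_gain_nats,              # per-example log-prob gain on the 100 silent examples
                    n_candidates_tested=1):
    drop_top100 = float(np.mean(np.asarray(p_before_top100) - np.asarray(p_after_top100)))
    # paired t-test against the unablated baseline; one- vs two-sided is itself a pre-registration choice
    _, p_val = ttest_rel(p_before_top100, p_after_top100)
    alpha = 0.01 / n_candidates_tested                 # Bonferroni across candidate features ablated
    large_effect = drop_top100 >= 0.2 and p_val < alpha

    controls_ok = float(np.mean(np.asarray(p_before_ctrl) - np.asarray(p_after_ctrl))) <= 0.05

    if random_drops is None:                           # item 5 declared N/A: skip the criterion, flag as conditional
        random_ok, conditional = None, True
    else:
        random_ok, conditional = float(np.mean(random_drops)) <= 0.05, False   # threshold is on the mean

    activation_ok = float(np.mean(np.asarray(activation_gain_nats) >= 1.0)) >= 0.70

    checks = [large_effect, controls_ok, activation_ok] + ([] if random_ok is None else [random_ok])
    return {"large_effect": large_effect, "controls_ok": controls_ok, "random_ok": random_ok,
            "activation_ok": activation_ok, "conditional": conditional, "positive": all(checks)}
```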

If no novel-and-causal feature is found. This is the honest outcome to report. The method recovers features, but none of them are genuinely new in the sense this test requires. Phase 4 then writes up: (i) the characterisation from Phases 1–2, (ii) the decomposition as a compression / interpretability comparison study, (iii) an explicit statement that the existence proof test failed. This is still a paper — a less exciting one, but an honest one.

Step 4 — Paper structure

Draft outline (tentative; the real outline depends on which branch of Phase 3 was taken and whether the causal test succeeded):

  1. Introduction. Superposition → per-layer SAE → the cross-layer question. Position the contribution.
  2. Related work. Per-layer SAE, crosscoders, MLSAE, CLT, JSAE, Tuned Lens. Motivation-only paragraph on EMG / blind source separation, kept to a page maximum.
  3. Method — Phase 1 findings. Ridge, CCA, AR, Granger, null tests, context perturbation. Headline figures: $R^2$ heatmap and Granger profile with null comparison.
  4. Method — Phase 2 findings. Stationarity, effective depth, cross-class and cross-model generalisation.
  5. Method — the decomposition. Specification, optimisation, hyperparameter sweep.
  6. Evaluation. Baseline Pareto curves, stability, causal existence proof.
  7. Limitations. Honest section: nonlinear layers break Toeplitz-as-derivation, factual recall is a specific circuit family, results may not transfer to reasoning, SAE-initialised per-layer sources $\lbrace s_{\ell'}\rbrace$ bias the result toward SAE-findable features.
  8. Discussion. What the Granger profile and the effective depth say about how information propagates through transformer depth. Connection to CLT and circuit-tracing work.
  9. Conclusion. Short.

Step 5 — Code release

A clean GitHub repo that reproduces the headline figures from raw activations: a README.md with setup instructions, a scripts/reproduce.sh that runs the full pipeline end-to-end on a toy dataset, and a notebook for exploring the feature dictionary. This is the thing future mentees will read in order to replicate the work — it must actually work on a fresh install.

Step 6 — Timeline within Month 4

  • Week 13: Baselines and Pareto curves (Steps 1 and 2).
  • Week 14: Causal existence proof (Step 3). The hardest week — this is where the result is made or unmade.
  • Week 15: Paper draft (Step 4). All sections to rough-complete.
  • Week 16: Figures finalised, code release prepared, internal review, revisions, submit to preprint / workshop venue.

Step 7 — What "done" looks like

  • Paper draft circulated for review at least one week before the four-month endpoint.
  • All figures regenerate from saved intermediates via scripts/reproduce.sh.
  • Code release public under a permissive license (MIT or Apache-2.0), with a model card documenting the activations, datasets, and seed configurations used.
  • A short retrospective (internal): what worked, what didn't, what I would do differently. This becomes the first memory for the next iteration of the research direction, if there is one.