perf: LayerWorkspace arena allocator for zero-allocation forward passes (#1014)#1083
perf: LayerWorkspace arena allocator for zero-allocation forward passes (#1014)#1083
Conversation
… passes Step 1 of #1014: Core arena infrastructure. - ForwardArena<T>: bump-pointer allocator that pre-allocates Tensor<T> objects grouped by shape, dishes them out via array index increment (O(1), zero syscalls), and resets cursors at end of forward pass. Zero GC pressure. - LayerWorkspace<T>: per-layer facade with named buffer declarations (DeclareTimestepBuffer/DeclareSequenceBuffer) and shape-aware pre-sizing. Handles variable batch sizes and sequence lengths. - ShapeKey: value-type struct with FNV-1a hash for O(1) dictionary lookups. - LayerBase<T>: added protected Workspace property for opt-in arena usage. Next: migrate RWKV7Block (84 allocations -> 0) as proof of pattern. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocation Benchmark results (RWKV7Block allocation pattern, batch=1, seqLen=32): | Method | Mean | Allocated | vs Baseline | |---------------------|------------|-----------|-------------| | new Tensor (raw) | 46,509 ns | 199 KB | 1.0x | | ForwardArena.Rent | 107,854 ns | 0 B | 0.4x | | LayerWorkspace | 200 ns | 0 B | 232x faster | | Single new Tensor | 171 ns | 584 B | | | Single Arena.Rent | 453 ns | 0 B | | | Single Workspace | 2.2 ns | 0 B | 77x faster | LayerWorkspace achieves zero allocation via index-based pre-allocated tensor array lookup (2.2ns per access). ForwardArena is slower than raw allocation due to Span.Clear() overhead — workspace is the production target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…inated) TimeMixingForward hot path: - 8 sequence buffers (allR, allK, allV, allA, allB, allWkv, allWkvPreGate, allWkvGated) now pre-allocated via Workspace.Sequence() - 5 per-timestep buffers (rInput, kInput, vInput, aInput, bInput) now pre-allocated via Workspace.Timestep() — reused across seqLen iterations Index-based buffer access: const int indices for O(1) array lookup (2.2ns). 71 allocations remain in deeper methods — continuing migration. Part of #1014. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All forward-path hot loop allocations eliminated: - TimeMixingForward: 8 sequence buffers + 6 timestep buffers = 14 via workspace - ChannelMixingForward: 4 sequence buffers + 3 timestep buffers = 7 via workspace - Total: 20 forward-path allocations → 0 (84 → 64 remaining, all cold-path/backward) Remaining 64 allocations are: - 24 constructor parameter init (cold path, one-time) - 24 gradient accumulator init (cold path, one-time) - 2 state fallbacks (conditional, first-call only) - 14 backward pass (training only) Production-safe: uses Ws property with null-check throw instead of null-forgiving operator. No ! operators used. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Deployment failed with the following error: Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit |
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds a bump‑pointer arena ( Changes
Sequence Diagram(s)sequenceDiagram
participant App as Application
participant Layer as RWKV7Block
participant Ws as LayerWorkspace
participant Arena as ForwardArena
participant Alloc as TensorAllocator
App->>Layer: Forward(batchSize, seqLen)
Layer->>Ws: BeginForward(batchSize, seqLen)
Ws->>Arena: EnsureCapacity / Rent shapes (optional)
Arena-->>Ws: Tensors populated in workspace slots
Ws-->>Layer: Workspace tensors ready
loop per-timestep
Layer->>Ws: Timestep(index) / Sequence(index)
Ws-->>Layer: Tensor<T> (preallocated slot)
Layer->>Layer: compute using buffer
end
alt fallback path
Layer->>Alloc: Rent(shape)
Alloc-->>Layer: Tensor<T> (rented)
end
Layer-->>App: Forward result
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR introduces a per-layer “arena/workspace” mechanism intended to eliminate per-forward-pass Tensor<T> allocations in neural-network layer hot paths, and migrates RWKV7Block forward intermediates to use it. It also adds a BenchmarkDotNet benchmark to measure allocation/lookup overhead of the new strategies.
Changes:
- Add
LayerWorkspace<T>(index-based preallocated buffers) andForwardArena<T>(shape-keyed slab allocator) underAiDotNet.Memory. - Add
LayerBase<T>.Workspaceto support opt-in per-layer workspace usage. - Update
RWKV7Blockto reuse workspace buffers for forward intermediates and add an allocation benchmark.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs |
Uses LayerWorkspace<T> to eliminate forward-path intermediate tensor allocations. |
src/NeuralNetworks/Layers/LayerBase.cs |
Adds a protected Workspace property for layers to opt into workspace allocation. |
src/Memory/LayerWorkspace.cs |
New index-based per-layer buffer store for timestep/sequence tensors. |
src/Memory/ForwardArena.cs |
New arena allocator with shape-keyed slabs and cursors. |
AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs |
Adds benchmarks comparing raw allocation vs arena vs workspace lookup. |
Comments suppressed due to low confidence (1)
src/NeuralNetworks/Layers/LayerBase.cs:61
- The XML doc comment here has two opening
tags in a row with only one closing tag, which can produce malformed XML doc warnings (and this repo treats warnings as errors). Remove the extra
/// <summary>so the comment is well-formed.
protected LayerWorkspace<T>? Workspace { get; set; }
/// <summary>
/// <summary>
/// Gets the element-wise activation function for this layer, if specified.
/// </summary>
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs`:
- Around line 119-145: The benchmark Workspace_RWKV7Pattern currently omits the
workspace shape-init call; call _workspace.BeginForward with the correct batch
and sequence lengths inside the measured method so the measured path includes
the fast-path shape check (e.g. call _workspace.BeginForward(BatchSize,
TimestepsPerForward) at the start of Workspace_RWKV7Pattern before any
Sequence/Timestep calls), ensuring BeginForward() is executed on every benchmark
iteration and not only in setup.
In `@src/Memory/ForwardArena.cs`:
- Around line 131-168: The ShapeKey currently only encodes rank and up to four
dimensions, causing different rank>=5 shapes to collide; update ShapeKey
(constructor, GetHashCode, and Equals) to either incorporate all shape elements
into the stored state and FNV-1a hash (e.g., process shape[i] for i from
0..shape.Length-1) and compare all elements in Equals, or explicitly guard by
throwing an exception in the ShapeKey ctor when shape.Length > 4 to fail closed;
ensure the chosen approach updates both the hashing logic (_hash) and equality
logic (Equals(ShapeKey) and Equals(object)) consistently so ShapeKey uniquely
represents the entire shape.
In `@src/Memory/LayerWorkspace.cs`:
- Around line 32-34: The BeginForward method currently trusts batch/sequence
sizes via the fields _lastBatchSize/_lastSeqLen and can silently skip allocation
on first call (e.g., BeginForward(0,0)); modify BeginForward to validate input
parameters (batchSize and seqLen) up-front and either reject zero or negative
sizes by throwing ArgumentOutOfRangeException, or initialize sentinel cached
sizes so the first non-zero call forces allocation; also update any slot
accessor methods (e.g., GetSlot/SetSlot or indexers used for workspace slots) to
validate slot indices and throw ArgumentOutOfRangeException instead of letting
raw array indexing propagate IndexOutOfRangeException; ensure all uses of
_lastBatchSize/_lastSeqLen follow the new validation logic so the fast path
cannot skip required allocation.
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 314-341: RWKV7Block<T> currently inherits LayerBase<T>.Clone()
which does a shallow MemberwiseClone(), causing the cloned instance to share the
same Workspace buffers; override Clone() in RWKV7Block<T> to call
MemberwiseClone() but then re-create a fresh LayerWorkspace<T> and re-run the
workspace declarations (the same sequence of Workspace = new
LayerWorkspace<T>(...) and Workspace.DeclareTimestep/DeclareSequence calls) so
the clone has independent buffers; factor the workspace declaration lines into a
private helper method (e.g., InitializeWorkspace()) and call it from both the
constructor and your new Clone() implementation to avoid duplication.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c0438f16-3cf3-4db6-ade6-d2c82e7e2ddc
📒 Files selected for processing (5)
AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cssrc/Memory/ForwardArena.cssrc/Memory/LayerWorkspace.cssrc/NeuralNetworks/Layers/LayerBase.cssrc/NeuralNetworks/Layers/SSM/RWKV7Block.cs
…ent across 50+ layers Systematic migration of batchSize-shaped tensor allocations from new Tensor<T>() to TensorAllocator.Rent<T>() across all NN layers. Before: 1722 raw allocations in forward paths After: 1498 raw allocations + 332 pooled TensorAllocator.Rent calls Eliminated: 224 per-forward-pass allocations (13% reduction) Key layers migrated: - S4DLayer: 7, S5Layer: 7, Mamba2Block: 12, MegalodonLayer: 7 - LinearRecurrentUnitLayer: 7, S6Scan: 15, ExtendedLSTMLayer: 12 - BASEDLayer: 10, RWKVLayer: 21, TransNormerLLMLayer: 8 - DeltaFormerLayer: 10, HedgehogLayer: 7, RebasedLayer: 8 - Plus 30+ additional SSM/attention/recurrent layers Remaining 1498 allocations are: - Constructor parameter init (weights, biases) — cold path - Gradient accumulators — training setup - Backward pass intermediates — training only - Fixed-shape constants Part of #1014. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… files Minor additions to the batch migration. Running totals for #1014: - NN Layers: 1498 raw + 332 pooled (was 1722 raw + ~100 pooled) - RWKV7Block: 20 hot-path allocations → LayerWorkspace (232x speedup) - All layers: 225 per-forward allocations → TensorAllocator.Rent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 54 out of 54 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs (1)
232-277:⚠️ Potential issue | 🟠 MajorBlocking: rented buffers make current recurrence/output writes non-deterministic.
Two issues:
- Line 258:
stateis read at t=0 without explicit initialization.- Lines 276-277: output uses
+=semantics even though each slot is produced once per head.💡 Suggested fix
- T prevState = state[new[] { bi, di, ki }]; + T prevState = t == 0 + ? NumOps.Zero + : state[new[] { bi, di, ki }]; ... - output[new[] { bi, t, flatK }] = NumOps.Add( - output[new[] { bi, t, flatK }], sum); + output[new[] { bi, t, flatK }] = sum;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs` around lines 232 - 277, The rented per-head state tensor (state) is used before being initialized and the output is being accumulated with += even though each head produces each output slot exactly once; to fix, explicitly zero-initialize the rented state immediately after Rent (the local variable state in the hi loop) and ensure output entries for the current head are either pre-zeroed once before the head loop or assigned (not added) when writing (change the output update in the inner loop that currently does output[...] = output[...] + sum to a direct assignment if you zeroed output per-head, or zero output once before the hi loop and keep the additive update), updating the code around the state allocation and the output write (refer to the variables state, gateVal, kVal, vVal, qVal, and output used in the nested loops).src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs (1)
455-470:⚠️ Potential issue | 🟠 MajorBlocking: aggregation now depends on implicit zeroed memory.
aggregated[...] = aggregated[...] + ...at Lines 468-470 requires a guaranteed zero initial buffer. Please compute into a local accumulator and assign once.💡 Suggested fix
- { - aggregated[new[] { b, kvh, s, d }] = NumOps.Add( - aggregated[new[] { b, kvh, s, d }], - expandedGrad[new[] { b, qh, s, d }]); - } + { + T sum = NumOps.Zero; + for (int g2 = 0; g2 < _headsPerGroup; g2++) + { + int qh2 = kvh * _headsPerGroup + g2; + sum = NumOps.Add(sum, expandedGrad[new[] { b, qh2, s, d }]); + } + aggregated[new[] { b, kvh, s, d }] = sum; + break; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs` around lines 455 - 470, The loop currently increments aggregated[...] in-place and relies on the rented buffer being zeroed; instead, for each combination of b, kvh, s, d compute a local accumulator (e.g. T acc = default(T)) and inside the g loop do acc = NumOps.Add(acc, expandedGrad[new[] { b, qh, s, d }]) and after the g loop assign aggregated[new[] { b, kvh, s, d }] = acc; update the nested loops around variables batchSize, _numKVHeads, _headsPerGroup, _headDimension and keep use of TensorAllocator.Rent and expandedGrad but avoid reading aggregated before assignment.src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)
386-451:⚠️ Potential issue | 🟠 MajorBlocking: recurrent state must not rely on zeroed pooled memory.
Line 446 reads
stateon the first timestep. With rented buffers, explicitly handlet == 0(and seed cached t=0 state if backward expects it).💡 Suggested fix
- T prevS = state[new[] { bi, hi, di, ki }]; + T prevS = t == 0 + ? NumOps.Zero + : state[new[] { bi, hi, di, ki }];🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 386 - 451, The rented tensor "state" is pooled and may contain garbage, but the loop reads previous state at t==0 (prevS = state[...]) — explicitly initialize or seed the recurrent state for t==0 instead of relying on zeros in pooled memory: inside the outer time loop (for int t = 0; t < seqLen; t++) or before it, detect t==0 and either zero-fill the "state" tensor (using NumOps.Zero assignments) or set prevS to NumOps.Zero when t==0 and also write that seeded state into "state" and "allStates" (or initialize allStates[...,0,...]) so downstream code/backward passes see a defined t=0 state; update places that read prevS (the read at prevS = state[new[] { bi, hi, di, ki }]) to use the initialized value.src/NeuralNetworks/Layers/SSM/RWKVLayer.cs (1)
340-353:⚠️ Potential issue | 🟠 MajorRemove the unused
shiftedpass from the hot path.Line 340 rents
shifted, then Lines 341-353 fill it, but nothing ever reads it.rInput,kInput, andvInputare rebuilt fromx_t/xPrevbelow, so this is pure extra work inside the perf-sensitive timestep loop.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RWKVLayer.cs` around lines 340 - 353, The rented temporary tensor "shifted" is never read and should be removed from the hot timestep loop: delete the TensorAllocator.Rent<T>(...) call and all assignments to shifted[new[] { bi, d }] inside the nested loops, and also remove any later references or Return/Dispose calls for "shifted" if present; keep existing computations that build rInput/kInput/vInput from x_t and xPrev and retain use of _timeMixR/_timeMixK/_timeMixV so behavior is unchanged while removing the unused allocation and writes.
♻️ Duplicate comments (1)
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)
314-359:⚠️ Potential issue | 🔴 CriticalGive each
RWKV7Block<T>clone its own workspace buffers.These lines make
Workspacepart of the block’s mutable execution state. IfRWKV7Block<T>still inherits a shallowLayerBase<T>.Clone(), the clone and original will alias the same scratch buffers and cached sequence tensors, so a forward on one instance can corrupt the other. Move the declarations into anInitializeWorkspace()helper and call it from the constructor path and the block-specificClone()override.♻️ Suggested refactor
- // Arena workspace: pre-allocate forward pass buffers for zero-allocation hot path - Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12); - // TimeMixing timestep buffers - Workspace.DeclareTimestep(TsRInput, _modelDimension); - Workspace.DeclareTimestep(TsKInput, _modelDimension); - Workspace.DeclareTimestep(TsVInput, _modelDimension); - Workspace.DeclareTimestep(TsAInput, _modelDimension); - Workspace.DeclareTimestep(TsBInput, _modelDimension); - Workspace.DeclareTimestep(TsWkvOut, _modelDimension); - Workspace.DeclareTimestep(TsYt, _modelDimension); - // ChannelMixing timestep buffers - Workspace.DeclareTimestep(TsCmRInput, _modelDimension); - Workspace.DeclareTimestep(TsCmKInput, _modelDimension); - Workspace.DeclareTimestep(TsCmKSiLU, _ffnDimension); - // TimeMixing sequence buffers - Workspace.DeclareSequence(SqAllR, _modelDimension); - Workspace.DeclareSequence(SqAllK, _modelDimension); - Workspace.DeclareSequence(SqAllV, _modelDimension); - Workspace.DeclareSequence(SqAllA, _modelDimension); - Workspace.DeclareSequence(SqAllB, _modelDimension); - Workspace.DeclareSequence(SqAllWkv, _modelDimension); - Workspace.DeclareSequence(SqAllWkvPre, _modelDimension); - Workspace.DeclareSequence(SqAllWkvGated, _modelDimension); - // ChannelMixing sequence buffers - Workspace.DeclareSequence(SqCmAllRGate, _modelDimension); - Workspace.DeclareSequence(SqCmAllVProj, _modelDimension); - Workspace.DeclareSequence(SqCmAllSiLU, _ffnDimension); - Workspace.DeclareSequence(SqCmAllKProj, _ffnDimension); + InitializeWorkspace();private void InitializeWorkspace() { Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12); // existing DeclareTimestep / DeclareSequence calls... } // Call InitializeWorkspace() from this class's Clone() override too.Verify the clone path. Expected result:
LayerBase<T>still uses a shallow clone andRWKV7Block<T>has no override yet.#!/bin/bash set -euo pipefail echo "LayerBase clone implementation:" fd 'LayerBase.cs$' src --exec sed -n '1,260p' {} echo echo "RWKV7Block clone declarations:" rg -n '\bClone\s*\(' src/NeuralNetworks/Layers/SSM/RWKV7Block.cs || true echo echo "RWKV7Block workspace setup:" sed -n '314,360p' src/NeuralNetworks/Layers/SSM/RWKV7Block.cs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` around lines 314 - 359, The Workspace is currently allocated in the constructor body making it shared when LayerBase<T>.Clone() does a shallow clone; move the allocation and all Workspace.DeclareTimestep/DeclareSequence calls into a new private InitializeWorkspace() method (create LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12) and call the same Declare* entries there) and invoke InitializeWorkspace() from the constructor and from the RWKV7Block<T>.Clone() override so each cloned block gets its own LayerWorkspace<T> instead of aliasing the original Workspace; update references to Workspace and keep the existing Ts*/Sq* constants unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs`:
- Around line 252-265: The rented recurrent buffers (cellState, normState,
allCellStates, allNormStates and any other rented tensors used as
recurrence/caches) must be zero-initialized before being read/accumulated;
update the code that rents these tensors to explicitly clear them (e.g. call the
tensor's zero/Fill/Clear method or set elements to default(T)) immediately after
TensorAllocator.Rent<T>(...) and ensure the initial timestep/cached slot
(allCellStates[...,0] and allNormStates[...,0]) is set to zeros so the first
timestep and cached state remain deterministic; apply the same zeroing to the
other rented recurrent buffers referenced in the forward/backward paths
(cellState, normState and the seq-length arrays).
In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs`:
- Around line 320-324: The rented buffers can contain stale data so preserve the
zero initial hidden state by explicitly zero-initializing `h` and the timestep-0
slice of `allHidden` after calling `TensorAllocator.Rent<T>`; specifically, in
HGRNLayer (variables `h` and `allHidden`) set all elements of `h` to zero before
it is first read and copy/clear the [batch, 0, modelDim] slice of `allHidden` to
zeros (or write `h` into that slice) so the initial hidden-state snapshot
remains deterministic when using `TensorAllocator.Rent<T>`.
In `@src/NeuralNetworks/Layers/SSM/MEGALayer.cs`:
- Around line 374-375: The rented tensor "states" from
TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension }) must
have its timestep-0 slice explicitly zeroed to ensure deterministic EMA startup;
modify the code after the Rent call to set states[..., 0, ...] (the t=0 slice
across batch and _emaDimension) to zeros (use the tensor fill/clear API or
Memory.Clear equivalent) so that subsequent reads like states[..., t, ...]
assume a defined t=0 EMA state.
- Line 417: The rented attnScores tensor in MEGALayer (created via
TensorAllocator.Rent<T>(...)) can contain stale upper-triangle values because
only j<=i entries are written; immediately clear or initialize the whole tensor
after renting to eliminate stale data. Replace the direct Rent<T> usage with a
Rent followed by a full clear/initialization (e.g.,
TensorAllocator.Clear<T>(attnScores) or a Fill with default(T)), or, if numeric
types are known (float/double), initialize masked positions to a large negative
value (e.g., -Infinity) so softmax will ignore them; ensure this happens before
any downstream read of attnScores.
In `@src/NeuralNetworks/Layers/SSM/MinLSTMLayer.cs`:
- Around line 370-373: The rented buffers from TensorAllocator.Rent<T>
(cellState and allCellStates) may contain junk; explicitly seed/zero them
immediately after renting and before any read at runtime: after calling
TensorAllocator.Rent<T>(...) for cellState and for allCellStates, call the
allocator/ tensor utility to clear or fill those tensors with zero (and also set
allCellStates[...,0,...] to zero) so the initial recurrent state is
deterministic; locate the allocations of cellState and allCellStates in
MinLSTMLayer.cs and insert the zeroing step before any usage (e.g., before the
first read at the current cell update).
In `@src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs`:
- Around line 425-427: The rented buffer allStates (from
TensorAllocator.Rent<T>) is not initialized so the t=0 snapshot is undefined;
after renting allStates, explicitly initialize the index 0 timeslice with the
current base state (copy _lastStates into allStates[..., 0, ...] or clear the
buffer to zeros) so that the recurrence invariant holds and readers of
_lastStates see a valid t=0 snapshot; ensure the same initialization is applied
wherever you write t+1 (the loop that writes into allStates at t+1) so the
initial state at index 0 is correctly set before the timestep loop runs.
In `@src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs`:
- Line 312: The rented attention buffer allAttnWeights (created via
TensorAllocator.Rent<T>) may contain stale upper-triangular values because only
the causal half (tk <= tq) is written later; explicitly zero the masked half of
the tensor after renting and before use so masked positions cannot leak old
data. Locate the Rent call that creates allAttnWeights/_lastAttnWeights in the
MultiLatentAttentionLayer (around the code that uses seqLen, batchSize, and
_numHeads) and clear all entries where tk > tq (the upper-triangular/masked
half) — either by calling the allocator/tensor zeroing API if available or by
iterating and setting those elements to zero — before any reads/writes that
assume those positions are zero.
In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs`:
- Around line 288-290: The rented buffers h, allHidden, and allDecay in
RealGatedLinearRecurrenceLayer are coming from a pool and may contain stale
data; after calling TensorAllocator.Rent<T>(...) you must explicitly
zero-initialize them so the RG-LRU initial hidden state and timestep-0 hidden
slice are zero. Locate the Rent calls that assign h, allHidden, and allDecay and
immediately clear them (e.g., fill/zero the entire tensor or at minimum set
allHidden[...,0,...] and h to zero) so no pooled garbage is consumed before the
first write; ensure the chosen zeroing method matches your Tensor API and is
applied consistently for all three rented tensors.
In `@src/NeuralNetworks/Layers/SSM/RebasedLayer.cs`:
- Around line 424-426: The rented gradient accumulators dQ, dK, and dV returned
by TensorAllocator.Rent<T>(...) must be explicitly zeroed before any +=
accumulation to avoid leaking stale data; update the RebasedLayer implementation
to clear or fill zeros into the tensors immediately after renting (before any
use) so the subsequent increments at the accumulation sites (the += operations
around the existing lines 539-542 and 589-592) produce correct gradients.
In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs`:
- Around line 507-512: The rented recurrence state tensors hReal and hImag (and
the initial slot allHiddenReal[...,0] / allHiddenImag[...,0]) are used before
being set, so explicitly zero them after allocation instead of relying on
allocator behavior: after calling TensorAllocator.Rent<T>(...) for hReal/hImag
and for allHiddenReal/allHiddenImag, fill hReal and hImag with zeros and copy
those zeros into the index-0 slice of allHiddenReal/allHiddenImag (or
memset/clear the whole allHidden tensors if simpler) so the initial hidden state
is deterministically zero; locate these tensors where they are created in
S5Layer.cs (variables hReal, hImag, allHiddenReal, allHiddenImag) and perform
the explicit zeroing immediately after each Rent call.
In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 407-408: The rented tensors `state` and `allStates` from
`TensorAllocator.Rent<T>(...)` must be explicitly initialized to their
zero/default recurrence values before the first timestep; modify the code around
the `state` and `allStates` allocations (the `state` and `allStates` variables)
to zero-fill or otherwise set the `[*, 0, ...]` snapshot and the `state` buffer
immediately after rental (using your tensor fill/clear utility or by writing
zeros across shapes constructed from `batchSize`, `_numHeads`, `_headDimension`)
so the forward pass does not rely on `Rent` returning zeroed memory.
---
Outside diff comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 455-470: The loop currently increments aggregated[...] in-place
and relies on the rented buffer being zeroed; instead, for each combination of
b, kvh, s, d compute a local accumulator (e.g. T acc = default(T)) and inside
the g loop do acc = NumOps.Add(acc, expandedGrad[new[] { b, qh, s, d }]) and
after the g loop assign aggregated[new[] { b, kvh, s, d }] = acc; update the
nested loops around variables batchSize, _numKVHeads, _headsPerGroup,
_headDimension and keep use of TensorAllocator.Rent and expandedGrad but avoid
reading aggregated before assignment.
In `@src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs`:
- Around line 232-277: The rented per-head state tensor (state) is used before
being initialized and the output is being accumulated with += even though each
head produces each output slot exactly once; to fix, explicitly zero-initialize
the rented state immediately after Rent (the local variable state in the hi
loop) and ensure output entries for the current head are either pre-zeroed once
before the head loop or assigned (not added) when writing (change the output
update in the inner loop that currently does output[...] = output[...] + sum to
a direct assignment if you zeroed output per-head, or zero output once before
the hi loop and keep the additive update), updating the code around the state
allocation and the output write (refer to the variables state, gateVal, kVal,
vVal, qVal, and output used in the nested loops).
In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 386-451: The rented tensor "state" is pooled and may contain
garbage, but the loop reads previous state at t==0 (prevS = state[...]) —
explicitly initialize or seed the recurrent state for t==0 instead of relying on
zeros in pooled memory: inside the outer time loop (for int t = 0; t < seqLen;
t++) or before it, detect t==0 and either zero-fill the "state" tensor (using
NumOps.Zero assignments) or set prevS to NumOps.Zero when t==0 and also write
that seeded state into "state" and "allStates" (or initialize
allStates[...,0,...]) so downstream code/backward passes see a defined t=0
state; update places that read prevS (the read at prevS = state[new[] { bi, hi,
di, ki }]) to use the initialized value.
In `@src/NeuralNetworks/Layers/SSM/RWKVLayer.cs`:
- Around line 340-353: The rented temporary tensor "shifted" is never read and
should be removed from the hot timestep loop: delete the
TensorAllocator.Rent<T>(...) call and all assignments to shifted[new[] { bi, d
}] inside the nested loops, and also remove any later references or
Return/Dispose calls for "shifted" if present; keep existing computations that
build rInput/kInput/vInput from x_t and xPrev and retain use of
_timeMixR/_timeMixK/_timeMixV so behavior is unchanged while removing the unused
allocation and writes.
---
Duplicate comments:
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 314-359: The Workspace is currently allocated in the constructor
body making it shared when LayerBase<T>.Clone() does a shallow clone; move the
allocation and all Workspace.DeclareTimestep/DeclareSequence calls into a new
private InitializeWorkspace() method (create LayerWorkspace<T>(timestepCount:
10, sequenceCount: 12) and call the same Declare* entries there) and invoke
InitializeWorkspace() from the constructor and from the RWKV7Block<T>.Clone()
override so each cloned block gets its own LayerWorkspace<T> instead of aliasing
the original Workspace; update references to Workspace and keep the existing
Ts*/Sq* constants unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ec7688bf-ff16-4868-a285-34242858d5b2
📒 Files selected for processing (50)
src/Diffusion/Audio/ShortTimeFourierTransform.cssrc/NeuralNetworks/Layers/ConvLSTMLayer.cssrc/NeuralNetworks/Layers/DiffusionConvLayer.cssrc/NeuralNetworks/Layers/GRULayer.cssrc/NeuralNetworks/Layers/GraphTransformerLayer.cssrc/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cssrc/NeuralNetworks/Layers/LSTMLayer.cssrc/NeuralNetworks/Layers/MeasurementLayer.cssrc/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cssrc/NeuralNetworks/Layers/SSM/ABCLayer.cssrc/NeuralNetworks/Layers/SSM/BASEDLayer.cssrc/NeuralNetworks/Layers/SSM/DeltaFormerLayer.cssrc/NeuralNetworks/Layers/SSM/DeltaNetLayer.cssrc/NeuralNetworks/Layers/SSM/DeltaProductLayer.cssrc/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cssrc/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/HGRN2Layer.cssrc/NeuralNetworks/Layers/SSM/HGRNLayer.cssrc/NeuralNetworks/Layers/SSM/HedgehogLayer.cssrc/NeuralNetworks/Layers/SSM/HyenaLayer.cssrc/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/LinearRecurrentUnitLayer.cssrc/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/LonghornLayer.cssrc/NeuralNetworks/Layers/SSM/MEGALayer.cssrc/NeuralNetworks/Layers/SSM/Mamba2Block.cssrc/NeuralNetworks/Layers/SSM/MegalodonLayer.cssrc/NeuralNetworks/Layers/SSM/MesaNetLayer.cssrc/NeuralNetworks/Layers/SSM/MinGRULayer.cssrc/NeuralNetworks/Layers/SSM/MinLSTMLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cssrc/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/PaTHAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/RWKV7Block.cssrc/NeuralNetworks/Layers/SSM/RWKVLayer.cssrc/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cssrc/NeuralNetworks/Layers/SSM/RebasedLayer.cssrc/NeuralNetworks/Layers/SSM/RetNetLayer.cssrc/NeuralNetworks/Layers/SSM/RodimusLayer.cssrc/NeuralNetworks/Layers/SSM/S4DLayer.cssrc/NeuralNetworks/Layers/SSM/S5Layer.cssrc/NeuralNetworks/Layers/SSM/S6Scan.cssrc/NeuralNetworks/Layers/SSM/ScanPatterns.cssrc/NeuralNetworks/Layers/SSM/TTTLayer.cssrc/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cssrc/NeuralNetworks/Layers/SpikingLayer.cs
…terns + fixes - 18 ones-tensor allocations eliminated via Engine.ScalarMinusTensor (ABCLayer, GatedDeltaNet, HGRN2, Hedgehog, Hyena, Kimi, LogLinear, MesaNet, MixtureOfMamba/Memories, MultiLatentAttention, Rebased, RetNet, Rodimus, TTT, TransNormerLLM) - 75 bracket-syntax allocations converted to TensorAllocator.Rent (GraphTransformerLayer 13, GraphAttentionLayer 6, Megalodon 4, CapsuleLayer 2, ConvLSTMLayer 2, plus 30+ other layers) - Fixed 20 SiLU derivative methods where ones variable was removed but still referenced (CreateOnesLike fallback) Running totals: 1405 raw → ~1330 raw, 407 pooled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (18)
src/NeuralNetworks/Layers/HeterogeneousGraphLayer.cs (1)
404-455:⚠️ Potential issue | 🟠 MajorRented output buffer is not lifecycle-safe with current reassignment pattern.
At Line 404,
outputis rented, but Line 454 reassignsoutput = Engine.TensorAdd(output, convOutput). IfTensorAddallocates (non-in-place), the originally rented buffer is orphaned and this path still allocates per edge type. That risks pool pressure and undermines the zero-allocation objective.Suggested safe fix (until in-place accumulation/explicit return is wired)
- var output = TensorAllocator.Rent<T>([batchSize, processNumNodes, _outputFeatures]); + var output = new Tensor<T>([batchSize, processNumNodes, _outputFeatures]); output.Fill(NumOps.Zero);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/HeterogeneousGraphLayer.cs` around lines 404 - 455, The rented buffer "output" from TensorAllocator.Rent<T> is being orphaned by the reassignment output = Engine.TensorAdd(output, convOutput), which can allocate a new tensor each edge type; instead perform in-place accumulation into the rented buffer to avoid allocations: replace the reassignment with an in-place add into the existing "output" buffer (use the library's in-place primitive if available, e.g., Engine.AddInto/Engine.TensorAddInPlace or a manual elementwise loop that writes convOutput into "output"), ensuring you still use the same "output" instance across iterations and only return/dispose it after the loop; update code paths referencing output to not reassign it and keep Engine.TensorAdd usage only if it has an explicit "target" parameter that writes into the rented buffer.src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs (4)
173-180:⚠️ Potential issue | 🔴 CriticalBLOCKING: Missing
GetMetadatamethod required by golden pattern.Per the neural network guidelines, every layer must implement
GetMetadatareturning a dictionary with at minimum:ModelType,ParameterCount,Architecture, andInputShape/OutputShape.🔧 Required implementation
+ /// <inheritdoc/> + public override Dictionary<string, object> GetMetadata() + { + return new Dictionary<string, object> + { + ["ModelType"] = "PiecewiseLinearEncodingLayer", + ["ParameterCount"] = ParameterCount, + ["Architecture"] = $"PLELayer(features={_numFeatures}, bins={_numBins})", + ["InputShape"] = InputShape, + ["OutputShape"] = OutputShape, + ["NumFeatures"] = _numFeatures, + ["NumBins"] = _numBins + }; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 173 - 180, The PiecewiseLinearEncodingLayer class is missing the required GetMetadata method; add a public override IDictionary<string, object> GetMetadata() in PiecewiseLinearEncodingLayer that returns a dictionary containing at minimum "ModelType" (e.g., the layer type name or enum), "ParameterCount" (use _binBoundaries.Length or result of GetParameters().Length), "Architecture" (a short string describing the piecewise-linear encoding), and "InputShape"/"OutputShape" (compute shapes based on existing input/output properties or constructor parameters); ensure the method signature matches the base Layer/LayerBase contract and use the same generic T where appropriate so callers expecting metadata can consume these keys.
7-24:⚠️ Potential issue | 🟡 MinorMissing reference citation in XML documentation.
The golden pattern requires a
<para><b>Reference:</b>section citing the original research paper. Piecewise Linear Encoding is from the TabM paper and should be cited.📚 Add reference
/// It's similar to how histograms work, but with soft (differentiable) boundaries. /// </para> +/// <para> +/// <b>Reference:</b> Gorishniy et al., "TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling" (2024). +/// </para> /// </remarks>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 7 - 24, Add a Reference citation to the XML documentation for the PiecewiseLinearEncodingLayer class: update the <remarks> block in PiecewiseLinearEncodingLayer (file/class name: PiecewiseLinearEncodingLayer.cs / class PiecewiseLinearEncodingLayer) to include a <para><b>Reference:</b> entry that cites the TabM paper (author(s), title, venue/year and DOI or URL) so the documentation follows the golden pattern and attributes the original research.
164-171: 🧹 Nitpick | 🔵 Trivial
ResetStateshould consider returning rented tensor to pool.If
TensorAllocatorexpects explicit returns, the cached_outputCachetensor should be returned here before nulling the reference. Otherwise you're leaking pooled memory.♻️ Potential fix (if Return is needed)
public override void ResetState() { + if (_outputCache != null) + { + TensorAllocator.Return(_outputCache); + } _inputCache = null; _outputCache = null; Engine.TensorFill(_binBoundaryGradients, NumOps.Zero); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 164 - 171, ResetState currently nulls cached tensors but doesn't return rented buffers to the allocator; change ResetState to check for non-null cached tensors (at minimum _outputCache, and consider _inputCache) and return them to the TensorAllocator/engine (e.g., call the allocator's Return/Release method or Engine-equivalent for rented tensors) before setting the fields to null, then continue with Engine.TensorFill(_binBoundaryGradients, NumOps.Zero).
44-45:⚠️ Potential issue | 🔴 CriticalBLOCKING:
SupportsTrainingistruebut noBackwardmethod exists.This layer declares it supports training but provides absolutely no mechanism to backpropagate gradients. This is fundamentally broken. Per the golden pattern, every neural network layer must have a
Backwardmethod that acceptsTensor<T> outputGradientand returnsTensor<T>.The gradient flow for piecewise linear encoding needs to:
- Compute gradients w.r.t.
_binBoundaries(accumulate into_binBoundaryGradients)- Compute gradients w.r.t. input features and return them
Without this, any training pipeline using this layer will fail or silently produce incorrect results.
🔧 Required implementation skeleton
+ /// <inheritdoc/> + public override Tensor<T> Backward(Tensor<T> outputGradient) + { + if (_inputCache == null) + throw new InvalidOperationException("Forward must be called before Backward"); + + int batchSize = _inputCache.Shape[0]; + var inputGradient = TensorAllocator.Rent<T>([batchSize, _numFeatures]); + + for (int b = 0; b < batchSize; b++) + { + for (int f = 0; f < _numFeatures; f++) + { + var value = _inputCache[b * _numFeatures + f]; + var grad = ComputeFeatureGradient(value, f, outputGradient, b); + inputGradient[b * _numFeatures + f] = grad; + } + } + + return inputGradient; + } + + private T ComputeFeatureGradient(T value, int featureIdx, Tensor<T> outputGradient, int batchIdx) + { + int outputOffset = batchIdx * _numFeatures * _numBins + featureIdx * _numBins; + int boundaryOffset = featureIdx * (_numBins - 1); + var inputGrad = NumOps.Zero; + + // Gradient through first bin + var firstBoundary = _binBoundaries[boundaryOffset]; + var firstDiff = NumOps.Subtract(value, firstBoundary); + if (NumOps.Compare(firstDiff, NumOps.Zero) > 0 && NumOps.Compare(firstDiff, NumOps.One) < 0) + { + inputGrad = NumOps.Add(inputGrad, outputGradient[outputOffset]); + _binBoundaryGradients[boundaryOffset] = NumOps.Subtract( + _binBoundaryGradients[boundaryOffset], outputGradient[outputOffset]); + } + + // TODO: Complete gradient computation for middle and last bins + // This requires tracking which path was taken in the Min operation + + return inputGrad; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 44 - 45, The layer advertises training but lacks a Backward implementation; add a public override Tensor<T> Backward(Tensor<T> outputGradient) in the PiecewiseLinearEncodingLayer class that computes and returns the gradient w.r.t. the layer inputs and accumulates gradients w.r.t. the bin boundaries into the existing _binBoundaryGradients field; specifically, compute per-output-bin contributions to ∂L/∂input by mapping outputGradient through the piecewise-linear encoding's local slopes and compute ∂L/∂_binBoundaries by summing the contribution where inputs sit at bin boundaries, updating _binBoundaryGradients (not replacing it), and ensure the method signature and behavior align with other layers' Backward contracts so SupportsTraining can remain true.src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs (2)
906-910:⚠️ Potential issue | 🔴 CriticalBlocking: validate
avgDegreebefore scaler division.Line [909] and Line [918] rely on
_avgDegreein division paths. If constructor inputavgDegree <= 0, this can produceInf/NaNand poison forward/backward values. Add a fail-fast guard in the constructor.Proposed fix
public PrincipalNeighbourhoodAggregationLayer( int inputFeatures, int outputFeatures, PNAAggregator[]? aggregators = null, PNAScaler[]? scalers = null, double avgDegree = 1.0, IActivationFunction<T>? activationFunction = null, IInitializationStrategy<T>? initializationStrategy = null) : base([inputFeatures], [outputFeatures], activationFunction ?? new IdentityActivation<T>()) { + if (avgDegree <= 0.0) + { + throw new ArgumentOutOfRangeException(nameof(avgDegree), "avgDegree must be > 0."); + } + InitializationStrategy = initializationStrategy ?? InitializationStrategies<T>.Eager;As per coding guidelines “missing validation of external inputs” is a blocking production-readiness issue.
Also applies to: 915-919
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs` around lines 906 - 910, The constructor(s) of PrincipalNeighbourhoodAggregationLayer must validate the avgDegree parameter and fail-fast if it's not > 0 to prevent Inf/NaN during division; add a guard in the class constructor(s) that sets _avgDegree (e.g., verify avgDegree > 0 and throw an ArgumentException/ArgumentOutOfRangeException with a clear message) and ensure any overloads or factory methods that assign _avgDegree perform the same check so that downstream code in TensorBroadcastDivide/Multiply (references: _avgDegree, avgDegreeTensor, Engine.TensorBroadcastDivide, TensorAllocator.Rent) never executes with non-positive avgDegree.
945-981:⚠️ Potential issue | 🔴 CriticalBlocking: scaler backprop is mathematically incorrect for non-identity scalers.
Line [967]-[968] sets
scalerGrad = gradSlicefor all scalers. That drops derivative factors forPNAScaler.AmplificationandPNAScaler.Attenuation, yielding incorrect gradients during training.Proposed fix
- // Backprop through scaler (simplified - assumes identity for gradient) - var scalerGrad = gradSlice; + // Backprop through scaler + Tensor<T> scalerGrad; + switch (_scalers[scalerIdx]) + { + case PNAScaler.Identity: + scalerGrad = gradSlice; + break; + case PNAScaler.Amplification: + { + var avgDegreeTensor = TensorAllocator.Rent<T>([batchSize, numNodes, 1]); + avgDegreeTensor.Fill(NumOps.FromDouble(_avgDegree)); + var factor = Engine.TensorBroadcastDivide(degExpanded, avgDegreeTensor); // degree / avgDegree + scalerGrad = Engine.TensorBroadcastMultiply(gradSlice, factor); + break; + } + case PNAScaler.Attenuation: + { + var avgDegreeTensor = TensorAllocator.Rent<T>([batchSize, numNodes, 1]); + avgDegreeTensor.Fill(NumOps.FromDouble(_avgDegree)); + var factor = Engine.TensorBroadcastDivide(avgDegreeTensor, degExpanded); // avgDegree / degree + scalerGrad = Engine.TensorBroadcastMultiply(gradSlice, factor); + break; + } + default: + scalerGrad = gradSlice; + break; + }As per coding guidelines “Simplified implementations ... missing proper logic” are blocking issues requiring immediate fix.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs` around lines 945 - 981, The code incorrectly sets scalerGrad = gradSlice for all scalers; replace that with scaler-specific backprop: inspect the scaler kind via _scalers[scalerIdx] and compute scalerGrad by multiplying gradSlice with the derivative of the forward scaler transform (e.g., for PNAScaler.Amplification multiply by the amplification factor s used in forward, for PNAScaler.Attenuation multiply by the attenuation derivative — typically 1/s or the attenuation factor used), using the Engine tensor multiply/broadcast helpers and preserving shape ([batchSize, numNodes, _inputFeatures]); ensure you retrieve the exact scalar/factor used in the forward pass (the same variable/array used when applying scalers) and apply it with Engine.TensorMultiply or TensorScalarMultiply before continuing aggregation/backprop.src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs (2)
46-48:⚠️ Potential issue | 🔴 CriticalBLOCKING: Missing
Backwardmethod despiteSupportsTraining => true.This layer declares it supports training but has no
Backwardmethod implementation. Per the golden pattern guidelines, every layer that supports training must implement backpropagation.The cached values
_inputCacheand_normalizedCachesuggest backward was intended but never implemented. This is incomplete code that cannot be shipped.You need to implement:
public override Tensor<T> Backward(Tensor<T> outputGradient) { // Backprop through layer norm // Backprop through residual connection // Backprop through attention (transpose, attention scores, projections) // Return gradient w.r.t. input }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs` around lines 46 - 48, The layer declares SupportsTraining => true but lacks the required Backward implementation; implement public override Tensor<T> Backward(Tensor<T> outputGradient) in IntersampleAttentionLayer to compute gradients through the three main components: first backprop the residual connection (split outputGradient into gradient to the attention output and passthrough to input using the saved _inputCache), then backprop through the attention block by reversing the projection steps (compute gradients w.r.t. output projection, attention output, and input projections—queries/keys/values—using transposes of the same weight matrices used in Forward, and backprop through attention scores/softmax by using saved attention scores or by recomputing from cached normalized inputs), and finally backprop through the LayerNorm using _normalizedCache to produce gradient w.r.t. the pre-normalized input; ensure you accumulate gradients into the returned Tensor<T> matching the input shape and update any parameter gradient holders for the projection and output weight matrices consistently with how Forward performed projection and residual addition.
117-117: 🧹 Nitpick | 🔵 TrivialInconsistent allocation strategy within the same method.
Line 117 uses
new Tensor<T>()for theoutputtensor while line 124 usesTensorAllocator.Rent<T>()forfeatureSlice. If the PR goal is zero-allocation forward passes, theoutputtensor should also use the pooled allocator for consistency.Additionally, the intermediate tensors from projection layers (
queries,keys,values) and operations (kT,scores,attended) may also contribute to allocation overhead.♻️ Suggested improvement
- var output = new Tensor<T>([numFeatures, batchSize, embDim]); + var output = TensorAllocator.Rent<T>([numFeatures, batchSize, embDim]);Note: If this output tensor is returned from the method (and ultimately from
Forward), you'll need a strategy to ensure the caller or the layer's lifecycle management returns it to the pool appropriately.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs` at line 117, The output tensor is allocated directly with new Tensor<T>(...) while other temporaries use TensorAllocator.Rent<T>(), causing inconsistent allocation; change the allocation of output to use TensorAllocator.Rent<T>(...) with the same shape as new Tensor<T>([numFeatures, batchSize, embDim]) and ensure it is returned to the pool by the caller or layer lifecycle if Forward returns it. Also review the projection tensors (queries, keys, values) and intermediate results (kT, scores, attended, featureSlice) and convert their allocations to use TensorAllocator.Rent<T>() consistently to eliminate stray allocations; update any ownership/return-to-pool semantics so rented tensors are released appropriately after use or by the caller for returned tensors.src/NeuralNetworks/Layers/QuantumLayer.cs (1)
218-228: 🧹 Nitpick | 🔵 TrivialInconsistent allocation strategy within
Forwardmethod.Line 210 was converted to
TensorAllocator.Rent<T>, but these intermediate tensors still usenew Tensor<T>():
- Line 218:
imagState- Line 223:
epsilonTensorFor consistent zero-allocation forward passes per PR objectives, these should also use pooled allocation.
♻️ Proposed fix for consistency
- var imagState = new Tensor<T>(realState.Shape.ToArray()); + var imagState = TensorAllocator.Rent<T>(realState.Shape.ToArray()); // Normalize each batch item: divide by sqrt(sum(|state|^2) + eps) var magnitudeSquared = Engine.ComplexMagnitudeSquared(realState, imagState); var normPerBatch = Engine.ReduceSum(magnitudeSquared, [1], keepDims: true); - var epsilonTensor = new Tensor<T>(normPerBatch.Shape.ToArray()); + var epsilonTensor = TensorAllocator.Rent<T>(normPerBatch.Shape.ToArray()); epsilonTensor.Fill(NumOps.FromDouble(1e-10));Note: Ensure proper return/disposal of these rented tensors as well (same concern as line 210).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/QuantumLayer.cs` around lines 218 - 228, The Forward method mixes pooled and non-pooled allocations: replace the direct allocations of imagState and epsilonTensor (created with new Tensor<T>) with pooled allocations via TensorAllocator.Rent<T> (matching the change at line 210), and ensure you return/dispose these rented tensors (as you do for other rented tensors) after use; update creation of denomExpanded/normalizedReal/normalizedImag to use the rented tensors as inputs and add the corresponding TensorAllocator.Return/Dispose calls for imagState and epsilonTensor in the same cleanup path.src/NeuralNetworks/Layers/BidirectionalLayer.cs (1)
52-54:⚠️ Potential issue | 🟡 MinorRemove orphaned XML documentation.
This XML comment describes "The computation engine (CPU or GPU) for vectorized operations" but there's no corresponding field or property defined after it. This appears to be leftover documentation from a removed or never-implemented member.
🧹 Proposed fix
- /// <summary> - /// The computation engine (CPU or GPU) for vectorized operations. - /// </summary> - /// <summary> /// Gets a value indicating whether this layer supports training.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/BidirectionalLayer.cs` around lines 52 - 54, Remove the orphaned XML doc comment that reads "The computation engine (CPU or GPU) for vectorized operations" in the BidirectionalLayer class; locate the stray triple-slash XML block in src/NeuralNetworks/Layers/BidirectionalLayer.cs (near the BidirectionalLayer class declaration) and delete it so there is no unused/mismatched summary comment left without a corresponding field or property. If the comment was meant for a member, instead move or reattach it to the correct field/property; otherwise simply remove the orphaned XML block.src/NeuralNetworks/Layers/FeatureTransformerLayer.cs (2)
59-60:⚠️ Potential issue | 🔴 CriticalBLOCKING:
SupportsTraining => truebut noBackwardmethod implemented.Per the golden pattern for neural network classes: "Backward Method: Must accept
Tensor<T> outputGradientand returnTensor<T>. Must implement actual backpropagation through all layers."The layer claims training support and even caches
_intermediateOutputs(line 57) for backprop, but there's noBackwardimplementation. This makes gradient-based training impossible. This is a blocking issue that must be resolved before merge.Production-ready code requires a complete
Backwardmethod that:
- Accepts
Tensor<T> outputGradient- Backpropagates through step-specific layers (reverse order)
- Backpropagates through shared layers (reverse order)
- Applies GLU backward (gradient through sigmoid gate and element-wise multiply)
- Returns input gradient
📐 Skeleton for required Backward implementation
/// <summary> /// Performs the backward pass through the Feature Transformer. /// </summary> /// <param name="outputGradient">Gradient from the next layer.</param> /// <returns>Gradient with respect to the input.</returns> public override Tensor<T> Backward(Tensor<T> outputGradient) { var currentGrad = outputGradient; int totalLayers = _numSharedLayers + _numStepSpecificLayers; int layerIdx = totalLayers - 1; // Backward through step-specific layers (reverse order) for (int i = _numStepSpecificLayers - 1; i >= 0; i--) { // Undo residual scaling if applied if (_intermediateOutputs[layerIdx].Shape[1] == currentGrad.Shape[1] && layerIdx > 0) { currentGrad = Engine.TensorMultiplyScalar(currentGrad, NumOps.FromDouble(Math.Sqrt(0.5))); // Add residual gradient path } // GLU backward currentGrad = BackwardGLU(currentGrad, /* cached pre-GLU output */); // BN backward currentGrad = _stepBNLayers[i].Backward(currentGrad); // FC backward currentGrad = _stepFCLayers[i].Backward(currentGrad); layerIdx--; } // Backward through shared layers (reverse order) for (int i = _numSharedLayers - 1; i >= 0; i--) { // Similar backward logic for shared layers currentGrad = _sharedBNLayers[i].Backward(currentGrad); currentGrad = _sharedFCLayers[i].Backward(currentGrad); layerIdx--; } return currentGrad; } private Tensor<T> BackwardGLU(Tensor<T> outputGrad, Tensor<T> preGluInput) { // Implement GLU backward: d/d(values) and d/d(gates) via chain rule // through sigmoid and element-wise multiply throw new NotImplementedException("Complete GLU backward implementation required"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/FeatureTransformerLayer.cs` around lines 59 - 60, FeatureTransformerLayer declares SupportsTraining => true but lacks the required Backward implementation; implement public override Tensor<T> Backward(Tensor<T> outputGradient) to perform full backprop using the cached _intermediateOutputs: iterate reverse through step-specific layers then shared layers, undo any residual scaling (apply Engine.TensorMultiplyScalar with NumOps.FromDouble(Math.Sqrt(0.5)) where residuals were scaled), call each layer's .Backward (e.g. _stepFCLayers[i].Backward, _stepBNLayers[i].Backward, _sharedFCLayers[i].Backward, _sharedBNLayers[i].Backward), implement a BackwardGLU helper to backprop through the GLU (compute gradients wrt gate via sigmoid derivative and element-wise multiply for values and gates), decrement layerIdx in sync with _intermediateOutputs and finally return the gradient w.r.t. the input.
206-220:⚠️ Potential issue | 🟠 MajorRented tensors
valuesandgatesare never returned to the pool — resource leak.The PR's goal is zero-allocation forward passes via pooled allocators. However, after
valuesandgatesare consumed byEngine.SigmoidandEngine.TensorMultiply, they're abandoned without being returned. This defeats the pooling benefit and may cause unbounded memory growth under repeated forward passes.🛠️ Proposed fix: Return rented tensors after use
// Apply sigmoid to gates and multiply with values var sigmoidGates = Engine.Sigmoid(gates); - return Engine.TensorMultiply(values, sigmoidGates); + var result = Engine.TensorMultiply(values, sigmoidGates); + + // Return rented tensors to the pool + TensorAllocator.Return(values); + TensorAllocator.Return(gates); + + return result; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/FeatureTransformerLayer.cs` around lines 206 - 220, The rented tensors values and gates (from TensorAllocator.Rent<T>) and the intermediate sigmoidGates produced by Engine.Sigmoid are not returned to the pool; wrap the computation in a try/finally, assign the final output from Engine.TensorMultiply to a local (e.g., result), then in finally call the allocator return method for values, gates and sigmoidGates (e.g., TensorAllocator.Return<T>(values), TensorAllocator.Return<T>(gates), TensorAllocator.Return<T>(sigmoidGates)) before returning result; ensure you reference the existing symbols Engine.Sigmoid, Engine.TensorMultiply, TensorAllocator.Rent<T>, and the local variables values/gates/sigmoidGates when applying the fix.src/NeuralNetworks/Layers/DigitCapsuleLayer.cs (1)
438-458:⚠️ Potential issue | 🔴 CriticalBLOCKING: Missing TensorAllocator.Return() calls — resource leak violating pooling contract.
This code violates AiDotNet's standard rent/return pooling lifecycle. The
predictions(line 438),couplings(line 458), andoutputtensors are rented from the pool but never explicitly returned.The pattern is further broken by variable reassignment without cleanup:
- Line 479:
output = activated;orphans the originally rented tensor- Line 486:
couplings = Engine.TensorAdd(couplings, dot);orphans the originally rented tensorAiDotNet's pooling contract (System.Buffers.ArrayPool-backed) requires: Rent → Use → Return buffer. Failure to return causes pool starvation and memory pressure in production workloads.
Every rented tensor must be explicitly returned before reassignment or method exit. Refactor to add Return calls:
Example fix pattern
var predictions = TensorAllocator.Rent<T>([...]); try { // ... use predictions } finally { TensorAllocator.Return(predictions); }Or restructure to avoid reassignment without explicit cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/DigitCapsuleLayer.cs` around lines 438 - 458, The code rents tensors (predictions, couplings, output) via TensorAllocator.Rent but never returns them, causing a pooling leak; update the DigitCapsuleLayer method to ensure every TensorAllocator.Rent<T>(...) has a matching TensorAllocator.Return(...) in all control-flow paths (including before any reassignment like output = activated and couplings = Engine.TensorAdd(...)), e.g., wrap usage in try/finally or return the original tensor before reassigning, and ensure temporary tensors produced by operations such as Engine.TensorAdd and any "activated" tensors are also returned when no longer needed so that predictions, couplings, output, and any intermediates are always returned to the pool.src/NeuralNetworks/Layers/SpatialTransformerLayer.cs (1)
743-783:⚠️ Potential issue | 🔴 CriticalBLOCKING: Rented tensor lacks proper lifecycle management for pooled allocation.
The tensor rented via
TensorAllocator.Rent<T>is stored in_lastTransformationMatrixwithout proper disposal semantics. On subsequent forward passes, this field is overwritten without returning the previously rented tensor to the pool.AiDotNet.Tensors uses pooled
Memory<T>backing with IDisposable cleanup required to safely return buffers. Without explicit return, this causes:
- Pool exhaustion over repeated forward calls
- Data corruption if the pool reuses unreturned buffers before the layer finishes using them
- Memory bloat in production scenarios with many layer instances
The fix requires proper lifecycle management:
- Either use
try-finallyto guarantee return of rented tensors, or- Wrap rented tensors in
usingscopes to ensure Dispose is calledEnsure the tensor is returned to the pool immediately after use in
ConvertToTransformationMatrix, or redesign to allocate once and reuse safely without pooling semantics.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SpatialTransformerLayer.cs` around lines 743 - 783, ConvertToTransformationMatrix currently calls TensorAllocator.Rent<T> and assigns the result (via tensor) to the layer field _lastTransformationMatrix without returning/disposing any previously rented buffer, leaking pooled Memory<T>. Fix by ensuring pooled tensors are returned to the allocator before overwriting: before assigning _lastTransformationMatrix = tensor, check if _lastTransformationMatrix is non-null and call its Dispose/Return method (or TensorAllocator.Return) in a try/finally; alternatively, allocate a non-pooled tensor or implement a reuse path that disposes the rented tensor when the layer is disposed. Reference: ConvertToTransformationMatrix, TensorAllocator.Rent<T>, and the field _lastTransformationMatrix.src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs (2)
796-808: 🧹 Nitpick | 🔵 TrivialMinor allocation from
.ToArray()and continued syntax inconsistency.Line 796 calls
dOutput.Shape.ToArray()which allocates a new array on each backward call, partially defeating the zero-allocation goal. Consider caching or using a stackalloc pattern if Shape is a span-compatible type.Additionally, line 808 uses collection expression
[...]while line 796 uses.ToArray()- more syntax inconsistency.- var dPreNorm = TensorAllocator.Rent<T>(dOutput.Shape.ToArray()); + var dPreNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]);(Using explicit dimensions avoids the
.ToArray()allocation sincebatchSize,seqLen, and_emaDimensionare already known.)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs` around lines 796 - 808, The allocation dPreNorm uses TensorAllocator.Rent<T>(dOutput.Shape.ToArray()) which creates an unnecessary array each backward call; replace that call with an explicit dimensions array using the known locals (e.g. new[] { batchSize, seqLen, _emaDimension }) so it becomes TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _emaDimension }); also make the syntax consistent with the later allocation for dEmaInput3D by using the same explicit new[] { ... } form (or stackalloc if Shape is already a Span<int>) instead of mixing .ToArray() and collection-expression syntax.
462-517: 🧹 Nitpick | 🔵 TrivialInconsistent syntax between collection expressions and array initializers.
CEMAKernelForwarduses C# 12 collection expressions ([...]) whileCEMAForwardusesnew[] {...}syntax. Both work, but consistency improves readability.- var statesReal = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension }); - var statesImag = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension }); - var outputPreNorm = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _emaDimension }); + var statesReal = TensorAllocator.Rent<T>([batchSize, seqLen + 1, _emaDimension]); + var statesImag = TensorAllocator.Rent<T>([batchSize, seqLen + 1, _emaDimension]); + var outputPreNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]);Same applies to lines 516-517:
- var outputNorm = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _emaDimension }); - var stdInv = TensorAllocator.Rent<T>(new[] { batchSize, seqLen }); + var outputNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]); + var stdInv = TensorAllocator.Rent<T>([batchSize, seqLen]);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs` around lines 462 - 517, The code mixes C# 12 collection expressions and legacy array initializers (e.g., CEMAKernelForward uses [...] while CEMAForward and the allocations for statesReal, statesImag, outputPreNorm, outputNorm, stdInv use new[] { ... }), so make them consistent: pick the project's preferred style (prefer C# 12 collection expressions if the project targets that language version) and update all tensor allocation and array/collection initializers in MegalodonLayer.cs (references: statesReal, statesImag, outputPreNorm, outputNorm, stdInv) and the methods CEMAKernelForward / CEMAForward to use the same syntax throughout. Ensure the changed syntax compiles under the repository's LangVersion and run the unit/build to verify.src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs (1)
452-474:⚠️ Potential issue | 🟠 MajorMemory leak in
ResetState: Rented tensors should be returned before nulling.When
ResetState()is called, the fields_lastKVGate,_lastKVGateRaw, and_lastStates(which were allocated viaTensorAllocator.Rent) are set tonullwithout being returned to the pool. This leaks the rented memory.🔧 Proposed fix to return rented tensors
public override void ResetState() { _lastInput = null; _lastOutput = null; _lastQuery = null; _lastKey = null; _lastValue = null; + if (_lastKVGate != null) + TensorAllocator.Return(_lastKVGate); _lastKVGate = null; + if (_lastKVGateRaw != null) + TensorAllocator.Return(_lastKVGateRaw); _lastKVGateRaw = null; _lastOutputGate = null; _lastOutputGateRaw = null; + if (_lastStates != null) + TensorAllocator.Return(_lastStates); _lastStates = null; _lastRecurrenceOutput = null; _originalInputShape = null; // ... gradient fields unchanged }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs` around lines 452 - 474, The ResetState method currently nulls out rented tensors causing leaks; before setting fields to null, check each rented buffer (at least _lastKVGate, _lastKVGateRaw, and _lastStates) and return them to the allocator (e.g., TensorAllocator.Return or the project's equivalent) if they are non-null, then set the fields to null; apply the same pattern for any other fields that hold rented tensors (inspect _lastInput/_lastOutput/_lastQuery/_lastKey/_lastValue etc.) to ensure all rented resources are returned before nulling.
♻️ Duplicate comments (4)
src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)
407-408:⚠️ Potential issue | 🔴 CriticalBlocking: initialize rented recurrence state before first timestep read.
This is a blocking correctness issue. Line 427 reads
state[...]before any write in this method, andallStates[..., 0, ...]is never initialized. IfTensorAllocator.Rent<T>returns non-zero memory, forward becomes nondeterministic.💡 Suggested fix
var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension }); var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension }); + state.Fill(NumOps.Zero); + for (int bi = 0; bi < batchSize; bi++) + for (int hi = 0; hi < _numHeads; hi++) + for (int di = 0; di < _headDimension; di++) + for (int dj = 0; dj < _headDimension; dj++) + allStates[new[] { bi, 0, hi, di, dj }] = NumOps.Zero;As per coding guidelines, "Every PR must contain production-ready code" and "All methods have complete, production-ready implementations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 407 - 408, The rented tensors state and allStates must be initialized before any read: after creating state with TensorAllocator.Rent<T>(...) and allStates with TensorAllocator.Rent<T>(...), explicitly zero (or set to the correct initial recurrence values) every element of state and set allStates[..., 0, ...] = state (or the chosen initial values) so the first timestep read is deterministic; locate the allocations around the state and allStates variables in the Forward/Process method and add initialization logic that fills the buffers with zeros or the appropriate initial hidden/state values before any indexed reads occur.src/NeuralNetworks/Layers/SSM/S5Layer.cs (1)
507-512:⚠️ Potential issue | 🔴 CriticalBlocking: zero the rented recurrent state before the first scan step.
hReal/hImagare read at Line 523 and Line 524 before any write, and the[*, 0, *]slice ofallHiddenReal/allHiddenImagis later consumed as the base state at Line 846 and Line 847. After switching fromnew Tensor<T>toTensorAllocator.Rent<T>, the scan can now depend on stale pooled contents.Suggested fix
var hReal = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension }); var hImag = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension }); +hReal.Fill(NumOps.Zero); +hImag.Fill(NumOps.Zero);var allHiddenReal = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _stateDimension }); var allHiddenImag = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _stateDimension }); +for (int bi = 0; bi < batchSize; bi++) + for (int n = 0; n < _stateDimension; n++) + { + allHiddenReal[new[] { bi, 0, n }] = NumOps.Zero; + allHiddenImag[new[] { bi, 0, n }] = NumOps.Zero; + }As per coding guidelines, "Every PR must contain production-ready code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs` around lines 507 - 512, The rented tensors hReal, hImag and the buffers allHiddenReal/allHiddenImag returned by TensorAllocator.Rent<T> can contain stale pooled data; before the first scan step (before any reads at the scan start and before storing/consuming the [*,0,*] slice) explicitly zero-fill hReal and hImag and zero the [batch,0,stateDim] slice of allHiddenReal/allHiddenImag (or zero the entire allocated buffers) so the scan and subsequent consumption use a deterministic zero initial state; locate the allocations of hReal/hImag and allHiddenReal/allHiddenImag and insert a clear/Fill(0) call immediately after each Rent<T> and before the scan/first-read points.src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs (1)
312-312:⚠️ Potential issue | 🟠 MajorStill zero the masked half of
_lastAttnWeights.Only the causal half is written later, so this swap now depends on
TensorAllocator.Rent<T>returning zeroed memory fortk > tq. Forward output is fine, but the cached_lastAttnWeightsis no longer guaranteed to have zeros in masked positions.Suggested fix
- var allAttnWeights = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen }); + var allAttnWeights = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen }); + allAttnWeights.Fill(NumOps.Zero);This verifies whether
Rentclears buffers and shows the causal write pattern that leaves the upper triangle untouched:Verification script
#!/bin/bash set -euo pipefail echo "== TensorAllocator implementation ==" fd -i 'TensorAllocator*.cs' . -x sed -n '1,240p' {} echo echo "== MLA causal attention writes ==" sed -n '307,373p' src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs` at line 312, The rented buffer allAttnWeights from TensorAllocator.Rent<T> in MultiLatentAttentionLayer currently relies on Rent returning zeros for masked positions, but only the causal (lower-triangle) entries are written so _lastAttnWeights may contain garbage in the masked half; fix by explicitly zeroing the masked/upper-triangle region (or the entire tensor) immediately after creating allAttnWeights (and before swapping into _lastAttnWeights) so that entries where tk > tq are guaranteed zeroed regardless of TensorAllocator behavior; locate the allocation site of allAttnWeights and ensure an explicit zero-fill of the appropriate indices (or whole buffer) before further use.src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)
314-341:⚠️ Potential issue | 🟠 MajorClone() still shares workspace buffers — unresolved from previous review.
The workspace is created in
InitializeParameters()butRWKV7Block<T>does not overrideClone(). SinceLayerBase<T>.Clone()usesMemberwiseClone(), cloned instances will share the sameWorkspacereference, causing buffer corruption when both instances run forward passes.Factor the workspace initialization into a helper method and call it from both the constructor and a new
Clone()override:🔧 Proposed fix
+ private void InitializeWorkspace() + { + Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12); + // TimeMixing timestep buffers + Workspace.DeclareTimestep(TsRInput, _modelDimension); + Workspace.DeclareTimestep(TsKInput, _modelDimension); + Workspace.DeclareTimestep(TsVInput, _modelDimension); + Workspace.DeclareTimestep(TsAInput, _modelDimension); + Workspace.DeclareTimestep(TsBInput, _modelDimension); + Workspace.DeclareTimestep(TsWkvOut, _modelDimension); + Workspace.DeclareTimestep(TsYt, _modelDimension); + // ChannelMixing timestep buffers + Workspace.DeclareTimestep(TsCmRInput, _modelDimension); + Workspace.DeclareTimestep(TsCmKInput, _modelDimension); + Workspace.DeclareTimestep(TsCmKSiLU, _ffnDimension); + // TimeMixing sequence buffers + Workspace.DeclareSequence(SqAllR, _modelDimension); + Workspace.DeclareSequence(SqAllK, _modelDimension); + Workspace.DeclareSequence(SqAllV, _modelDimension); + Workspace.DeclareSequence(SqAllA, _modelDimension); + Workspace.DeclareSequence(SqAllB, _modelDimension); + Workspace.DeclareSequence(SqAllWkv, _modelDimension); + Workspace.DeclareSequence(SqAllWkvPre, _modelDimension); + Workspace.DeclareSequence(SqAllWkvGated, _modelDimension); + // ChannelMixing sequence buffers + Workspace.DeclareSequence(SqCmAllRGate, _modelDimension); + Workspace.DeclareSequence(SqCmAllVProj, _modelDimension); + Workspace.DeclareSequence(SqCmAllSiLU, _ffnDimension); + Workspace.DeclareSequence(SqCmAllKProj, _ffnDimension); + } + + public override LayerBase<T> Clone() + { + var clone = (RWKV7Block<T>)MemberwiseClone(); + clone.InitializeWorkspace(); + return clone; + } private void InitializeParameters() { // ... existing parameter init ... - // Arena workspace: pre-allocate forward pass buffers for zero-allocation hot path - Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12); - // ... all DeclareTimestep/DeclareSequence calls ... + InitializeWorkspace(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` around lines 314 - 341, RWKV7Block<T> currently creates a LayerWorkspace<T> in InitializeParameters() but does not override Clone(), so LayerBase<T>.Clone()'s MemberwiseClone() leaves cloned instances sharing the same Workspace; fix by adding a private helper (e.g., InitializeWorkspace()) that performs the Workspace = new LayerWorkspace<T>(...) and DeclareTimestep/DeclareSequence calls (the same logic currently in InitializeParameters), call that helper from InitializeParameters() and the class constructor, and override Clone() in RWKV7Block<T> to call base.Clone(), cast to RWKV7Block<T>, then invoke InitializeWorkspace() on the cloned instance to replace the shared Workspace with a fresh one before returning the clone; ensure you reference the Workspace field and methods TsRInput/TsKInput/.../SqCmAllKProj so the same buffers are declared for the clone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f7a8a0db-8133-48a9-93fe-1246db4c6a6d
📒 Files selected for processing (54)
src/NeuralNetworks/Layers/ALiBiPositionalBiasLayer.cssrc/NeuralNetworks/Layers/AttentionLayer.cssrc/NeuralNetworks/Layers/BatchEnsembleLayer.cssrc/NeuralNetworks/Layers/BidirectionalLayer.cssrc/NeuralNetworks/Layers/CapsuleLayer.cssrc/NeuralNetworks/Layers/ConditionalRandomFieldLayer.cssrc/NeuralNetworks/Layers/ConvLSTMLayer.cssrc/NeuralNetworks/Layers/DigitCapsuleLayer.cssrc/NeuralNetworks/Layers/DirectionalGraphLayer.cssrc/NeuralNetworks/Layers/EdgeConditionalConvolutionalLayer.cssrc/NeuralNetworks/Layers/FeatureTransformerLayer.cssrc/NeuralNetworks/Layers/GRULayer.cssrc/NeuralNetworks/Layers/GraphAttentionLayer.cssrc/NeuralNetworks/Layers/GraphConvolutionalLayer.cssrc/NeuralNetworks/Layers/GraphIsomorphismLayer.cssrc/NeuralNetworks/Layers/GraphTransformerLayer.cssrc/NeuralNetworks/Layers/HeterogeneousGraphLayer.cssrc/NeuralNetworks/Layers/HyperbolicLinearLayer.cssrc/NeuralNetworks/Layers/InteractingLayer.cssrc/NeuralNetworks/Layers/IntersampleAttentionLayer.cssrc/NeuralNetworks/Layers/MemoryWriteLayer.cssrc/NeuralNetworks/Layers/MessagePassingLayer.cssrc/NeuralNetworks/Layers/ObliviousDecisionTreeLayer.cssrc/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cssrc/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cssrc/NeuralNetworks/Layers/QuantumLayer.cssrc/NeuralNetworks/Layers/RepParameterizationLayer.cssrc/NeuralNetworks/Layers/SSM/ABCLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cssrc/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/HGRN2Layer.cssrc/NeuralNetworks/Layers/SSM/HedgehogLayer.cssrc/NeuralNetworks/Layers/SSM/HybridBlockScheduler.cssrc/NeuralNetworks/Layers/SSM/HyenaLayer.cssrc/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/LonghornLayer.cssrc/NeuralNetworks/Layers/SSM/MegalodonLayer.cssrc/NeuralNetworks/Layers/SSM/MesaNetLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cssrc/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/RWKV7Block.cssrc/NeuralNetworks/Layers/SSM/RebasedLayer.cssrc/NeuralNetworks/Layers/SSM/RetNetLayer.cssrc/NeuralNetworks/Layers/SSM/RodimusLayer.cssrc/NeuralNetworks/Layers/SSM/S4DLayer.cssrc/NeuralNetworks/Layers/SSM/S5Layer.cssrc/NeuralNetworks/Layers/SSM/TTTLayer.cssrc/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cssrc/NeuralNetworks/Layers/SoftTreeLayer.cssrc/NeuralNetworks/Layers/SpatialTransformerLayer.cssrc/NeuralNetworks/Layers/SpikingLayer.cs
…dation, docs Critical correctness fixes: - Reverted 21 recurrent state buffer allocations back to new Tensor<T>() for guaranteed zero initialization (ExtendedLSTM, HGRN, MEGA, MinLSTM, MixtureOfMemories, MultiLatentAttention, RealGatedLinearRecurrence, Rebased gradient accumulators, S5, TransNormerLLM) - ForwardArena.Reset: avoid concurrent modification by copying Keys first - ForwardArena.RentUninitialized: no longer clears on grow path - ShapeKey: supports arbitrary rank (was limited to 4 dims) - LayerWorkspace: constructor and DeclareTimestep/Sequence validate bounds - RWKV7Block: moved BeginForward to main Forward() so ChannelMixing also works - RWKV7Block: removed unused TsYt buffer, cleaned up wkvOutput null check - LayerBase: fixed duplicate <summary> XML tag - Benchmark: includes BeginForward in workspace measurement Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Deployment failed with the following error: Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 80 out of 80 changed files in this pull request and generated 19 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (6)
src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)
527-530:⚠️ Potential issue | 🟠 MajorRemove remaining raw allocation in SiLU derivative hot path.
new Tensor<T>(...)here reintroduces avoidable allocation churn. Keep this path pooled to stay consistent with the PR’s zero-allocation/perf goal.♻️ Suggested fix
private Tensor<T> ComputeSiLUDerivative(Tensor<T> x) { var sig = Engine.Sigmoid(x); var oneMinusSig = Engine.ScalarMinusTensor(NumOps.One, sig); var xTimesOneMinusSig = Engine.TensorMultiply(x, oneMinusSig); - var ones_t = new Tensor<T>(xTimesOneMinusSig.Shape.ToArray()); ones_t.Fill(NumOps.One); - var onePlusXSig = Engine.TensorAdd(ones_t, xTimesOneMinusSig); + var ones = TensorAllocator.Rent<T>(xTimesOneMinusSig.Shape.ToArray()); + ones.Fill(NumOps.One); + var onePlusXSig = Engine.TensorAdd(ones, xTimesOneMinusSig); return Engine.TensorMultiply(sig, onePlusXSig); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 527 - 530, Replace the raw allocation new Tensor<T>(...) used to create ones_t with the tensor-pooling API so the SiLU derivative hot path stays zero-allocation: instead of "var ones_t = new Tensor<T>(xTimesOneMinusSig.Shape.ToArray()); ones_t.Fill(NumOps.One);" acquire a pooled tensor with the same shape (using the Engine method that returns a pooled tensor for a given shape or a "like" tensor, e.g. Engine.TensorLike/Engine.CreatePooledTensor/Engine.AcquireTensorWithShape of xTimesOneMinusSig), call Fill(NumOps.One) on that pooled tensor, use it in the existing Engine.TensorAdd to produce onePlusXSig, and ensure you follow the layer's tensor lifetime conventions (release/Dispose the pooled tensor if required) to avoid reintroducing allocation churn.src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs (1)
288-288:⚠️ Potential issue | 🔴 CriticalBlocking: zero-initialize rented recurrent state
h.Line 318 reads
hbefore any write in timestep 0. WithTensorAllocator.Rent, that can pull stale values and corrupt recurrence determinism.🐛 Suggested fix
- var h = TensorAllocator.Rent<T>(new[] { batchSize, _recurrenceDimension }); + var h = TensorAllocator.Rent<T>(new[] { batchSize, _recurrenceDimension }); + h.Fill(NumOps.Zero);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs` at line 288, The rented recurrent state h (allocated with TensorAllocator.Rent<T>(...)) can contain stale memory and must be zero-initialized before use in RealGatedLinearRecurrenceLayer; change the allocation site that creates h to zero it immediately after Rent (e.g., call the allocator's zeroing variant or invoke a Clear/Fill(0) on h) so timestep 0 doesn't read uninitialized values and recurrence determinism is preserved.src/NeuralNetworks/Layers/SSM/HGRNLayer.cs (1)
320-320:⚠️ Potential issue | 🔴 CriticalBlocking: initialize rented hidden state to zero before recurrence.
Line 335 consumes
has previous state on timestep 0. With pooled memory, that can be stale unless explicitly zeroed.🐛 Suggested fix
- var h = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); + var h = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); + h.Fill(NumOps.Zero);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs` at line 320, The rented tensor h (created via TensorAllocator.Rent<T>) in HGRNLayer.cs may contain stale pooled memory and must be zero-initialized before being used as the previous hidden state on timestep 0; after calling TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }) set all elements of h to zero (e.g., a zero-fill or Clear/Fill method appropriate to the tensor type) immediately after allocation and before the recurrence loop, so the initial hidden state is deterministic.src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs (1)
302-302:⚠️ Potential issue | 🔴 CriticalBlocking: clear
h_tafter rent before additive accumulation.Line 346 accumulates into
h_t(+=semantics). Without zero-init, pooled residue contaminates outputs.🐛 Suggested fix
- var h_t = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); + var h_t = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); + h_t.Fill(NumOps.Zero);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs` at line 302, The rented tensor h_t from TensorAllocator.Rent<T>(...) must be zero-initialized before any additive accumulation (you later do += into h_t), so immediately after var h_t = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); clear its contents to default/zero; do this by calling the allocator/ tensor clear helper (e.g., TensorAllocator.Clear(h_t) or h_t.Fill(default(T))) or, if the tensor exposes a buffer, Array.Clear(h_t.Buffer, 0, h_t.Length) so pooled residue cannot contaminate later += operations.src/NeuralNetworks/Layers/SSM/S5Layer.cs (1)
757-793:⚠️ Potential issue | 🟠 MajorDo not make these gradient accumulators depend on allocator clearing.
dU,dhReal, anddhImagare accumulation buffers. The first writes at Lines 817, 826, 828, and 904 are additive, so this change is only safe ifTensorAllocator.Rent<T>()is guaranteed to return cleared memory. Zero them locally afterRent()or keep raw allocation here.Minimal safe change
var dU = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _modelDimension }); +dU.Fill(NumOps.Zero); ... var dhReal = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension }); var dhImag = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension }); +dhReal.Fill(NumOps.Zero); +dhImag.Fill(NumOps.Zero);As per coding guidelines, "Every PR must contain production-ready code."
Run this to confirm what
TensorAllocator.Rent<T>()guarantees today:#!/bin/bash set -euo pipefail allocator_file="$(fd -i 'TensorAllocator.cs' src | head -n1)" printf 'TensorAllocator file: %s\n\n' "$allocator_file" rg -n -C4 'class TensorAllocator|Rent<|Clear|Fill|Zero' "$allocator_file" printf '\nAccumulator sites in S5Layer:\n\n' sed -n '750,910p' src/NeuralNetworks/Layers/SSM/S5Layer.cs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs` around lines 757 - 793, The buffers dU, dhReal, and dhImag are used as accumulation buffers but are obtained via TensorAllocator.Rent<T>(), which may return non-zero memory; zero them immediately after renting to avoid relying on allocator clearing. Locate the Rent calls for dU, dhReal, and dhImag in S5Layer (references: dU, dhReal, dhImag, TensorAllocator.Rent<T>()) and call a local zeroing operation (e.g., fill/clear the rented tensor memory to 0) before any additive writes occur, or switch to a raw/new allocation that guarantees zeroed contents.src/Memory/LayerWorkspace.cs (1)
32-33:⚠️ Potential issue | 🔴 CriticalBlocking: validate sizes and slot indices before the fast path.
A fresh
BeginForward(0, 0)returns immediately because_lastBatchSizeand_lastSeqLenalso start at0, so nothing gets allocated and the next accessor fails later withInvalidOperationException.Timestep()andSequence()also surface rawIndexOutOfRangeExceptionfor bad slots instead of a stable contract.Minimal guardrails
- private int _lastBatchSize; - private int _lastSeqLen; + private int _lastBatchSize = -1; + private int _lastSeqLen = -1; ... public void BeginForward(int batchSize, int seqLen = 1) { + if (batchSize <= 0) + throw new ArgumentOutOfRangeException(nameof(batchSize)); + if (seqLen <= 0) + throw new ArgumentOutOfRangeException(nameof(seqLen)); if (batchSize == _lastBatchSize && seqLen == _lastSeqLen) return; ... - public Tensor<T> Timestep(int index) => _timestepBuffers[index] + public Tensor<T> Timestep(int index) + { + if ((uint)index >= (uint)_timestepBuffers.Length) + throw new ArgumentOutOfRangeException(nameof(index)); + return _timestepBuffers[index] ?? throw new InvalidOperationException($"Timestep buffer {index} not allocated. Call BeginForward first."); + } ... - public Tensor<T> Sequence(int index) => _sequenceBuffers[index] + public Tensor<T> Sequence(int index) + { + if ((uint)index >= (uint)_sequenceBuffers.Length) + throw new ArgumentOutOfRangeException(nameof(index)); + return _sequenceBuffers[index] ?? throw new InvalidOperationException($"Sequence buffer {index} not allocated. Call BeginForward first."); + }As per coding guidelines, "Every PR must contain production-ready code" and production-ready code must not ship with "missing validation of external inputs".
Also applies to: 75-119
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Memory/LayerWorkspace.cs` around lines 32 - 33, BeginForward currently fast-paths when batch/seq equal _lastBatchSize/_lastSeqLen which allows BeginForward(0,0) to return without allocations and later accessors throw; add explicit validation at the start of BeginForward(batchSize, seqLen) to reject non-positive sizes or to allocate/initialize state when either argument differs from _lastBatchSize/_lastSeqLen (update _lastBatchSize/_lastSeqLen only after successful allocation). Also add bounds checks in Timestep(int slot) and Sequence(int slot) to verify slot is within [0, _lastSeqLen) (or the correct per-batch/per-seq range) and throw a stable InvalidOperationException with a clear message if invalid instead of letting IndexOutOfRangeException surface; reference these symbols: BeginForward, _lastBatchSize, _lastSeqLen, Timestep, Sequence, and the accessor that reads slots.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs`:
- Around line 72-75: Update the stale XML doc comment in
ArenaAllocationBenchmarks.cs that describes the Baseline allocation count:
recalc using TimestepsPerForward (which is 32) and change the text to state
"Creates 8 sequence + 7×32 timestep tensors = 232 allocations per forward pass"
(or equivalent wording and formula showing 8 + 7×TimestepsPerForward = 232) so
the comment matches the actual benchmark parameters; locate the comment near the
ArenaAllocationBenchmarks class/method where TimestepsPerForward is referenced
and replace the incorrect 904 value and formula.
In `@src/Memory/ForwardArena.cs`:
- Around line 14-15: Change the public API surface by making the
allocator/internal memory types non-public: change the ForwardArena<T> class and
the ShapeKey type from public to internal (and any other occurrences at the
referenced location) so they are not exposed to consumers; ensure any
constructors, methods, or nested types remain internal, and if external tests or
assemblies need access, grant it via InternalsVisibleTo rather than keeping
those types public.
- Around line 137-159: The ShapeKey constructor currently stores the incoming
mutable array reference (_dims = shape) which allows external mutation to break
dictionary behavior; change the constructor in ShapeKey to make a defensive copy
of the provided int[] (e.g., allocate a new int[shape.Length] and copy elements
or use shape.Clone/Array.Copy) and assign that copy to _dims, and also validate
shape is not null; leave GetHashCode(), Equals(object), and Equals(ShapeKey)
as-is so hashing and equality operate on the immutable internal copy.
- Around line 72-77: The Reset method allocates by creating a temporary List
from _cursors.Keys; instead iterate the dictionary's key collection directly to
avoid the allocation. Modify Reset (method Reset) to loop over _cursors.Keys
(the Dictionary<ShapeKey, ...>.KeyCollection) and set _cursors[key] = 0 for each
key, ensuring you do not add/remove keys (only update values) so the key
collection iteration remains valid and allocation-free.
In `@src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs`:
- Around line 536-540: The current SiLU-derivative code allocates a full-size
ones tensor via ones_t and onePlusXSig; instead compute sig * (1 + x*(1-sig))
algebraically as sig + x * sig * (1 - sig) to avoid the allocation: keep using
sig and oneMinusSig (from Engine.ScalarMinusTensor(NumOps.One, sig)), compute
temp = sig * oneMinusSig (using Engine.TensorMultiply), then xTimesTemp = x *
temp, and finally return Engine.TensorAdd(sig, xTimesTemp); update the code
around variables sig, oneMinusSig, x, TensorMultiply and TensorAdd to follow
this allocation-free sequence.
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 671-673: ApplyGroupNorm currently allocates a new shape array each
call via input.Shape.ToArray(), reintroducing GC pressure in the per-token
TimeMixingForward loop; change ApplyGroupNorm to accept a precomputed/timestep
workspace shape or a pre-rented output Tensor from a caller-managed timestep
buffer (or expose a cached shape array) so it calls TensorAllocator.Rent with
that cached shape instead of input.Shape.ToArray(), and update callers (e.g.,
TimeMixingForward) to pass the workspace buffer or precomputed shape; keep
references: ApplyGroupNorm, TimeMixingForward, TensorAllocator.Rent, and
input.Shape.ToArray().
---
Duplicate comments:
In `@src/Memory/LayerWorkspace.cs`:
- Around line 32-33: BeginForward currently fast-paths when batch/seq equal
_lastBatchSize/_lastSeqLen which allows BeginForward(0,0) to return without
allocations and later accessors throw; add explicit validation at the start of
BeginForward(batchSize, seqLen) to reject non-positive sizes or to
allocate/initialize state when either argument differs from
_lastBatchSize/_lastSeqLen (update _lastBatchSize/_lastSeqLen only after
successful allocation). Also add bounds checks in Timestep(int slot) and
Sequence(int slot) to verify slot is within [0, _lastSeqLen) (or the correct
per-batch/per-seq range) and throw a stable InvalidOperationException with a
clear message if invalid instead of letting IndexOutOfRangeException surface;
reference these symbols: BeginForward, _lastBatchSize, _lastSeqLen, Timestep,
Sequence, and the accessor that reads slots.
In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs`:
- Line 302: The rented tensor h_t from TensorAllocator.Rent<T>(...) must be
zero-initialized before any additive accumulation (you later do += into h_t), so
immediately after var h_t = TensorAllocator.Rent<T>(new[] { batchSize,
_modelDimension }); clear its contents to default/zero; do this by calling the
allocator/ tensor clear helper (e.g., TensorAllocator.Clear(h_t) or
h_t.Fill(default(T))) or, if the tensor exposes a buffer,
Array.Clear(h_t.Buffer, 0, h_t.Length) so pooled residue cannot contaminate
later += operations.
In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs`:
- Line 320: The rented tensor h (created via TensorAllocator.Rent<T>) in
HGRNLayer.cs may contain stale pooled memory and must be zero-initialized before
being used as the previous hidden state on timestep 0; after calling
TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }) set all elements
of h to zero (e.g., a zero-fill or Clear/Fill method appropriate to the tensor
type) immediately after allocation and before the recurrence loop, so the
initial hidden state is deterministic.
In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs`:
- Line 288: The rented recurrent state h (allocated with
TensorAllocator.Rent<T>(...)) can contain stale memory and must be
zero-initialized before use in RealGatedLinearRecurrenceLayer; change the
allocation site that creates h to zero it immediately after Rent (e.g., call the
allocator's zeroing variant or invoke a Clear/Fill(0) on h) so timestep 0
doesn't read uninitialized values and recurrence determinism is preserved.
In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs`:
- Around line 757-793: The buffers dU, dhReal, and dhImag are used as
accumulation buffers but are obtained via TensorAllocator.Rent<T>(), which may
return non-zero memory; zero them immediately after renting to avoid relying on
allocator clearing. Locate the Rent calls for dU, dhReal, and dhImag in S5Layer
(references: dU, dhReal, dhImag, TensorAllocator.Rent<T>()) and call a local
zeroing operation (e.g., fill/clear the rented tensor memory to 0) before any
additive writes occur, or switch to a raw/new allocation that guarantees zeroed
contents.
In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 527-530: Replace the raw allocation new Tensor<T>(...) used to
create ones_t with the tensor-pooling API so the SiLU derivative hot path stays
zero-allocation: instead of "var ones_t = new
Tensor<T>(xTimesOneMinusSig.Shape.ToArray()); ones_t.Fill(NumOps.One);" acquire
a pooled tensor with the same shape (using the Engine method that returns a
pooled tensor for a given shape or a "like" tensor, e.g.
Engine.TensorLike/Engine.CreatePooledTensor/Engine.AcquireTensorWithShape of
xTimesOneMinusSig), call Fill(NumOps.One) on that pooled tensor, use it in the
existing Engine.TensorAdd to produce onePlusXSig, and ensure you follow the
layer's tensor lifetime conventions (release/Dispose the pooled tensor if
required) to avoid reintroducing allocation churn.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 27f7164a-d448-401f-8e50-caeb21a66af7
📒 Files selected for processing (15)
AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cssrc/Memory/ForwardArena.cssrc/Memory/LayerWorkspace.cssrc/NeuralNetworks/Layers/LayerBase.cssrc/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cssrc/NeuralNetworks/Layers/SSM/HGRNLayer.cssrc/NeuralNetworks/Layers/SSM/MEGALayer.cssrc/NeuralNetworks/Layers/SSM/MinLSTMLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cssrc/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/RWKV7Block.cssrc/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cssrc/NeuralNetworks/Layers/SSM/RebasedLayer.cssrc/NeuralNetworks/Layers/SSM/S5Layer.cssrc/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs
…leanup SiLU derivative optimization (18 files): - Replaced ones_t tensor allocation + Fill(One) + TensorAdd pattern with Engine.TensorAddScalar(x, NumOps.One) — zero allocation for 1+x computation - ABCLayer, GatedDeltaNet, GatedDeltaProduct, GatedSlotAttention, HGRN2, Hedgehog, Hyena, Kimi, LogLinear, MesaNet, MixtureOfMamba/Memories, MultiLatentAttention, Rebased, RetNet, Rodimus, TransNormerLLM, TTT ForwardArena fixes: - Reset() uses pre-cached keys array instead of allocating List each call - ShapeKey defensive-copies the shape array to prevent mutation - Made ForwardArena internal (implementation detail, not public API) Other fixes: - RWKV7Block: removed unused TsYt constant - DiffusionConvLayer: removed duplicate using directive - S6Scan: eliminated wasted Rent in initialState branch (clone instead) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)
485-499:⚠️ Potential issue | 🔴 CriticalBlocking: remove the unused
CreateOnesLikehelper.The SiLU rewrite at Lines 488-490 makes Lines 494-499 dead code. Remove it before merge.
🧹 Suggested cleanup
- private Tensor<T> CreateOnesLike(Tensor<T> template) - { - var ones = new Tensor<T>(template.Shape.ToArray()); - ones.Fill(NumOps.One); - return ones; - }As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 485 - 499, Remove the dead CreateOnesLike helper: delete the private method CreateOnesLike(Tensor<T> template) since ComputeSiLUDerivative now computes SiLU without it; search for any remaining references to CreateOnesLike and remove or replace them, and run build/tests to ensure no other code depended on it (also remove any now-unused imports or fields that existed solely for that helper).src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)
339-366:⚠️ Potential issue | 🟠 MajorThis forward path still allocates inside
LightningAttentionForward.Lines 340-366 remove some intermediates, but
Forward()still reachesnew Tensor<T>()forstateandallStatesat Lines 407-408 on every call. That keeps this layer off the zero-allocation path this PR is targeting.♻️ Suggested follow-up
- var state = new Tensor<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension }); - var allStates = new Tensor<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension }); + var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension }); + var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension }); + state.Fill(NumOps.Zero); + for (int bi = 0; bi < batchSize; bi++) + for (int hi = 0; hi < _numHeads; hi++) + for (int di = 0; di < _headDimension; di++) + for (int dj = 0; dj < _headDimension; dj++) + allStates[new[] { bi, 0, hi, di, dj }] = NumOps.Zero;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 339 - 366, The forward path still allocates inside LightningAttentionForward (Forward calls new Tensor<T> for state and allStates), so change LightningAttentionForward to avoid internal allocations by adding parameters (e.g., out or preallocated Tensor parameters like stateOut, allStatesOut) or accept rented buffers and write into them; update Forward to rent those tensors via TensorAllocator.Rent<T>(...) and pass them into LightningAttentionForward (or provide a reusable per-batch buffer on the layer) so no new Tensor<T>() is created inside LightningAttentionForward; ensure _last... fields are set to the rented/result tensors and release/return buffers consistently.src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs (1)
388-404:⚠️ Potential issue | 🔴 CriticalBlocking: drop the leftover
CreateOnesLikehelper.The scalar-based SiLU rewrite made Lines 399-404 unreachable. Remove the helper in the same change.
🧹 Suggested cleanup
- private Tensor<T> CreateOnesLike(Tensor<T> template) - { - var ones = new Tensor<T>(template.Shape.ToArray()); - ones.Fill(NumOps.One); - return ones; - }As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs` around lines 388 - 404, Remove the now-unused CreateOnesLike helper method: delete the private Tensor<T> CreateOnesLike(Tensor<T> template) method (the unused symbol CreateOnesLike) from the file so no dead/unreachable code remains after the scalar-based SiLU rewrite in ComputeSiLUDerivative; also search for and remove any remaining references to CreateOnesLike to ensure no callers remain.src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs (1)
442-458:⚠️ Potential issue | 🔴 CriticalBlocking: remove the unused
CreateOnesLikehelper.After the scalar-based rewrite at Lines 447-449, the private helper at Lines 453-458 is dead code. Keep the cleanup in the same refactor.
🧹 Suggested cleanup
- private Tensor<T> CreateOnesLike(Tensor<T> template) - { - var ones = new Tensor<T>(template.Shape.ToArray()); - ones.Fill(NumOps.One); - return ones; - }As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs` around lines 442 - 458, Remove the dead helper CreateOnesLike by deleting the private method CreateOnesLike(Tensor<T> template) and any references to it; the ComputeSiLUDerivative method now uses scalar ops and no longer needs this helper, so ensure there are no remaining calls to CreateOnesLike and run tests/compile to confirm cleanup.src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs (1)
607-621:⚠️ Potential issue | 🔴 CriticalBlocking: remove the orphaned
CreateOnesLikehelper.The SiLU refactor no longer calls this private method, so the file now ships dead code from the change. Production-ready code should delete the orphaned helper instead of leaving refactoring leftovers behind.
🧹 Suggested cleanup
- private Tensor<T> CreateOnesLike(Tensor<T> template) - { - var ones = new Tensor<T>(template.Shape.ToArray()); - ones.Fill(NumOps.One); - return ones; - } - `#region` Parameter ManagementAs per coding guidelines, "Production Readiness (CRITICAL - Flag as BLOCKING)" includes "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs` around lines 607 - 621, Remove the unused helper CreateOnesLike from the file to eliminate dead code introduced during the SiLU refactor: delete the private method CreateOnesLike(Tensor<T> template) and any references (none expected) so only ComputeSiLUDerivative and active helpers remain; ensure no compilation or style issues remain after removal and run tests/build to verify.src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs (1)
465-482:⚠️ Potential issue | 🔴 CriticalBlocking: delete the unused
CreateOnesLikehelper.After switching
ComputeSiLUDerivative()to scalar/tensor ops, this private method is dead code. Production-ready code should remove the stale helper instead of leaving refactoring debris in the layer.🧹 Suggested cleanup
- private Tensor<T> CreateOnesLike(Tensor<T> template) - { - var ones = new Tensor<T>(template.Shape.ToArray()); - ones.Fill(NumOps.One); - return ones; - } - private Tensor<T> DepthwiseConv1DBackward(As per coding guidelines, "Production Readiness (CRITICAL - Flag as BLOCKING)" includes "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs` around lines 465 - 482, Remove the dead helper CreateOnesLike (private Tensor<T> CreateOnesLike(Tensor<T> template)) from the file since ComputeSiLUDerivative now uses scalar/tensor ops and there are no callers; delete the entire method (including the local ones variable and Fill call) and ensure no other code references CreateOnesLike remain.
♻️ Duplicate comments (6)
src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)
386-388:⚠️ Potential issue | 🔴 CriticalInitialize
stateand thet=0snapshot after renting.Line 446 reads
statebefore any write, and Lines 472-477 only saveallStatesatt + 1. With pooled buffers, both the recurrence base state and_lastStates[..., 0, ...]become stale.🔧 Suggested fix
var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension }); var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension }); var selectionWeightsCache = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _numHeads, _headDimension }); + state.Fill(NumOps.Zero); + for (int bi = 0; bi < batchSize; bi++) + for (int hi = 0; hi < _numHeads; hi++) + for (int di = 0; di < _headDimension; di++) + for (int ki = 0; ki < _headDimension; ki++) + allStates[new[] { bi, 0, hi, di, ki }] = NumOps.Zero;Also applies to: 446-477
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 386 - 388, The rented buffers state, allStates and selectionWeightsCache are used before being initialized; after calling TensorAllocator.Rent<T>(...) you must initialize state (e.g. zero or the proper initial recurrence state) and immediately write the t=0 snapshot into allStates at index 0 so _lastStates/.../0/... isn't left stale; update the code around TensorAllocator.Rent usage (symbols: state, allStates, selectionWeightsCache, TensorAllocator.Rent) to set state to the correct initial values and copy/assign that state into allStates[..., 0, ...] before the time-loop and before any reads.src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs (1)
296-327:⚠️ Potential issue | 🔴 CriticalZero-initialize
expertStatesbefore the scan.Lines 314-327 read
expertStates[..., t, si], but after Line 296 switched that buffer toTensorAllocator.Rent<T>(...)many expert/timestep cells no longer start at zero. Newly activated experts can now pick up stale pooled state.🔧 Suggested fix
- var expertStates = TensorAllocator.Rent<T>(new[] { batchSize, _numExperts, seqLen + 1, _stateDimension }); + var expertStates = TensorAllocator.Rent<T>(new[] { batchSize, _numExperts, seqLen + 1, _stateDimension }); + expertStates.Fill(NumOps.Zero);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs` around lines 296 - 327, After renting the buffer with TensorAllocator.Rent<T> for expertStates in MixtureOfMambaLayer, zero-initialize the entire tensor before the scan loop to avoid stale pooled values being read; e.g., immediately after expertStates = TensorAllocator.Rent<T>(...) clear all entries to NumOps.Zero (via a provided TensorAllocator.Clear/Fill utility or a tight nested loop over batchSize, _numExperts, seqLen+1, _stateDimension) so reads like expertStates[new[] { bi, expertIdx, t, si }] see zeros for uninitialized experts.src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs (1)
324-326:⚠️ Potential issue | 🔴 CriticalZero
stateand seed thet=0snapshot.Line 351 reads
statebefore any write, and Lines 376-381 only populateallStatesatt + 1. On a rented buffer that makes both the recurrence base state and_lastStates[..., 0, ...]nondeterministic.🔧 Suggested fix
var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension }); var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension }); + state.Fill(NumOps.Zero); + for (int bi = 0; bi < batchSize; bi++) + for (int hi = 0; hi < _numHeads; hi++) + for (int di = 0; di < _headDimension; di++) + for (int vi = 0; vi < _headDimension; vi++) + allStates[new[] { bi, 0, hi, di, vi }] = NumOps.Zero;Also applies to: 351-381
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs` around lines 324 - 326, The rented buffer "state" and the snapshot array "allStates" must be initialized to deterministic values and the t=0 snapshot seeded so subsequent reads (e.g., where "state" is read before being written and where "_lastStates[..., 0, ...]" is expected) are well-defined; modify the allocation code that uses TensorAllocator.Rent<T>(...) to clear or fill "state" with zeros (or default(T)) immediately after rental and copy that initial zeroed "state" into allStates at index t=0 (and into _lastStates slice for t=0) before entering the recurrence loop so the recurrence base and stored _lastStates[..., 0, ...] are deterministic. Ensure you reference the variables "state", "allStates", "_lastStates", and the use of TensorAllocator.Rent<T>" when making the change.src/Memory/ForwardArena.cs (1)
136-136:⚠️ Potential issue | 🟠 MajorKeep
ShapeKeyoff the public surface.This is allocator plumbing, not a facade-facing type. Leaving it public makes low-level memory internals part of the supported API for no consumer benefit.
🔒 Suggested change
-public readonly struct ShapeKey : IEquatable<ShapeKey> +internal readonly struct ShapeKey : IEquatable<ShapeKey>As per coding guidelines, "Users should ONLY interact with
AiModelBuilder.csandAiModelResult.cs" and "Helper classes, utilities, and implementation details should beinternal."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/Memory/ForwardArena.cs` at line 136, The ShapeKey type is exposed publicly but is an allocator/internal implementation detail; change its accessibility to internal by updating the declaration of ShapeKey (the struct named ShapeKey in ForwardArena) from public to internal (i.e., make it "internal readonly struct ShapeKey : IEquatable<ShapeKey>") and then rebuild and fix any compilation issues (if external tests or other assemblies need access, use InternalsVisibleTo rather than making ShapeKey public); ensure no public API signatures rely on ShapeKey so the public surface stays unchanged.src/NeuralNetworks/Layers/SSM/RetNetLayer.cs (1)
381-382:⚠️ Potential issue | 🟠 MajorZero the causal-mask gap in
retentionScores.Only the
j <= iregion is written below. With a rented tensor, the cachedj > ientries keep stale values, and_lastRetentionScoresthen exposes that garbage to backward passes.🧹 Suggested fix
var retentionScores = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen }); + for (int bi = 0; bi < batchSize; bi++) + for (int h = 0; h < _numHeads; h++) + for (int i = 0; i < seqLen; i++) + for (int j = i + 1; j < seqLen; j++) + retentionScores[new[] { bi, h, i, j }] = NumOps.Zero; _lastRetentionScores = retentionScores;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RetNetLayer.cs` around lines 381 - 382, The rented tensor retentionScores (from TensorAllocator.Rent<T>(...)) contains stale data in entries where j>i because only the causal lower triangle is written later; zero out the causal-mask gap before storing it in _lastRetentionScores. Locate the allocation around retentionScores and either initialize the full tensor to zero or explicitly zero the strict upper-triangle (indices where j>i for dimensions batchSize, _numHeads, seqLen, seqLen) so _lastRetentionScores does not expose stale values to backward passes.src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)
673-673:⚠️ Potential issue | 🟠 Major
input.Shape.ToArray()still allocates inside per-token normalization.
ApplyGroupNorm()runs once per token inTimeMixingForward(), andGroupNormBackward()repeats the same pattern during training. The tensor bodies come fromTensorAllocator, but these fresh shape arrays keep the path from being allocation-free. Pass a caller-owned output/workspace buffer instead of renting frominput.Shape.ToArray()inside the methods.Also applies to: 926-926
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` at line 673, ApplyGroupNorm and GroupNormBackward allocate a fresh shape array per token via TensorAllocator.Rent<T>(input.Shape.ToArray()) called inside TimeMixingForward; change these methods to accept a caller-owned output/workspace Tensor (or Span) parameter so the caller (TimeMixingForward) rents a buffer once and passes it in, eliminating the input.Shape.ToArray() allocation. Update the signatures of ApplyGroupNorm and GroupNormBackward to take the output/workspace buffer and modify their call sites in TimeMixingForward (and the training path) to reuse the pre-rented buffer instead of calling Rent with input.Shape.ToArray(); ensure both occurrences (the forward call and the backward call) are updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Memory/ForwardArena.cs`:
- Around line 141-149: The ShapeKey constructor always clones the input array
causing allocations on every lookup (used by Rent() and RentUninitialized());
change the design so lookup keys don't clone: introduce a second constructor or
a static factory (e.g., ShapeKey.FromLookup(int[] shape) or ShapeKey(bool copy,
int[] shape)) and use it in Rent()/RentUninitialized() to create a
non-allocating key, while keeping the existing cloning behavior only when
creating a key that will be stored in the dictionary (e.g., when inserting a
brand-new slab); update dictionary insert paths to clone into the stored
ShapeKey and keep lookup paths non-copying to avoid per-lookup allocations.
In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs`:
- Around line 346-347: The rented buffers state and allStates are uninitialized
and must be zeroed to avoid corrupting the first S*k update and the saved
history; after calling TensorAllocator.Rent<T>(...) for state and allStates in
GatedDeltaNetLayer (the two vars named state and allStates), immediately
clear/zero their contents (e.g., set all elements to default(T)/0) before any
reads or before the S * k path uses state and before recording allStates[..., 0,
...]; ensure you use the tensor/allocator API to zero the entire buffers so the
initial fast-weight matrix and initial snapshot are truly zeros.
In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs`:
- Around line 437-438: The rented recurrent buffers 'state' and 'allStates' must
be zero-initialized immediately after allocation to avoid reading stale pooled
memory (they serve as S_{t-1} on first step and allStates[...,0,...] as the
cached initial state); after calling TensorAllocator.Rent<T>(...) for these
variables in GatedDeltaProductLayer.cs, set their contents to zero (either by
using a zeroing allocator API like RentZeroed<T> if available or by calling the
tensor clear/fill-to-zero method on 'state' and 'allStates') so the recurrence
starts from a guaranteed zero state.
In `@src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs`:
- Around line 283-288: The rented tensors returned by TensorAllocator.Rent<T>
(state, allStates, _lastKVGate, _lastKVGateRaw) must be zero-initialized because
state is read on the first timestep and allStates[:,0,...] holds the cached S0;
after switching from new Tensor<T> they may contain stale pool data. Fix by
immediately zeroing each rented tensor right after the Rent<T> call (e.g., call
the tensor's clear/zero-fill method or otherwise fill with default(T)) so
forward outputs and cached values are deterministic before any reads in the
methods that use state, allStates, _lastKVGate, and _lastKVGateRaw.
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 785-789: The initialization of x_prev in RWKV7Block.cs when t == 0
must be zero-initialized to avoid gradients depending on stale pooled data;
replace the TensorAllocator.Rent<T>(...) call for x_prev with a guaranteed-zero
allocation (e.g., new Tensor<T>(new[] { batchSize, _modelDimension }) or call a
clear/zero method on the rented tensor) so the first-token path that reads
x_prev sees zeros; update the x_prev allocation in the t == 0 branch (symbol:
x_prev, TensorAllocator.Rent<T>, Tensor<T>, t == 0) accordingly and ensure any
rented tensor is zeroed before use.
In `@src/NeuralNetworks/Layers/SSM/S6Scan.cs`:
- Line 243: The rented accumulators dh, d_B_t, and d_C_t are left uninitialized
and get stale data from the pool; before the first Add in the backward pass you
must zero them after renting (or use a rent-zero helper) so accumulation is
correct. Locate the Rent<T> calls that create dh (and the similar rents for
d_B_t and d_C_t around the referenced areas) and immediately clear their buffers
(e.g., fill with zeros or call an allocator RentZeroed equivalent) before any
Add/accumulate operations in the backward methods so the gradients are pure sums
of current contributions.
- Around line 101-103: Validate that initialState (when provided) has shape
[batch, innerDim, stateDim] before cloning; if initialState is null or has
incorrect shape, initialize h to a correctly shaped zero/neutral tensor of shape
determined from batch/innerDim/stateDim. Always copy h into the first time-slice
of allHiddenStates (allHiddenStates[:,0,...]) unconditionally so the rented
buffer cannot retain stale pool data, and ensure the same validation/assignment
logic is applied to the alternate path handled near the usage of h (the code
around creation/copy into h and the subsequent writes to allHiddenStates).
- Around line 212-213: The recomputation seed uses a fresh buffer h_recomp
instead of the recorded initial state, so change the allocation/assignment in
SequentialScanForward(): allocate or rent h_recomp with the same shape but then
copy/clone the recorded initial hidden state (hiddenStates at time index 0 /
hiddenStates[:,0,...]) into h_recomp before assigning recomputedStates[0];
ensure shapes match and preserve device/dtype semantics when cloning to avoid
computing gradients from a zeroed initial state.
---
Outside diff comments:
In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs`:
- Around line 465-482: Remove the dead helper CreateOnesLike (private Tensor<T>
CreateOnesLike(Tensor<T> template)) from the file since ComputeSiLUDerivative
now uses scalar/tensor ops and there are no callers; delete the entire method
(including the local ones variable and Fill call) and ensure no other code
references CreateOnesLike remain.
In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs`:
- Around line 607-621: Remove the unused helper CreateOnesLike from the file to
eliminate dead code introduced during the SiLU refactor: delete the private
method CreateOnesLike(Tensor<T> template) and any references (none expected) so
only ComputeSiLUDerivative and active helpers remain; ensure no compilation or
style issues remain after removal and run tests/build to verify.
In `@src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs`:
- Around line 442-458: Remove the dead helper CreateOnesLike by deleting the
private method CreateOnesLike(Tensor<T> template) and any references to it; the
ComputeSiLUDerivative method now uses scalar ops and no longer needs this
helper, so ensure there are no remaining calls to CreateOnesLike and run
tests/compile to confirm cleanup.
In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs`:
- Around line 388-404: Remove the now-unused CreateOnesLike helper method:
delete the private Tensor<T> CreateOnesLike(Tensor<T> template) method (the
unused symbol CreateOnesLike) from the file so no dead/unreachable code remains
after the scalar-based SiLU rewrite in ComputeSiLUDerivative; also search for
and remove any remaining references to CreateOnesLike to ensure no callers
remain.
In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 485-499: Remove the dead CreateOnesLike helper: delete the private
method CreateOnesLike(Tensor<T> template) since ComputeSiLUDerivative now
computes SiLU without it; search for any remaining references to CreateOnesLike
and remove or replace them, and run build/tests to ensure no other code depended
on it (also remove any now-unused imports or fields that existed solely for that
helper).
In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 339-366: The forward path still allocates inside
LightningAttentionForward (Forward calls new Tensor<T> for state and allStates),
so change LightningAttentionForward to avoid internal allocations by adding
parameters (e.g., out or preallocated Tensor parameters like stateOut,
allStatesOut) or accept rented buffers and write into them; update Forward to
rent those tensors via TensorAllocator.Rent<T>(...) and pass them into
LightningAttentionForward (or provide a reusable per-batch buffer on the layer)
so no new Tensor<T>() is created inside LightningAttentionForward; ensure
_last... fields are set to the rented/result tensors and release/return buffers
consistently.
---
Duplicate comments:
In `@src/Memory/ForwardArena.cs`:
- Line 136: The ShapeKey type is exposed publicly but is an allocator/internal
implementation detail; change its accessibility to internal by updating the
declaration of ShapeKey (the struct named ShapeKey in ForwardArena) from public
to internal (i.e., make it "internal readonly struct ShapeKey :
IEquatable<ShapeKey>") and then rebuild and fix any compilation issues (if
external tests or other assemblies need access, use InternalsVisibleTo rather
than making ShapeKey public); ensure no public API signatures rely on ShapeKey
so the public surface stays unchanged.
In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs`:
- Around line 324-326: The rented buffer "state" and the snapshot array
"allStates" must be initialized to deterministic values and the t=0 snapshot
seeded so subsequent reads (e.g., where "state" is read before being written and
where "_lastStates[..., 0, ...]" is expected) are well-defined; modify the
allocation code that uses TensorAllocator.Rent<T>(...) to clear or fill "state"
with zeros (or default(T)) immediately after rental and copy that initial zeroed
"state" into allStates at index t=0 (and into _lastStates slice for t=0) before
entering the recurrence loop so the recurrence base and stored _lastStates[...,
0, ...] are deterministic. Ensure you reference the variables "state",
"allStates", "_lastStates", and the use of TensorAllocator.Rent<T>" when making
the change.
In `@src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs`:
- Around line 296-327: After renting the buffer with TensorAllocator.Rent<T> for
expertStates in MixtureOfMambaLayer, zero-initialize the entire tensor before
the scan loop to avoid stale pooled values being read; e.g., immediately after
expertStates = TensorAllocator.Rent<T>(...) clear all entries to NumOps.Zero
(via a provided TensorAllocator.Clear/Fill utility or a tight nested loop over
batchSize, _numExperts, seqLen+1, _stateDimension) so reads like
expertStates[new[] { bi, expertIdx, t, si }] see zeros for uninitialized
experts.
In `@src/NeuralNetworks/Layers/SSM/RetNetLayer.cs`:
- Around line 381-382: The rented tensor retentionScores (from
TensorAllocator.Rent<T>(...)) contains stale data in entries where j>i because
only the causal lower triangle is written later; zero out the causal-mask gap
before storing it in _lastRetentionScores. Locate the allocation around
retentionScores and either initialize the full tensor to zero or explicitly zero
the strict upper-triangle (indices where j>i for dimensions batchSize,
_numHeads, seqLen, seqLen) so _lastRetentionScores does not expose stale values
to backward passes.
In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 386-388: The rented buffers state, allStates and
selectionWeightsCache are used before being initialized; after calling
TensorAllocator.Rent<T>(...) you must initialize state (e.g. zero or the proper
initial recurrence state) and immediately write the t=0 snapshot into allStates
at index 0 so _lastStates/.../0/... isn't left stale; update the code around
TensorAllocator.Rent usage (symbols: state, allStates, selectionWeightsCache,
TensorAllocator.Rent) to set state to the correct initial values and copy/assign
that state into allStates[..., 0, ...] before the time-loop and before any
reads.
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Line 673: ApplyGroupNorm and GroupNormBackward allocate a fresh shape array
per token via TensorAllocator.Rent<T>(input.Shape.ToArray()) called inside
TimeMixingForward; change these methods to accept a caller-owned
output/workspace Tensor (or Span) parameter so the caller (TimeMixingForward)
rents a buffer once and passes it in, eliminating the input.Shape.ToArray()
allocation. Update the signatures of ApplyGroupNorm and GroupNormBackward to
take the output/workspace buffer and modify their call sites in
TimeMixingForward (and the training path) to reuse the pre-rented buffer instead
of calling Rent with input.Shape.ToArray(); ensure both occurrences (the forward
call and the backward call) are updated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d8e06a5a-7942-4786-8e52-a90541475a01
📒 Files selected for processing (22)
src/Memory/ForwardArena.cssrc/NeuralNetworks/Layers/DiffusionConvLayer.cssrc/NeuralNetworks/Layers/SSM/ABCLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cssrc/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cssrc/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/HGRN2Layer.cssrc/NeuralNetworks/Layers/SSM/HedgehogLayer.cssrc/NeuralNetworks/Layers/SSM/HyenaLayer.cssrc/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/MesaNetLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cssrc/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cssrc/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cssrc/NeuralNetworks/Layers/SSM/RWKV7Block.cssrc/NeuralNetworks/Layers/SSM/RebasedLayer.cssrc/NeuralNetworks/Layers/SSM/RetNetLayer.cssrc/NeuralNetworks/Layers/SSM/RodimusLayer.cssrc/NeuralNetworks/Layers/SSM/S6Scan.cssrc/NeuralNetworks/Layers/SSM/TTTLayer.cssrc/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs
- ShapeKey: only defensive-copy in EnsureCapacity (storage), not in Rent (hot-path lookup). Eliminates per-Rent allocation from Clone(). - GatedDeltaNetLayer, GatedDeltaProductLayer, KimiLinearAttentionLayer: reverted state/allStates buffer allocations back to new Tensor<T>() for guaranteed zero initialization of recurrent state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 80 out of 80 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Cache intermediate values for backward — zero allocation via workspace | ||
| var allR = Ws.Sequence(SqAllR); | ||
| var allK = Ws.Sequence(SqAllK); | ||
| var allV = Ws.Sequence(SqAllV); | ||
| var allA = Ws.Sequence(SqAllA); | ||
| var allB = Ws.Sequence(SqAllB); | ||
| var allWkv = Ws.Sequence(SqAllWkv); | ||
| var allWkvPreGate = Ws.Sequence(SqAllWkvPre); | ||
| var allWkvGated = Ws.Sequence(SqAllWkvGated); |
There was a problem hiding this comment.
allWkvPreGate is declared from Ws.Sequence(SqAllWkvPre) but never used. With TreatWarningsAsErrors enabled, this will fail the build (unused local). Either remove the local (and consider removing the SqAllWkvPre workspace declaration if it's truly unused) or actually populate/use this buffer as intended (e.g., cache pre-gate or pre-norm WKV values).
…terface segregation Resolved 3 merge conflicts between PR #1079 (hardcoded double conversion) and master (interface segregation #1077, arena allocator #1083). Kept master's InterfaceGuard/ParamModel caching and PR's SIMD-accelerated operations. Fixed pre-existing NumOps.Tanh(T) scalar calls that don't exist — Tanh on NumOps is span-only. Added ScalarTanh/ScalarSigmoid to MetaLearnerBase using NumOps.Exp primitives (matching IcaDecomposition pattern). Fixed NTMAlgorithm inner classes with same formula. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…terface segregation Resolved 3 merge conflicts between PR #1079 (hardcoded double conversion) and master (interface segregation #1077, arena allocator #1083). Kept master's InterfaceGuard/ParamModel caching and PR's SIMD-accelerated operations. Fixed NumOps.Tanh(T) calls — Tanh on NumOps is span-only. Added ScalarTanh and ScalarSigmoid to MetaLearnerBase using NumOps.Exp primitives. Fixed NTMAlgorithm inner classes with same formula. Replaced manual Softmax with Engine.Softmax. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Introduces per-layer arena allocator (LayerWorkspace) and converts 300+ tensor allocations
to pooled/zero-allocation patterns across 50+ neural network layers.
Closes #1014
Benchmark Results
Changes
Infrastructure:
RWKV7Block (highest impact):
50+ Layers (systematic migration):
Test plan
Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Performance