
perf: LayerWorkspace arena allocator for zero-allocation forward passes (#1014)#1083

Merged
ooples merged 10 commits into master from perf/tensor-allocator-rent-1014 on Apr 5, 2026

Conversation

@ooples
Owner

@ooples ooples commented Apr 4, 2026

Summary

Introduces a per-layer arena allocator (LayerWorkspace) and converts 300+ tensor allocations
to pooled or zero-allocation patterns across 50+ neural network layers.

Closes #1014

Benchmark Results

| Method           | Mean      | Allocated | vs Baseline |
|------------------|-----------|-----------|-------------|
| new Tensor (raw) | 46,509 ns | 199 KB    | 1.0x        |
| LayerWorkspace   | 200 ns    | 0 B       | 232x faster |
| Single Workspace | 2.2 ns    | 0 B       | 77x faster  |

Changes

Infrastructure:

  • ForwardArena: bump-pointer allocator with shape-keyed slabs (arbitrary rank support)
  • LayerWorkspace: index-based pre-allocated tensor array (O(1) access, 2.2ns)
  • ShapeKey: value-type FNV-1a hash struct supporting any tensor rank
  • LayerBase: added Workspace property for opt-in arena usage
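The core idea behind ForwardArena's shape-keyed slabs can be sketched in a few lines. Below is a language-agnostic Python illustration of the bump-pointer pattern; the class and method names here (ShapeSlabArena, rent, reset) are invented for the sketch, and the real ForwardArena<T> is C# and differs in detail:

```python
# Illustrative sketch only: pre-allocate buffers grouped by shape, hand them
# out by cursor bump (O(1), no allocation on reuse), reset cursors per pass.

class ShapeSlabArena:
    def __init__(self):
        self._slabs = {}    # shape tuple -> list of preallocated buffers
        self._cursors = {}  # shape tuple -> next free index in that slab

    def rent(self, shape):
        key = tuple(shape)
        slab = self._slabs.setdefault(key, [])
        cursor = self._cursors.get(key, 0)
        if cursor == len(slab):          # grow path: only hit on first pass
            size = 1
            for d in key:
                size *= d
            slab.append([0.0] * size)
        self._cursors[key] = cursor + 1  # bump-pointer: just an index increment
        return slab[cursor]

    def reset(self):
        # End of forward pass: every buffer becomes reusable again.
        for key in self._cursors:
            self._cursors[key] = 0

arena = ShapeSlabArena()
a = arena.rent([2, 3])   # first pass: allocates a new buffer
arena.reset()
b = arena.rent([2, 3])   # second pass: reuses the same buffer
print(a is b)            # True: steady state performs zero allocations
```

The steady-state path is a dictionary lookup plus an integer increment, which is why the workspace variant (which replaces even the dictionary lookup with a fixed array index) gets down to nanoseconds.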

RWKV7Block (highest impact):

  • 20 hot-path forward allocations -> LayerWorkspace (zero allocation in timestep loops)
  • BeginForward called once in main Forward for both TimeMixing and ChannelMixing

50+ Layers (systematic migration):

  • 300+ per-forward raw allocations -> TensorAllocator.Rent (pooled)
  • 18 ones-tensor allocations -> Engine.ScalarMinusTensor (eliminated entirely)
  • 21 recurrent state buffers reverted to new Tensor for guaranteed zero-init
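The last bullet deserves emphasis: pooled buffers are not guaranteed to be zeroed, so a recurrence that reads state before writing it cannot rent blindly. A minimal Python sketch of the failure mode (the pool here is hypothetical, not TensorAllocator's actual implementation):

```python
# Illustrative sketch: a naive pool returns the previous tenant's data,
# which silently corrupts any recurrence that assumes a zeroed initial state.

_pool = []

def rent(n):
    return _pool.pop() if _pool else [0.0] * n

def return_buf(buf):
    _pool.append(buf)  # note: NOT cleared on return

# First forward pass writes state, then returns the buffer to the pool.
state = rent(4)
for t in range(4):
    state[t] = float(t + 1)
return_buf(state)

# Second pass: the rented buffer still holds stale state from the first pass.
dirty = rent(4)
stale_contents = list(dirty)
print(stale_contents)           # [1.0, 2.0, 3.0, 4.0] -- not zeros!

# Restoring determinism requires an explicit clear after renting, which is
# why the PR instead keeps guaranteed-zero-init allocation for these
# 21 recurrent state buffers.
for i in range(len(dirty)):
    dirty[i] = 0.0
```

This is the same class of bug several of the review comments below flag in individual layers (reading recurrent state at t=0 from a rented buffer).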

Test plan

  • `dotnet build src/AiDotNet.csproj --framework net10.0 -c Release`: 0 errors
  • BenchmarkDotNet confirms zero bytes allocated in workspace path
  • All 25 review comments addressed (zero-init safety, ShapeKey rank, validation)

Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added pooled arena/workspace memory support and microbenchmarks comparing allocation patterns.
  • Refactor

    • Many layers switched to pooled/workspace-backed tensor allocation to enable reuse and lower per-call heap churn.
  • Performance

    • Reduced transient heap allocations across hot paths, improving memory reuse with expected lower latency and better throughput.

ooples and others added 4 commits April 4, 2026 17:45
… passes

Step 1 of #1014: Core arena infrastructure.

- ForwardArena<T>: bump-pointer allocator that pre-allocates Tensor<T> objects
  grouped by shape, dishes them out via array index increment (O(1), zero
  syscalls), and resets cursors at end of forward pass. Zero GC pressure.
- LayerWorkspace<T>: per-layer facade with named buffer declarations
  (DeclareTimestepBuffer/DeclareSequenceBuffer) and shape-aware pre-sizing.
  Handles variable batch sizes and sequence lengths.
- ShapeKey: value-type struct with FNV-1a hash for O(1) dictionary lookups.
- LayerBase<T>: added protected Workspace property for opt-in arena usage.

Next: migrate RWKV7Block (84 allocations -> 0) as proof of pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ocation

Benchmark results (RWKV7Block allocation pattern, batch=1, seqLen=32):

| Method              | Mean       | Allocated | vs Baseline |
|---------------------|------------|-----------|-------------|
| new Tensor (raw)    | 46,509 ns  | 199 KB    | 1.0x        |
| ForwardArena.Rent   | 107,854 ns | 0 B       | 0.4x        |
| LayerWorkspace      | 200 ns     | 0 B       | 232x faster |
| Single new Tensor   | 171 ns     | 584 B     |             |
| Single Arena.Rent   | 453 ns     | 0 B       |             |
| Single Workspace    | 2.2 ns     | 0 B       | 77x faster  |

LayerWorkspace achieves zero allocation via index-based pre-allocated
tensor array lookup (2.2ns per access). ForwardArena is slower than
raw allocation due to Span.Clear() overhead — workspace is the
production target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…inated)

TimeMixingForward hot path:
- 8 sequence buffers (allR, allK, allV, allA, allB, allWkv, allWkvPreGate,
  allWkvGated) now pre-allocated via Workspace.Sequence()
- 5 per-timestep buffers (rInput, kInput, vInput, aInput, bInput) now
  pre-allocated via Workspace.Timestep() — reused across seqLen iterations

Index-based buffer access: const int indices for O(1) array lookup (2.2ns).

71 allocations remain in deeper methods — continuing migration.

Part of #1014.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All forward-path hot loop allocations eliminated:
- TimeMixingForward: 8 sequence buffers + 6 timestep buffers = 14 via workspace
- ChannelMixingForward: 4 sequence buffers + 3 timestep buffers = 7 via workspace
- Total: 20 forward-path allocations → 0 (84 → 64 remaining, all cold-path/backward)

Remaining 64 allocations are:
- 24 constructor parameter init (cold path, one-time)
- 24 gradient accumulator init (cold path, one-time)
- 2 state fallbacks (conditional, first-call only)
- 14 backward pass (training only)

Production-safe: uses a Ws property with a null-check throw instead of the
null-forgiving operator. No `!` operators used.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 4, 2026 23:31
@vercel

vercel bot commented Apr 4, 2026

Deployment failed with the following error:

Resource is limited - try again in 24 hours (more than 100, code: "api-deployments-free-per-day").

Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit

@vercel

vercel bot commented Apr 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
aidotnet_website Ignored Ignored Preview Apr 5, 2026 5:09am
aidotnet-playground-api Ignored Ignored Preview Apr 5, 2026 5:09am

@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a bump‑pointer arena (ForwardArena<T>) and a per‑layer workspace (LayerWorkspace<T>), integrates workspace support into layers (notably RWKV7Block), bulk-migrates many temporary new Tensor<T>(...) allocations to TensorAllocator.Rent<T>(...), and adds BenchmarkDotNet tests comparing allocation strategies.

Changes

Cohort / File(s) Summary
Memory infra
src/Memory/ForwardArena.cs, src/Memory/LayerWorkspace.cs
Add ForwardArena<T> (shape-keyed slabs, Rent/RentUninitialized/GrowAndRent/EnsureCapacity/Reset, capacity metrics) and LayerWorkspace<T> (DeclareTimestep/DeclareSequence, BeginForward to (re)allocate timestep/sequence tensors, O(1) accessors, TotalPreAllocated). Review: growth policy, zeroing/clear semantics, concurrency, and exception-safety. BLOCKING: clarify zero-init vs uninitialized contract and return lifecycle.
ShapeKey value-type
src/Memory/...
Add ShapeKey readonly struct for dictionary keys (deep-clones shape array, cached hash, elementwise equality). Review: hash correctness across ranks and mutation-safety of stored arrays.
Layer integration
src/NeuralNetworks/Layers/LayerBase.cs, src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
Add protected LayerWorkspace<T>? Workspace to LayerBase<T>; create/use per-block LayerWorkspace in RWKV7Block (declare indices, BeginForward, swap many per-forward allocations to workspace tensors). Review: index mappings, BeginForward lifecycle, null-fallback behavior, thread-safety. BLOCKING: document Workspace lifecycle and concurrency expectations.
Bulk allocator adoption
src/NeuralNetworks/Layers/... (many files listed in raw summary)
Replace numerous new Tensor<T>(...) temporaries with TensorAllocator.Rent<T>(...) across many layers (SSM family, attention, RNNs, convs, utilities). Review: allocator contract (zeroed vs uninitialized), ownership/return semantics, potential leaks, and assumptions that buffers start zeroed. BLOCKING if allocator semantics differ from callers' expectations.
Benchmarks
AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs
Add BenchmarkDotNet class comparing raw allocations, arena rents, and workspace lookups plus microbenchmarks; introduces InProcessConfig and MemoryDiagnoser. Review: benchmark isolation, correct warmup/config, and prevention of dead-code elimination.
Minor edits
src/Diffusion/Audio/ShortTimeFourierTransform.cs, many files with using AiDotNet.Memory; and BOM removals
Small allocation switch in inverse STFT and many using/BOM tweaks. Review: ensure no unintended public API changes or build issues.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Layer as RWKV7Block
    participant Ws as LayerWorkspace
    participant Arena as ForwardArena
    participant Alloc as TensorAllocator

    App->>Layer: Forward(batchSize, seqLen)
    Layer->>Ws: BeginForward(batchSize, seqLen)
    Ws->>Arena: EnsureCapacity / Rent shapes (optional)
    Arena-->>Ws: Tensors populated in workspace slots
    Ws-->>Layer: Workspace tensors ready

    loop per-timestep
        Layer->>Ws: Timestep(index) / Sequence(index)
        Ws-->>Layer: Tensor<T> (preallocated slot)
        Layer->>Layer: compute using buffer
    end

    alt fallback path
        Layer->>Alloc: Rent(shape)
        Alloc-->>Layer: Tensor<T> (rented)
    end

    Layer-->>App: Forward result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

feature

✨ An arena wakes, the tensors sing,
Buffers aligned for forward-ing,
Rent and reuse, no GC cry,
Benchmarks hum — allocations fly! 🪄

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 73.33%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: introducing the LayerWorkspace arena allocator for zero-allocation forward passes, addressing issue #1014. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses all Phase 1 objectives from #1014: replaces 300+ new Tensor allocations with TensorAllocator.Rent, eliminates 18 ones-tensor allocations via Engine.ScalarMinusTensor, and introduces ForwardArena/LayerWorkspace infrastructure for zero-allocation forward passes. |
| Out of Scope Changes check | ✅ Passed | All changes align with Phase 1 objectives: arena allocator infrastructure, layer-wide Rent conversions, ones-tensor optimizations, and RWKV7Block workspace integration. Minor BOM/whitespace changes are inconsequential. No unrelated features or refactoring detected. |



Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a per-layer “arena/workspace” mechanism intended to eliminate per-forward-pass Tensor<T> allocations in neural-network layer hot paths, and migrates RWKV7Block forward intermediates to use it. It also adds a BenchmarkDotNet benchmark to measure allocation/lookup overhead of the new strategies.

Changes:

  • Add LayerWorkspace<T> (index-based preallocated buffers) and ForwardArena<T> (shape-keyed slab allocator) under AiDotNet.Memory.
  • Add LayerBase<T>.Workspace to support opt-in per-layer workspace usage.
  • Update RWKV7Block to reuse workspace buffers for forward intermediates and add an allocation benchmark.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs Uses LayerWorkspace<T> to eliminate forward-path intermediate tensor allocations.
src/NeuralNetworks/Layers/LayerBase.cs Adds a protected Workspace property for layers to opt into workspace allocation.
src/Memory/LayerWorkspace.cs New index-based per-layer buffer store for timestep/sequence tensors.
src/Memory/ForwardArena.cs New arena allocator with shape-keyed slabs and cursors.
AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs Adds benchmarks comparing raw allocation vs arena vs workspace lookup.
Comments suppressed due to low confidence (1)

src/NeuralNetworks/Layers/LayerBase.cs:61

  • The XML doc comment here has two opening tags in a row with only one closing tag, which can produce malformed XML doc warnings (and this repo treats warnings as errors). Remove the extra /// <summary> so the comment is well-formed.
    protected LayerWorkspace<T>? Workspace { get; set; }

    /// <summary>
    /// <summary>
    /// Gets the element-wise activation function for this layer, if specified.
    /// </summary>


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs`:
- Around line 119-145: The benchmark Workspace_RWKV7Pattern currently omits the
workspace shape-init call; call _workspace.BeginForward with the correct batch
and sequence lengths inside the measured method so the measured path includes
the fast-path shape check (e.g. call _workspace.BeginForward(BatchSize,
TimestepsPerForward) at the start of Workspace_RWKV7Pattern before any
Sequence/Timestep calls), ensuring BeginForward() is executed on every benchmark
iteration and not only in setup.

In `@src/Memory/ForwardArena.cs`:
- Around line 131-168: The ShapeKey currently only encodes rank and up to four
dimensions, causing different rank>=5 shapes to collide; update ShapeKey
(constructor, GetHashCode, and Equals) to either incorporate all shape elements
into the stored state and FNV-1a hash (e.g., process shape[i] for i from
0..shape.Length-1) and compare all elements in Equals, or explicitly guard by
throwing an exception in the ShapeKey ctor when shape.Length > 4 to fail closed;
ensure the chosen approach updates both the hashing logic (_hash) and equality
logic (Equals(ShapeKey) and Equals(object)) consistently so ShapeKey uniquely
represents the entire shape.
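Folding every dimension into the hash, as the comment asks, is straightforward. The following is a hedged Python sketch of full-rank FNV-1a shape hashing using the standard 32-bit FNV-1a constants; the actual ShapeKey is a C# struct, so only the algorithm carries over:

```python
# Illustrative sketch: hash the rank plus EVERY dimension so that shapes of
# any rank (including rank >= 5) cannot collide merely by truncation.

FNV_OFFSET = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # standard 32-bit FNV-1a prime

def shape_hash(shape):
    h = FNV_OFFSET
    h = ((h ^ len(shape)) * FNV_PRIME) & 0xFFFFFFFF  # mix in the rank...
    for dim in shape:                                # ...then ALL dims, not just 4
        for shift in (0, 8, 16, 24):                 # byte-wise, per FNV-1a
            h = ((h ^ ((dim >> shift) & 0xFF)) * FNV_PRIME) & 0xFFFFFFFF
    return h

def shapes_equal(a, b):
    # Equality must also compare every element, mirroring the hash.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

# Two rank-5 shapes differing only in the fifth dimension must hash apart:
s1 = (2, 3, 4, 5, 6)
s2 = (2, 3, 4, 5, 7)
print(shape_hash(s1) != shape_hash(s2))   # True
print(shapes_equal(s1, s2))               # False
```

Keeping the hash and the equality check element-for-element consistent is the key invariant; the alternative fail-closed option (throwing on rank > 4 in the constructor) trades generality for a smaller change.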

In `@src/Memory/LayerWorkspace.cs`:
- Around line 32-34: The BeginForward method currently trusts batch/sequence
sizes via the fields _lastBatchSize/_lastSeqLen and can silently skip allocation
on first call (e.g., BeginForward(0,0)); modify BeginForward to validate input
parameters (batchSize and seqLen) up-front and either reject zero or negative
sizes by throwing ArgumentOutOfRangeException, or initialize sentinel cached
sizes so the first non-zero call forces allocation; also update any slot
accessor methods (e.g., GetSlot/SetSlot or indexers used for workspace slots) to
validate slot indices and throw ArgumentOutOfRangeException instead of letting
raw array indexing propagate IndexOutOfRangeException; ensure all uses of
_lastBatchSize/_lastSeqLen follow the new validation logic so the fast path
cannot skip required allocation.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 314-341: RWKV7Block<T> currently inherits LayerBase<T>.Clone()
which does a shallow MemberwiseClone(), causing the cloned instance to share the
same Workspace buffers; override Clone() in RWKV7Block<T> to call
MemberwiseClone() but then re-create a fresh LayerWorkspace<T> and re-run the
workspace declarations (the same sequence of Workspace = new
LayerWorkspace<T>(...) and Workspace.DeclareTimestep/DeclareSequence calls) so
the clone has independent buffers; factor the workspace declaration lines into a
private helper method (e.g., InitializeWorkspace()) and call it from both the
constructor and your new Clone() implementation to avoid duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c0438f16-3cf3-4db6-ade6-d2c82e7e2ddc

📥 Commits

Reviewing files that changed from the base of the PR and between eb3fc1b and 674eb7d.

📒 Files selected for processing (5)
  • AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs
  • src/Memory/ForwardArena.cs
  • src/Memory/LayerWorkspace.cs
  • src/NeuralNetworks/Layers/LayerBase.cs
  • src/NeuralNetworks/Layers/SSM/RWKV7Block.cs

ooples and others added 2 commits April 4, 2026 20:53
…ent across 50+ layers

Systematic migration of batchSize-shaped tensor allocations from
new Tensor<T>() to TensorAllocator.Rent<T>() across all NN layers.

Before: 1722 raw allocations in forward paths
After:  1498 raw allocations + 332 pooled TensorAllocator.Rent calls
Eliminated: 224 per-forward-pass allocations (13% reduction)

Key layers migrated:
- S4DLayer: 7, S5Layer: 7, Mamba2Block: 12, MegalodonLayer: 7
- LinearRecurrentUnitLayer: 7, S6Scan: 15, ExtendedLSTMLayer: 12
- BASEDLayer: 10, RWKVLayer: 21, TransNormerLLMLayer: 8
- DeltaFormerLayer: 10, HedgehogLayer: 7, RebasedLayer: 8
- Plus 30+ additional SSM/attention/recurrent layers

Remaining 1498 allocations are:
- Constructor parameter init (weights, biases) — cold path
- Gradient accumulators — training setup
- Backward pass intermediates — training only
- Fixed-shape constants

Part of #1014.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… files

Minor additions to the batch migration.

Running totals for #1014:
- NN Layers: 1498 raw + 332 pooled (was 1722 raw + ~100 pooled)
- RWKV7Block: 20 hot-path allocations → LayerWorkspace (232x speedup)
- All layers: 225 per-forward allocations → TensorAllocator.Rent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 00:54
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 54 out of 54 changed files in this pull request and generated 5 comments.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs (1)

232-277: ⚠️ Potential issue | 🟠 Major

Blocking: rented buffers make current recurrence/output writes non-deterministic.

Two issues:

  • Line 258: state is read at t=0 without explicit initialization.
  • Lines 276-277: output uses += semantics even though each slot is produced once per head.
💡 Suggested fix
-                            T prevState = state[new[] { bi, di, ki }];
+                            T prevState = t == 0
+                                ? NumOps.Zero
+                                : state[new[] { bi, di, ki }];
...
-                        output[new[] { bi, t, flatK }] = NumOps.Add(
-                            output[new[] { bi, t, flatK }], sum);
+                        output[new[] { bi, t, flatK }] = sum;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs` around lines 232
- 277, The rented per-head state tensor (state) is used before being initialized
and the output is being accumulated with += even though each head produces each
output slot exactly once; to fix, explicitly zero-initialize the rented state
immediately after Rent (the local variable state in the hi loop) and ensure
output entries for the current head are either pre-zeroed once before the head
loop or assigned (not added) when writing (change the output update in the inner
loop that currently does output[...] = output[...] + sum to a direct assignment
if you zeroed output per-head, or zero output once before the hi loop and keep
the additive update), updating the code around the state allocation and the
output write (refer to the variables state, gateVal, kVal, vVal, qVal, and
output used in the nested loops).
src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs (1)

455-470: ⚠️ Potential issue | 🟠 Major

Blocking: aggregation now depends on implicit zeroed memory.

aggregated[...] = aggregated[...] + ... at Lines 468-470 requires a guaranteed zero initial buffer. Please compute into a local accumulator and assign once.

💡 Suggested fix
-                        {
-                            aggregated[new[] { b, kvh, s, d }] = NumOps.Add(
-                                aggregated[new[] { b, kvh, s, d }],
-                                expandedGrad[new[] { b, qh, s, d }]);
-                        }
+                        {
+                            T sum = NumOps.Zero;
+                            for (int g2 = 0; g2 < _headsPerGroup; g2++)
+                            {
+                                int qh2 = kvh * _headsPerGroup + g2;
+                                sum = NumOps.Add(sum, expandedGrad[new[] { b, qh2, s, d }]);
+                            }
+                            aggregated[new[] { b, kvh, s, d }] = sum;
+                            break;
+                        }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs` around lines 455 -
470, The loop currently increments aggregated[...] in-place and relies on the
rented buffer being zeroed; instead, for each combination of b, kvh, s, d
compute a local accumulator (e.g. T acc = default(T)) and inside the g loop do
acc = NumOps.Add(acc, expandedGrad[new[] { b, qh, s, d }]) and after the g loop
assign aggregated[new[] { b, kvh, s, d }] = acc; update the nested loops around
variables batchSize, _numKVHeads, _headsPerGroup, _headDimension and keep use of
TensorAllocator.Rent and expandedGrad but avoid reading aggregated before
assignment.
src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)

386-451: ⚠️ Potential issue | 🟠 Major

Blocking: recurrent state must not rely on zeroed pooled memory.

Line 446 reads state on the first timestep. With rented buffers, explicitly handle t == 0 (and seed cached t=0 state if backward expects it).

💡 Suggested fix
-                            T prevS = state[new[] { bi, hi, di, ki }];
+                            T prevS = t == 0
+                                ? NumOps.Zero
+                                : state[new[] { bi, hi, di, ki }];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 386 - 451, The
rented tensor "state" is pooled and may contain garbage, but the loop reads
previous state at t==0 (prevS = state[...]) — explicitly initialize or seed the
recurrent state for t==0 instead of relying on zeros in pooled memory: inside
the outer time loop (for int t = 0; t < seqLen; t++) or before it, detect t==0
and either zero-fill the "state" tensor (using NumOps.Zero assignments) or set
prevS to NumOps.Zero when t==0 and also write that seeded state into "state" and
"allStates" (or initialize allStates[...,0,...]) so downstream code/backward
passes see a defined t=0 state; update places that read prevS (the read at prevS
= state[new[] { bi, hi, di, ki }]) to use the initialized value.
src/NeuralNetworks/Layers/SSM/RWKVLayer.cs (1)

340-353: ⚠️ Potential issue | 🟠 Major

Remove the unused shifted pass from the hot path.

Line 340 rents shifted, then Lines 341-353 fill it, but nothing ever reads it. rInput, kInput, and vInput are rebuilt from x_t/xPrev below, so this is pure extra work inside the perf-sensitive timestep loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RWKVLayer.cs` around lines 340 - 353, The
rented temporary tensor "shifted" is never read and should be removed from the
hot timestep loop: delete the TensorAllocator.Rent<T>(...) call and all
assignments to shifted[new[] { bi, d }] inside the nested loops, and also remove
any later references or Return/Dispose calls for "shifted" if present; keep
existing computations that build rInput/kInput/vInput from x_t and xPrev and
retain use of _timeMixR/_timeMixK/_timeMixV so behavior is unchanged while
removing the unused allocation and writes.
♻️ Duplicate comments (1)
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)

314-359: ⚠️ Potential issue | 🔴 Critical

Give each RWKV7Block<T> clone its own workspace buffers.

These lines make Workspace part of the block’s mutable execution state. If RWKV7Block<T> still inherits a shallow LayerBase<T>.Clone(), the clone and original will alias the same scratch buffers and cached sequence tensors, so a forward on one instance can corrupt the other. Move the declarations into an InitializeWorkspace() helper and call it from the constructor path and the block-specific Clone() override.

♻️ Suggested refactor
-        // Arena workspace: pre-allocate forward pass buffers for zero-allocation hot path
-        Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12);
-        // TimeMixing timestep buffers
-        Workspace.DeclareTimestep(TsRInput, _modelDimension);
-        Workspace.DeclareTimestep(TsKInput, _modelDimension);
-        Workspace.DeclareTimestep(TsVInput, _modelDimension);
-        Workspace.DeclareTimestep(TsAInput, _modelDimension);
-        Workspace.DeclareTimestep(TsBInput, _modelDimension);
-        Workspace.DeclareTimestep(TsWkvOut, _modelDimension);
-        Workspace.DeclareTimestep(TsYt, _modelDimension);
-        // ChannelMixing timestep buffers
-        Workspace.DeclareTimestep(TsCmRInput, _modelDimension);
-        Workspace.DeclareTimestep(TsCmKInput, _modelDimension);
-        Workspace.DeclareTimestep(TsCmKSiLU, _ffnDimension);
-        // TimeMixing sequence buffers
-        Workspace.DeclareSequence(SqAllR, _modelDimension);
-        Workspace.DeclareSequence(SqAllK, _modelDimension);
-        Workspace.DeclareSequence(SqAllV, _modelDimension);
-        Workspace.DeclareSequence(SqAllA, _modelDimension);
-        Workspace.DeclareSequence(SqAllB, _modelDimension);
-        Workspace.DeclareSequence(SqAllWkv, _modelDimension);
-        Workspace.DeclareSequence(SqAllWkvPre, _modelDimension);
-        Workspace.DeclareSequence(SqAllWkvGated, _modelDimension);
-        // ChannelMixing sequence buffers
-        Workspace.DeclareSequence(SqCmAllRGate, _modelDimension);
-        Workspace.DeclareSequence(SqCmAllVProj, _modelDimension);
-        Workspace.DeclareSequence(SqCmAllSiLU, _ffnDimension);
-        Workspace.DeclareSequence(SqCmAllKProj, _ffnDimension);
+        InitializeWorkspace();
private void InitializeWorkspace()
{
    Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12);
    // existing DeclareTimestep / DeclareSequence calls...
}

// Call InitializeWorkspace() from this class's Clone() override too.

Verify the clone path. Expected result: LayerBase<T> still uses a shallow clone and RWKV7Block<T> has no override yet.

#!/bin/bash
set -euo pipefail

echo "LayerBase clone implementation:"
fd 'LayerBase.cs$' src --exec sed -n '1,260p' {}

echo
echo "RWKV7Block clone declarations:"
rg -n '\bClone\s*\(' src/NeuralNetworks/Layers/SSM/RWKV7Block.cs || true

echo
echo "RWKV7Block workspace setup:"
sed -n '314,360p' src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` around lines 314 - 359, The
Workspace is currently allocated in the constructor body making it shared when
LayerBase<T>.Clone() does a shallow clone; move the allocation and all
Workspace.DeclareTimestep/DeclareSequence calls into a new private
InitializeWorkspace() method (create LayerWorkspace<T>(timestepCount: 10,
sequenceCount: 12) and call the same Declare* entries there) and invoke
InitializeWorkspace() from the constructor and from the RWKV7Block<T>.Clone()
override so each cloned block gets its own LayerWorkspace<T> instead of aliasing
the original Workspace; update references to Workspace and keep the existing
Ts*/Sq* constants unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs`:
- Around line 252-265: The rented recurrent buffers (cellState, normState,
allCellStates, allNormStates and any other rented tensors used as
recurrence/caches) must be zero-initialized before being read/accumulated;
update the code that rents these tensors to explicitly clear them (e.g. call the
tensor's zero/Fill/Clear method or set elements to default(T)) immediately after
TensorAllocator.Rent<T>(...) and ensure the initial timestep/cached slot
(allCellStates[...,0] and allNormStates[...,0]) is set to zeros so the first
timestep and cached state remain deterministic; apply the same zeroing to the
other rented recurrent buffers referenced in the forward/backward paths
(cellState, normState and the seq-length arrays).

In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs`:
- Around line 320-324: The rented buffers can contain stale data so preserve the
zero initial hidden state by explicitly zero-initializing `h` and the timestep-0
slice of `allHidden` after calling `TensorAllocator.Rent<T>`; specifically, in
HGRNLayer (variables `h` and `allHidden`) set all elements of `h` to zero before
it is first read and copy/clear the [batch, 0, modelDim] slice of `allHidden` to
zeros (or write `h` into that slice) so the initial hidden-state snapshot
remains deterministic when using `TensorAllocator.Rent<T>`.

In `@src/NeuralNetworks/Layers/SSM/MEGALayer.cs`:
- Around line 374-375: The rented tensor "states" from
TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension }) must
have its timestep-0 slice explicitly zeroed to ensure deterministic EMA startup;
modify the code after the Rent call to set states[..., 0, ...] (the t=0 slice
across batch and _emaDimension) to zeros (use the tensor fill/clear API or
Memory.Clear equivalent) so that subsequent reads like states[..., t, ...]
assume a defined t=0 EMA state.
- Line 417: The rented attnScores tensor in MEGALayer (created via
TensorAllocator.Rent<T>(...)) can contain stale upper-triangle values because
only j<=i entries are written; immediately clear or initialize the whole tensor
after renting to eliminate stale data. Replace the direct Rent<T> usage with a
Rent followed by a full clear/initialization (e.g.,
TensorAllocator.Clear<T>(attnScores) or a Fill with default(T)), or, if numeric
types are known (float/double), initialize masked positions to a large negative
value (e.g., -Infinity) so softmax will ignore them; ensure this happens before
any downstream read of attnScores.
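A small sketch of the -Infinity seeding option suggested above, with plain arrays standing in for `Tensor<T>` (the softmax helper here is illustrative, not the repo's implementation):

```csharp
// Sketch: seeding masked attention slots with -infinity makes softmax
// assign them exactly zero weight, so stale pooled values cannot leak.
using System;

class CausalSoftmaxDemo
{
    static double[] Softmax(double[] scores)
    {
        double max = double.NegativeInfinity;
        foreach (var s in scores) if (s > max) max = s;
        double sum = 0;
        var e = new double[scores.Length];
        for (int j = 0; j < scores.Length; j++) { e[j] = Math.Exp(scores[j] - max); sum += e[j]; }
        for (int j = 0; j < scores.Length; j++) e[j] /= sum;
        return e;
    }

    static void Main()
    {
        int i = 1, seqLen = 4;
        var row = new double[seqLen];

        // Seed every slot with -inf, then write only the causal half (j <= i).
        for (int j = 0; j < seqLen; j++) row[j] = double.NegativeInfinity;
        for (int j = 0; j <= i; j++) row[j] = 0.5; // placeholder q·k score

        var probs = Softmax(row);
        // exp(-inf) == 0, so masked positions get exactly zero weight.
        Console.WriteLine(probs[2] == 0 && probs[3] == 0); // True
    }
}
```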

In `@src/NeuralNetworks/Layers/SSM/MinLSTMLayer.cs`:
- Around line 370-373: The rented buffers from TensorAllocator.Rent<T>
(cellState and allCellStates) may contain junk; explicitly seed/zero them
immediately after renting and before any read at runtime: after calling
TensorAllocator.Rent<T>(...) for cellState and for allCellStates, call the
allocator/tensor utility to clear or fill those tensors with zero (and also set
allCellStates[...,0,...] to zero) so the initial recurrent state is
deterministic; locate the allocations of cellState and allCellStates in
MinLSTMLayer.cs and insert the zeroing step before any usage (e.g., before the
first read at the current cell update).

In `@src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs`:
- Around line 425-427: The rented buffer allStates (from
TensorAllocator.Rent<T>) is not initialized so the t=0 snapshot is undefined;
after renting allStates, explicitly initialize the index 0 timeslice with the
current base state (copy _lastStates into allStates[..., 0, ...] or clear the
buffer to zeros) so that the recurrence invariant holds and readers of
_lastStates see a valid t=0 snapshot; ensure the same initialization is applied
wherever you write t+1 (the loop that writes into allStates at t+1) so the
initial state at index 0 is correctly set before the timestep loop runs.

In `@src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs`:
- Line 312: The rented attention buffer allAttnWeights (created via
TensorAllocator.Rent<T>) may contain stale upper-triangular values because only
the causal half (tk <= tq) is written later; explicitly zero the masked half of
the tensor after renting and before use so masked positions cannot leak old
data. Locate the Rent call that creates allAttnWeights/_lastAttnWeights in the
MultiLatentAttentionLayer (around the code that uses seqLen, batchSize, and
_numHeads) and clear all entries where tk > tq (the upper-triangular/masked
half) — either by calling the allocator/tensor zeroing API if available or by
iterating and setting those elements to zero — before any reads/writes that
assume those positions are zero.

In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs`:
- Around line 288-290: The rented buffers h, allHidden, and allDecay in
RealGatedLinearRecurrenceLayer are coming from a pool and may contain stale
data; after calling TensorAllocator.Rent<T>(...) you must explicitly
zero-initialize them so the RG-LRU initial hidden state and timestep-0 hidden
slice are zero. Locate the Rent calls that assign h, allHidden, and allDecay and
immediately clear them (e.g., fill/zero the entire tensor or at minimum set
allHidden[...,0,...] and h to zero) so no pooled garbage is consumed before the
first write; ensure the chosen zeroing method matches your Tensor API and is
applied consistently for all three rented tensors.

In `@src/NeuralNetworks/Layers/SSM/RebasedLayer.cs`:
- Around line 424-426: The rented gradient accumulators dQ, dK, and dV returned
by TensorAllocator.Rent<T>(...) must be explicitly zeroed before any +=
accumulation to avoid leaking stale data; update the RebasedLayer implementation
to clear or fill zeros into the tensors immediately after renting (before any
use) so the subsequent increments at the accumulation sites (the += operations
around the existing lines 539-542 and 589-592) produce correct gradients.

In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs`:
- Around line 507-512: The rented recurrence state tensors hReal and hImag (and
the initial slot allHiddenReal[...,0] / allHiddenImag[...,0]) are used before
being set, so explicitly zero them after allocation instead of relying on
allocator behavior: after calling TensorAllocator.Rent<T>(...) for hReal/hImag
and for allHiddenReal/allHiddenImag, fill hReal and hImag with zeros and copy
those zeros into the index-0 slice of allHiddenReal/allHiddenImag (or
memset/clear the whole allHidden tensors if simpler) so the initial hidden state
is deterministically zero; locate these tensors where they are created in
S5Layer.cs (variables hReal, hImag, allHiddenReal, allHiddenImag) and perform
the explicit zeroing immediately after each Rent call.

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 407-408: The rented tensors `state` and `allStates` from
`TensorAllocator.Rent<T>(...)` must be explicitly initialized to their
zero/default recurrence values before the first timestep; modify the code around
the `state` and `allStates` allocations (the `state` and `allStates` variables)
to zero-fill or otherwise set the `[*, 0, ...]` snapshot and the `state` buffer
immediately after rental (using your tensor fill/clear utility or by writing
zeros across shapes constructed from `batchSize`, `_numHeads`, `_headDimension`)
so the forward pass does not rely on `Rent` returning zeroed memory.

---

Outside diff comments:
In `@src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs`:
- Around line 455-470: The loop currently increments aggregated[...] in-place
and relies on the rented buffer being zeroed; instead, for each combination of
b, kvh, s, d compute a local accumulator (e.g. T acc = default(T)) and inside
the g loop do acc = NumOps.Add(acc, expandedGrad[new[] { b, qh, s, d }]) and
after the g loop assign aggregated[new[] { b, kvh, s, d }] = acc; update the
nested loops around variables batchSize, _numKVHeads, _headsPerGroup,
_headDimension and keep use of TensorAllocator.Rent and expandedGrad but avoid
reading aggregated before assignment.
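The local-accumulator rewrite can be sketched with scalars standing in for the tensor slots (the values and flattened indexing here are made up for illustration):

```csharp
// Sketch: summing into a local and assigning once removes the dependency on
// the rented slot starting at zero.
using System;

class LocalAccumulatorDemo
{
    static void Main()
    {
        int headsPerGroup = 4;
        double[] expandedGrad = { 1.0, 2.0, 3.0, 4.0 }; // one (b, s, d) value per query head

        // Buggy pattern: aggregated[slot] += ... trusts the slot starts at zero.
        double dirtySlot = 99.0; // stale pooled value
        for (int g = 0; g < headsPerGroup; g++) dirtySlot += expandedGrad[g];
        Console.WriteLine(dirtySlot); // 109 -- the stale 99 leaked in

        // Fixed pattern: accumulate locally, then assign (never read the slot).
        double acc = 0.0;
        for (int g = 0; g < headsPerGroup; g++) acc += expandedGrad[g];
        double aggregatedSlot = acc;
        Console.WriteLine(aggregatedSlot); // 10
    }
}
```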

In `@src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs`:
- Around line 232-277: The rented per-head state tensor (state) is read before
it is initialized, and the output is accumulated with += even though each head
writes each output slot exactly once. Fix both issues: zero-initialize the
rented state immediately after Rent (the local variable state in the hi loop),
and make the output write safe — either zero the output tensor once before the
hi loop and keep the additive update, or change the inner-loop write from
output[...] = output[...] + sum to a direct assignment. Refer to the variables
state, gateVal, kVal, vVal, qVal, and output used in the nested loops.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 386-451: The rented tensor "state" is pooled and may contain
garbage, but the loop reads previous state at t==0 (prevS = state[...]) —
explicitly initialize or seed the recurrent state for t==0 instead of relying on
zeros in pooled memory: inside the outer time loop (for int t = 0; t < seqLen;
t++) or before it, detect t==0 and either zero-fill the "state" tensor (using
NumOps.Zero assignments) or set prevS to NumOps.Zero when t==0 and also write
that seeded state into "state" and "allStates" (or initialize
allStates[...,0,...]) so downstream code/backward passes see a defined t=0
state; update places that read prevS (the read at prevS = state[new[] { bi, hi,
di, ki }]) to use the initialized value.

In `@src/NeuralNetworks/Layers/SSM/RWKVLayer.cs`:
- Around line 340-353: The rented temporary tensor "shifted" is never read and
should be removed from the hot timestep loop: delete the
TensorAllocator.Rent<T>(...) call and all assignments to shifted[new[] { bi, d
}] inside the nested loops, and also remove any later references or
Return/Dispose calls for "shifted" if present; keep existing computations that
build rInput/kInput/vInput from x_t and xPrev and retain use of
_timeMixR/_timeMixK/_timeMixV so behavior is unchanged while removing the unused
allocation and writes.

---

Duplicate comments:
In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 314-359: The Workspace is currently allocated in the constructor
body making it shared when LayerBase<T>.Clone() does a shallow clone; move the
allocation and all Workspace.DeclareTimestep/DeclareSequence calls into a new
private InitializeWorkspace() method (create LayerWorkspace<T>(timestepCount:
10, sequenceCount: 12) and call the same Declare* entries there) and invoke
InitializeWorkspace() from the constructor and from the RWKV7Block<T>.Clone()
override so each cloned block gets its own LayerWorkspace<T> instead of aliasing
the original Workspace; update references to Workspace and keep the existing
Ts*/Sq* constants unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: ec7688bf-ff16-4868-a285-34242858d5b2

📥 Commits

Reviewing files that changed from the base of the PR and between 674eb7d and be4cd7c.

📒 Files selected for processing (50)
  • src/Diffusion/Audio/ShortTimeFourierTransform.cs
  • src/NeuralNetworks/Layers/ConvLSTMLayer.cs
  • src/NeuralNetworks/Layers/DiffusionConvLayer.cs
  • src/NeuralNetworks/Layers/GRULayer.cs
  • src/NeuralNetworks/Layers/GraphTransformerLayer.cs
  • src/NeuralNetworks/Layers/GroupedQueryAttentionLayer.cs
  • src/NeuralNetworks/Layers/LSTMLayer.cs
  • src/NeuralNetworks/Layers/MeasurementLayer.cs
  • src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs
  • src/NeuralNetworks/Layers/SSM/ABCLayer.cs
  • src/NeuralNetworks/Layers/SSM/BASEDLayer.cs
  • src/NeuralNetworks/Layers/SSM/DeltaFormerLayer.cs
  • src/NeuralNetworks/Layers/SSM/DeltaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/DeltaProductLayer.cs
  • src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs
  • src/NeuralNetworks/Layers/SSM/HGRNLayer.cs
  • src/NeuralNetworks/Layers/SSM/HedgehogLayer.cs
  • src/NeuralNetworks/Layers/SSM/HyenaLayer.cs
  • src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/LinearRecurrentUnitLayer.cs
  • src/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/LonghornLayer.cs
  • src/NeuralNetworks/Layers/SSM/MEGALayer.cs
  • src/NeuralNetworks/Layers/SSM/Mamba2Block.cs
  • src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs
  • src/NeuralNetworks/Layers/SSM/MesaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/MinGRULayer.cs
  • src/NeuralNetworks/Layers/SSM/MinLSTMLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs
  • src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/PaTHAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
  • src/NeuralNetworks/Layers/SSM/RWKVLayer.cs
  • src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs
  • src/NeuralNetworks/Layers/SSM/RebasedLayer.cs
  • src/NeuralNetworks/Layers/SSM/RetNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/RodimusLayer.cs
  • src/NeuralNetworks/Layers/SSM/S4DLayer.cs
  • src/NeuralNetworks/Layers/SSM/S5Layer.cs
  • src/NeuralNetworks/Layers/SSM/S6Scan.cs
  • src/NeuralNetworks/Layers/SSM/ScanPatterns.cs
  • src/NeuralNetworks/Layers/SSM/TTTLayer.cs
  • src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs
  • src/NeuralNetworks/Layers/SpikingLayer.cs

…terns + fixes

- 18 ones-tensor allocations eliminated via Engine.ScalarMinusTensor
  (ABCLayer, GatedDeltaNet, HGRN2, Hedgehog, Hyena, Kimi, LogLinear,
  MesaNet, MixtureOfMamba/Memories, MultiLatentAttention, Rebased,
  RetNet, Rodimus, TTT, TransNormerLLM)
- 75 bracket-syntax allocations converted to TensorAllocator.Rent
  (GraphTransformerLayer 13, GraphAttentionLayer 6, Megalodon 4,
  CapsuleLayer 2, ConvLSTMLayer 2, plus 30+ other layers)
- Fixed 20 SiLU derivative methods where ones variable was removed
  but still referenced (CreateOnesLike fallback)

Running totals: raw allocations reduced from 1405 to ~1330; 407 now pooled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (18)
src/NeuralNetworks/Layers/HeterogeneousGraphLayer.cs (1)

404-455: ⚠️ Potential issue | 🟠 Major

Rented output buffer is not lifecycle-safe with current reassignment pattern.

At Line 404, output is rented, but Line 454 reassigns output = Engine.TensorAdd(output, convOutput). If TensorAdd allocates (non-in-place), the originally rented buffer is orphaned and this path still allocates per edge type. That risks pool pressure and undermines the zero-allocation objective.

Suggested safe fix (until in-place accumulation/explicit return is wired)
-        var output = TensorAllocator.Rent<T>([batchSize, processNumNodes, _outputFeatures]);
+        var output = new Tensor<T>([batchSize, processNumNodes, _outputFeatures]);
         output.Fill(NumOps.Zero);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/HeterogeneousGraphLayer.cs` around lines 404 - 455,
The rented buffer "output" from TensorAllocator.Rent<T> is being orphaned by the
reassignment output = Engine.TensorAdd(output, convOutput), which can allocate a
new tensor each edge type; instead perform in-place accumulation into the rented
buffer to avoid allocations: replace the reassignment with an in-place add into
the existing "output" buffer (use the library's in-place primitive if available,
e.g., Engine.AddInto/Engine.TensorAddInPlace or a manual elementwise loop that
writes convOutput into "output"), ensuring you still use the same "output"
instance across iterations and only return/dispose it after the loop; update
code paths referencing output to not reassign it and keep Engine.TensorAdd usage
only if it has an explicit "target" parameter that writes into the rented
buffer.
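The in-place accumulation this prompt describes can be sketched with plain arrays in place of `Tensor<T>` and `Engine.TensorAdd` (names and values here are illustrative only):

```csharp
// Sketch: writing into one long-lived buffer across iterations, instead of
// reassigning to freshly allocated sums, keeps the rented instance alive
// and avoids a per-edge-type allocation.
using System;

class InPlaceAddDemo
{
    static void Main()
    {
        double[] output = new double[4];   // the single rented buffer, pre-zeroed
        double[][] convOutputs =           // one result per edge type
        {
            new[] { 1.0, 1.0, 1.0, 1.0 },
            new[] { 2.0, 2.0, 2.0, 2.0 },
        };

        foreach (var conv in convOutputs)
        {
            // output = TensorAdd(output, conv) would orphan the rented buffer;
            // an element-wise write into it keeps the same instance throughout.
            for (int i = 0; i < output.Length; i++) output[i] += conv[i];
        }

        Console.WriteLine(output[0]); // 3
    }
}
```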
src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs (4)

173-180: ⚠️ Potential issue | 🔴 Critical

BLOCKING: Missing GetMetadata method required by golden pattern.

Per the neural network guidelines, every layer must implement GetMetadata returning a dictionary with at minimum: ModelType, ParameterCount, Architecture, and InputShape/OutputShape.

🔧 Required implementation
+    /// <inheritdoc/>
+    public override Dictionary<string, object> GetMetadata()
+    {
+        return new Dictionary<string, object>
+        {
+            ["ModelType"] = "PiecewiseLinearEncodingLayer",
+            ["ParameterCount"] = ParameterCount,
+            ["Architecture"] = $"PLELayer(features={_numFeatures}, bins={_numBins})",
+            ["InputShape"] = InputShape,
+            ["OutputShape"] = OutputShape,
+            ["NumFeatures"] = _numFeatures,
+            ["NumBins"] = _numBins
+        };
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 173 -
180, The PiecewiseLinearEncodingLayer class is missing the required GetMetadata
method; add a public override IDictionary<string, object> GetMetadata() in
PiecewiseLinearEncodingLayer that returns a dictionary containing at minimum
"ModelType" (e.g., the layer type name or enum), "ParameterCount" (use
_binBoundaries.Length or result of GetParameters().Length), "Architecture" (a
short string describing the piecewise-linear encoding), and
"InputShape"/"OutputShape" (compute shapes based on existing input/output
properties or constructor parameters); ensure the method signature matches the
base Layer/LayerBase contract and use the same generic T where appropriate so
callers expecting metadata can consume these keys.

7-24: ⚠️ Potential issue | 🟡 Minor

Missing reference citation in XML documentation.

The golden pattern requires a <para><b>Reference:</b> section citing the original research paper. Piecewise Linear Encoding is from the TabM paper and should be cited.

📚 Add reference
 /// It's similar to how histograms work, but with soft (differentiable) boundaries.
 /// </para>
+/// <para>
+/// <b>Reference:</b> Gorishniy et al., "TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling" (2024).
+/// </para>
 /// </remarks>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 7 -
24, Add a Reference citation to the XML documentation for the
PiecewiseLinearEncodingLayer class: update the <remarks> block in
PiecewiseLinearEncodingLayer (file/class name: PiecewiseLinearEncodingLayer.cs /
class PiecewiseLinearEncodingLayer) to include a <para><b>Reference:</b> entry
that cites the TabM paper (author(s), title, venue/year and DOI or URL) so the
documentation follows the golden pattern and attributes the original research.

164-171: 🧹 Nitpick | 🔵 Trivial

ResetState should consider returning rented tensor to pool.

If TensorAllocator expects explicit returns, the cached _outputCache tensor should be returned here before nulling the reference. Otherwise you're leaking pooled memory.

♻️ Potential fix (if Return is needed)
 public override void ResetState()
 {
+    if (_outputCache != null)
+    {
+        TensorAllocator.Return(_outputCache);
+    }
     _inputCache = null;
     _outputCache = null;

     Engine.TensorFill(_binBoundaryGradients, NumOps.Zero);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 164 -
171, ResetState currently nulls cached tensors but doesn't return rented buffers
to the allocator; change ResetState to check for non-null cached tensors (at
minimum _outputCache, and consider _inputCache) and return them to the
TensorAllocator/engine (e.g., call the allocator's Return/Release method or
Engine-equivalent for rented tensors) before setting the fields to null, then
continue with Engine.TensorFill(_binBoundaryGradients, NumOps.Zero).

44-45: ⚠️ Potential issue | 🔴 Critical

BLOCKING: SupportsTraining is true but no Backward method exists.

This layer declares it supports training but provides absolutely no mechanism to backpropagate gradients. This is fundamentally broken. Per the golden pattern, every neural network layer must have a Backward method that accepts Tensor<T> outputGradient and returns Tensor<T>.

The gradient flow for piecewise linear encoding needs to:

  1. Compute gradients w.r.t. _binBoundaries (accumulate into _binBoundaryGradients)
  2. Compute gradients w.r.t. input features and return them

Without this, any training pipeline using this layer will fail or silently produce incorrect results.

🔧 Required implementation skeleton
+    /// <inheritdoc/>
+    public override Tensor<T> Backward(Tensor<T> outputGradient)
+    {
+        if (_inputCache == null)
+            throw new InvalidOperationException("Forward must be called before Backward");
+
+        int batchSize = _inputCache.Shape[0];
+        var inputGradient = TensorAllocator.Rent<T>([batchSize, _numFeatures]);
+
+        for (int b = 0; b < batchSize; b++)
+        {
+            for (int f = 0; f < _numFeatures; f++)
+            {
+                var value = _inputCache[b * _numFeatures + f];
+                var grad = ComputeFeatureGradient(value, f, outputGradient, b);
+                inputGradient[b * _numFeatures + f] = grad;
+            }
+        }
+
+        return inputGradient;
+    }
+
+    private T ComputeFeatureGradient(T value, int featureIdx, Tensor<T> outputGradient, int batchIdx)
+    {
+        int outputOffset = batchIdx * _numFeatures * _numBins + featureIdx * _numBins;
+        int boundaryOffset = featureIdx * (_numBins - 1);
+        var inputGrad = NumOps.Zero;
+
+        // Gradient through first bin
+        var firstBoundary = _binBoundaries[boundaryOffset];
+        var firstDiff = NumOps.Subtract(value, firstBoundary);
+        if (NumOps.Compare(firstDiff, NumOps.Zero) > 0 && NumOps.Compare(firstDiff, NumOps.One) < 0)
+        {
+            inputGrad = NumOps.Add(inputGrad, outputGradient[outputOffset]);
+            _binBoundaryGradients[boundaryOffset] = NumOps.Subtract(
+                _binBoundaryGradients[boundaryOffset], outputGradient[outputOffset]);
+        }
+
+        // TODO: Complete gradient computation for middle and last bins
+        // This requires tracking which path was taken in the Min operation
+
+        return inputGrad;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs` around lines 44 -
45, The layer advertises training but lacks a Backward implementation; add a
public override Tensor<T> Backward(Tensor<T> outputGradient) in the
PiecewiseLinearEncodingLayer class that computes and returns the gradient w.r.t.
the layer inputs and accumulates gradients w.r.t. the bin boundaries into the
existing _binBoundaryGradients field; specifically, compute per-output-bin
contributions to ∂L/∂input by mapping outputGradient through the
piecewise-linear encoding's local slopes and compute ∂L/∂_binBoundaries by
summing the contribution where inputs sit at bin boundaries, updating
_binBoundaryGradients (not replacing it), and ensure the method signature and
behavior align with other layers' Backward contracts so SupportsTraining can
remain true.
src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs (2)

906-910: ⚠️ Potential issue | 🔴 Critical

Blocking: validate avgDegree before scaler division.

Line [909] and Line [918] rely on _avgDegree in division paths. If constructor input avgDegree <= 0, this can produce Inf/NaN and poison forward/backward values. Add a fail-fast guard in the constructor.

Proposed fix
 public PrincipalNeighbourhoodAggregationLayer(
     int inputFeatures,
     int outputFeatures,
     PNAAggregator[]? aggregators = null,
     PNAScaler[]? scalers = null,
     double avgDegree = 1.0,
     IActivationFunction<T>? activationFunction = null,
     IInitializationStrategy<T>? initializationStrategy = null)
     : base([inputFeatures], [outputFeatures], activationFunction ?? new IdentityActivation<T>())
 {
+    if (avgDegree <= 0.0)
+    {
+        throw new ArgumentOutOfRangeException(nameof(avgDegree), "avgDegree must be > 0.");
+    }
+
     InitializationStrategy = initializationStrategy ?? InitializationStrategies<T>.Eager;

As per coding guidelines “missing validation of external inputs” is a blocking production-readiness issue.

Also applies to: 915-919

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs` around
lines 906 - 910, The constructor(s) of PrincipalNeighbourhoodAggregationLayer
must validate the avgDegree parameter and fail-fast if it's not > 0 to prevent
Inf/NaN during division; add a guard in the class constructor(s) that sets
_avgDegree (e.g., verify avgDegree > 0 and throw an
ArgumentException/ArgumentOutOfRangeException with a clear message) and ensure
any overloads or factory methods that assign _avgDegree perform the same check
so that downstream code in TensorBroadcastDivide/Multiply (references:
_avgDegree, avgDegreeTensor, Engine.TensorBroadcastDivide, TensorAllocator.Rent)
never executes with non-positive avgDegree.

945-981: ⚠️ Potential issue | 🔴 Critical

Blocking: scaler backprop is mathematically incorrect for non-identity scalers.

Line [967]-[968] sets scalerGrad = gradSlice for all scalers. That drops derivative factors for PNAScaler.Amplification and PNAScaler.Attenuation, yielding incorrect gradients during training.

Proposed fix
-                // Backprop through scaler (simplified - assumes identity for gradient)
-                var scalerGrad = gradSlice;
+                // Backprop through scaler
+                Tensor<T> scalerGrad;
+                switch (_scalers[scalerIdx])
+                {
+                    case PNAScaler.Identity:
+                        scalerGrad = gradSlice;
+                        break;
+                    case PNAScaler.Amplification:
+                    {
+                        var avgDegreeTensor = TensorAllocator.Rent<T>([batchSize, numNodes, 1]);
+                        avgDegreeTensor.Fill(NumOps.FromDouble(_avgDegree));
+                        var factor = Engine.TensorBroadcastDivide(degExpanded, avgDegreeTensor); // degree / avgDegree
+                        scalerGrad = Engine.TensorBroadcastMultiply(gradSlice, factor);
+                        break;
+                    }
+                    case PNAScaler.Attenuation:
+                    {
+                        var avgDegreeTensor = TensorAllocator.Rent<T>([batchSize, numNodes, 1]);
+                        avgDegreeTensor.Fill(NumOps.FromDouble(_avgDegree));
+                        var factor = Engine.TensorBroadcastDivide(avgDegreeTensor, degExpanded); // avgDegree / degree
+                        scalerGrad = Engine.TensorBroadcastMultiply(gradSlice, factor);
+                        break;
+                    }
+                    default:
+                        scalerGrad = gradSlice;
+                        break;
+                }

As per coding guidelines “Simplified implementations ... missing proper logic” are blocking issues requiring immediate fix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs` around
lines 945 - 981, The code incorrectly sets scalerGrad = gradSlice for all
scalers; replace that with scaler-specific backprop: inspect the scaler kind via
_scalers[scalerIdx] and compute scalerGrad by multiplying gradSlice with the
derivative of the forward scaler transform (e.g., for PNAScaler.Amplification
multiply by the amplification factor s used in forward, for
PNAScaler.Attenuation multiply by the attenuation derivative — typically 1/s or
the attenuation factor used), using the Engine tensor multiply/broadcast helpers
and preserving shape ([batchSize, numNodes, _inputFeatures]); ensure you
retrieve the exact scalar/factor used in the forward pass (the same
variable/array used when applying scalers) and apply it with
Engine.TensorMultiply or TensorScalarMultiply before continuing
aggregation/backprop.
src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs (2)

46-48: ⚠️ Potential issue | 🔴 Critical

BLOCKING: Missing Backward method despite SupportsTraining => true.

This layer declares it supports training but has no Backward method implementation. Per the golden pattern guidelines, every layer that supports training must implement backpropagation.

The cached values _inputCache and _normalizedCache suggest backward was intended but never implemented. This is incomplete code that cannot be shipped.

You need to implement:

public override Tensor<T> Backward(Tensor<T> outputGradient)
{
    // Backprop through layer norm
    // Backprop through residual connection
    // Backprop through attention (transpose, attention scores, projections)
    // Return gradient w.r.t. input
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs` around lines 46 - 48,
The layer declares SupportsTraining => true but lacks the required Backward
implementation; implement public override Tensor<T> Backward(Tensor<T>
outputGradient) in IntersampleAttentionLayer to compute gradients through the
three main components: first backprop the residual connection (split
outputGradient into gradient to the attention output and passthrough to input
using the saved _inputCache), then backprop through the attention block by
reversing the projection steps (compute gradients w.r.t. output projection,
attention output, and input projections—queries/keys/values—using transposes of
the same weight matrices used in Forward, and backprop through attention
scores/softmax by using saved attention scores or by recomputing from cached
normalized inputs), and finally backprop through the LayerNorm using
_normalizedCache to produce gradient w.r.t. the pre-normalized input; ensure you
accumulate gradients into the returned Tensor<T> matching the input shape and
update any parameter gradient holders for the projection and output weight
matrices consistently with how Forward performed projection and residual
addition.

117-117: 🧹 Nitpick | 🔵 Trivial

Inconsistent allocation strategy within the same method.

Line 117 uses new Tensor<T>() for the output tensor while line 124 uses TensorAllocator.Rent<T>() for featureSlice. If the PR goal is zero-allocation forward passes, the output tensor should also use the pooled allocator for consistency.

Additionally, the intermediate tensors from projection layers (queries, keys, values) and operations (kT, scores, attended) may also contribute to allocation overhead.

♻️ Suggested improvement
-        var output = new Tensor<T>([numFeatures, batchSize, embDim]);
+        var output = TensorAllocator.Rent<T>([numFeatures, batchSize, embDim]);

Note: If this output tensor is returned from the method (and ultimately from Forward), you'll need a strategy to ensure the caller or the layer's lifecycle management returns it to the pool appropriately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs` at line 117, The
output tensor is allocated directly with new Tensor<T>(...) while other
temporaries use TensorAllocator.Rent<T>(), causing inconsistent allocation;
change the allocation of output to use TensorAllocator.Rent<T>(...) with the
same shape as new Tensor<T>([numFeatures, batchSize, embDim]) and ensure it is
returned to the pool by the caller or layer lifecycle if Forward returns it.
Also review the projection tensors (queries, keys, values) and intermediate
results (kT, scores, attended, featureSlice) and convert their allocations to
use TensorAllocator.Rent<T>() consistently to eliminate stray allocations;
update any ownership/return-to-pool semantics so rented tensors are released
appropriately after use or by the caller for returned tensors.
src/NeuralNetworks/Layers/QuantumLayer.cs (1)

218-228: 🧹 Nitpick | 🔵 Trivial

Inconsistent allocation strategy within Forward method.

Line 210 was converted to TensorAllocator.Rent<T>, but these intermediate tensors still use new Tensor<T>():

  • Line 218: imagState
  • Line 223: epsilonTensor

For consistent zero-allocation forward passes per PR objectives, these should also use pooled allocation.

♻️ Proposed fix for consistency
-        var imagState = new Tensor<T>(realState.Shape.ToArray());
+        var imagState = TensorAllocator.Rent<T>(realState.Shape.ToArray());

         // Normalize each batch item: divide by sqrt(sum(|state|^2) + eps)
         var magnitudeSquared = Engine.ComplexMagnitudeSquared(realState, imagState);
         var normPerBatch = Engine.ReduceSum(magnitudeSquared, [1], keepDims: true);
-        var epsilonTensor = new Tensor<T>(normPerBatch.Shape.ToArray());
+        var epsilonTensor = TensorAllocator.Rent<T>(normPerBatch.Shape.ToArray());
         epsilonTensor.Fill(NumOps.FromDouble(1e-10));

Note: Ensure proper return/disposal of these rented tensors as well (same concern as line 210).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/QuantumLayer.cs` around lines 218 - 228, The
Forward method mixes pooled and non-pooled allocations: replace the direct
allocations of imagState and epsilonTensor (created with new Tensor<T>) with
pooled allocations via TensorAllocator.Rent<T> (matching the change at line
210), and ensure you return/dispose these rented tensors (as you do for other
rented tensors) after use; update creation of
denomExpanded/normalizedReal/normalizedImag to use the rented tensors as inputs
and add the corresponding TensorAllocator.Return/Dispose calls for imagState and
epsilonTensor in the same cleanup path.
src/NeuralNetworks/Layers/BidirectionalLayer.cs (1)

52-54: ⚠️ Potential issue | 🟡 Minor

Remove orphaned XML documentation.

This XML comment describes "The computation engine (CPU or GPU) for vectorized operations" but there's no corresponding field or property defined after it. This appears to be leftover documentation from a removed or never-implemented member.

🧹 Proposed fix
-    /// <summary>
-    /// The computation engine (CPU or GPU) for vectorized operations.
-    /// </summary>
-
     /// <summary>
     /// Gets a value indicating whether this layer supports training.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/BidirectionalLayer.cs` around lines 52 - 54, Remove
the orphaned XML doc comment that reads "The computation engine (CPU or GPU) for
vectorized operations" in the BidirectionalLayer class; locate the stray
triple-slash XML block in src/NeuralNetworks/Layers/BidirectionalLayer.cs (near
the BidirectionalLayer class declaration) and delete it so there is no
unused/mismatched summary comment left without a corresponding field or
property. If the comment was meant for a member, instead move or reattach it to
the correct field/property; otherwise simply remove the orphaned XML block.
src/NeuralNetworks/Layers/FeatureTransformerLayer.cs (2)

59-60: ⚠️ Potential issue | 🔴 Critical

BLOCKING: SupportsTraining => true but no Backward method implemented.

Per the golden pattern for neural network classes: "Backward Method: Must accept Tensor<T> outputGradient and return Tensor<T>. Must implement actual backpropagation through all layers."

The layer claims training support and even caches _intermediateOutputs (line 57) for backprop, but there's no Backward implementation. This makes gradient-based training impossible. This is a blocking issue that must be resolved before merge.

Production-ready code requires a complete Backward method that:

  1. Accepts Tensor<T> outputGradient
  2. Backpropagates through step-specific layers (reverse order)
  3. Backpropagates through shared layers (reverse order)
  4. Applies GLU backward (gradient through sigmoid gate and element-wise multiply)
  5. Returns input gradient
📐 Skeleton for required Backward implementation
/// <summary>
/// Performs the backward pass through the Feature Transformer.
/// </summary>
/// <param name="outputGradient">Gradient from the next layer.</param>
/// <returns>Gradient with respect to the input.</returns>
public override Tensor<T> Backward(Tensor<T> outputGradient)
{
    var currentGrad = outputGradient;
    int totalLayers = _numSharedLayers + _numStepSpecificLayers;
    int layerIdx = totalLayers - 1;

    // Backward through step-specific layers (reverse order)
    for (int i = _numStepSpecificLayers - 1; i >= 0; i--)
    {
        // Undo residual scaling if applied
        if (_intermediateOutputs[layerIdx].Shape[1] == currentGrad.Shape[1] && layerIdx > 0)
        {
            currentGrad = Engine.TensorMultiplyScalar(currentGrad, NumOps.FromDouble(Math.Sqrt(0.5)));
            // Add residual gradient path
        }

        // GLU backward
        currentGrad = BackwardGLU(currentGrad, /* cached pre-GLU output */);

        // BN backward
        currentGrad = _stepBNLayers[i].Backward(currentGrad);

        // FC backward
        currentGrad = _stepFCLayers[i].Backward(currentGrad);

        layerIdx--;
    }

    // Backward through shared layers (reverse order)
    for (int i = _numSharedLayers - 1; i >= 0; i--)
    {
        // Similar backward logic for shared layers
        currentGrad = _sharedBNLayers[i].Backward(currentGrad);
        currentGrad = _sharedFCLayers[i].Backward(currentGrad);
        layerIdx--;
    }

    return currentGrad;
}

private Tensor<T> BackwardGLU(Tensor<T> outputGrad, Tensor<T> preGluInput)
{
    // Implement GLU backward: d/d(values) and d/d(gates) via chain rule
    // through sigmoid and element-wise multiply
    throw new NotImplementedException("Complete GLU backward implementation required");
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/FeatureTransformerLayer.cs` around lines 59 - 60,
FeatureTransformerLayer declares SupportsTraining => true but lacks the required
Backward implementation; implement public override Tensor<T> Backward(Tensor<T>
outputGradient) to perform full backprop using the cached _intermediateOutputs:
iterate reverse through step-specific layers then shared layers, undo any
residual scaling (apply Engine.TensorMultiplyScalar with
NumOps.FromDouble(Math.Sqrt(0.5)) where residuals were scaled), call each
layer's .Backward (e.g. _stepFCLayers[i].Backward, _stepBNLayers[i].Backward,
_sharedFCLayers[i].Backward, _sharedBNLayers[i].Backward), implement a
BackwardGLU helper to backprop through the GLU (compute gradients wrt gate via
sigmoid derivative and element-wise multiply for values and gates), decrement
layerIdx in sync with _intermediateOutputs and finally return the gradient
w.r.t. the input.

206-220: ⚠️ Potential issue | 🟠 Major

Rented tensors values and gates are never returned to the pool — resource leak.

The PR's goal is zero-allocation forward passes via pooled allocators. However, after values and gates are consumed by Engine.Sigmoid and Engine.TensorMultiply, they're abandoned without being returned. This defeats the pooling benefit and may cause unbounded memory growth under repeated forward passes.

🛠️ Proposed fix: Return rented tensors after use
     // Apply sigmoid to gates and multiply with values
     var sigmoidGates = Engine.Sigmoid(gates);
-    return Engine.TensorMultiply(values, sigmoidGates);
+    var result = Engine.TensorMultiply(values, sigmoidGates);
+
+    // Return rented tensors to the pool
+    TensorAllocator.Return(values);
+    TensorAllocator.Return(gates);
+
+    return result;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/FeatureTransformerLayer.cs` around lines 206 - 220,
The rented tensors values and gates (from TensorAllocator.Rent<T>) and the
intermediate sigmoidGates produced by Engine.Sigmoid are not returned to the
pool; wrap the computation in a try/finally, assign the final output from
Engine.TensorMultiply to a local (e.g., result), then in finally call the
allocator return method for values, gates and sigmoidGates (e.g.,
TensorAllocator.Return<T>(values), TensorAllocator.Return<T>(gates),
TensorAllocator.Return<T>(sigmoidGates)) before returning result; ensure you
reference the existing symbols Engine.Sigmoid, Engine.TensorMultiply,
TensorAllocator.Rent<T>, and the local variables values/gates/sigmoidGates when
applying the fix.
src/NeuralNetworks/Layers/DigitCapsuleLayer.cs (1)

438-458: ⚠️ Potential issue | 🔴 Critical

BLOCKING: Missing TensorAllocator.Return() calls — resource leak violating pooling contract.

This code violates AiDotNet's standard rent/return pooling lifecycle. The predictions (line 438), couplings (line 458), and output tensors are rented from the pool but never explicitly returned.

The pattern is further broken by variable reassignment without cleanup:

  • Line 479: output = activated; orphans the originally rented tensor
  • Line 486: couplings = Engine.TensorAdd(couplings, dot); orphans the originally rented tensor

AiDotNet's pooling contract (System.Buffers.ArrayPool-backed) requires: Rent → Use → Return buffer. Failure to return causes pool starvation and memory pressure in production workloads.

Every rented tensor must be explicitly returned before reassignment or method exit. Refactor to add Return calls:

Example fix pattern
var predictions = TensorAllocator.Rent<T>([...]);
try {
    // ... use predictions
} finally {
    TensorAllocator.Return(predictions);
}

Or restructure to avoid reassignment without explicit cleanup.
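For the reassignment case specifically, the key is to return the old buffer before the variable is rebound. A minimal sketch, assuming the `TensorAllocator.Return` API from this PR; `dot` is the illustrative operand from the review comment above:

```csharp
// Sketch: release the previously rented buffer before reassigning,
// so the original tensor is never orphaned.
var updated = Engine.TensorAdd(couplings, dot); // produces a new tensor
TensorAllocator.Return(couplings);              // release the old rented buffer
couplings = updated;                            // safe to reassign now
```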

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/DigitCapsuleLayer.cs` around lines 438 - 458, The
code rents tensors (predictions, couplings, output) via TensorAllocator.Rent but
never returns them, causing a pooling leak; update the DigitCapsuleLayer method
to ensure every TensorAllocator.Rent<T>(...) has a matching
TensorAllocator.Return(...) in all control-flow paths (including before any
reassignment like output = activated and couplings = Engine.TensorAdd(...)),
e.g., wrap usage in try/finally or return the original tensor before
reassigning, and ensure temporary tensors produced by operations such as
Engine.TensorAdd and any "activated" tensors are also returned when no longer
needed so that predictions, couplings, output, and any intermediates are always
returned to the pool.
src/NeuralNetworks/Layers/SpatialTransformerLayer.cs (1)

743-783: ⚠️ Potential issue | 🔴 Critical

BLOCKING: Rented tensor lacks proper lifecycle management for pooled allocation.

The tensor rented via TensorAllocator.Rent<T> is stored in _lastTransformationMatrix without proper disposal semantics. On subsequent forward passes, this field is overwritten without returning the previously rented tensor to the pool.

AiDotNet.Tensors uses pooled Memory<T> backing with IDisposable cleanup required to safely return buffers. Without explicit return, this causes:

  1. Pool exhaustion over repeated forward calls
  2. Data corruption if the pool reuses unreturned buffers before the layer finishes using them
  3. Memory bloat in production scenarios with many layer instances

The fix requires proper lifecycle management:

  • Either use try-finally to guarantee return of rented tensors, or
  • Wrap rented tensors in using scopes to ensure Dispose is called

Ensure the tensor is returned to the pool immediately after use in ConvertToTransformationMatrix, or redesign to allocate once and reuse safely without pooling semantics.
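One way to express the field-overwrite path is to release the previously cached tensor before storing the new one. A minimal sketch, assuming the `TensorAllocator.Rent`/`Return` API from this PR; the shape `[batchSize, 2, 3]` is illustrative:

```csharp
// Sketch: return the previously cached rented tensor before overwriting
// the field, so repeated forward passes do not exhaust the pool.
var matrix = TensorAllocator.Rent<T>(new[] { batchSize, 2, 3 });
// ... populate matrix from the localization network output ...
if (_lastTransformationMatrix != null)
    TensorAllocator.Return(_lastTransformationMatrix); // release previous buffer
_lastTransformationMatrix = matrix;
```

A matching release in the layer's dispose/reset path would cover the final cached buffer.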

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SpatialTransformerLayer.cs` around lines 743 - 783,
ConvertToTransformationMatrix currently calls TensorAllocator.Rent<T> and
assigns the result (via tensor) to the layer field _lastTransformationMatrix
without returning/disposing any previously rented buffer, leaking pooled
Memory<T>. Fix by ensuring pooled tensors are returned to the allocator before
overwriting: before assigning _lastTransformationMatrix = tensor, check if
_lastTransformationMatrix is non-null and call its Dispose/Return method (or
TensorAllocator.Return) in a try/finally; alternatively, allocate a non-pooled
tensor or implement a reuse path that disposes the rented tensor when the layer
is disposed. Reference: ConvertToTransformationMatrix, TensorAllocator.Rent<T>,
and the field _lastTransformationMatrix.
src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs (2)

796-808: 🧹 Nitpick | 🔵 Trivial

Minor allocation from .ToArray() and continued syntax inconsistency.

Line 796 calls dOutput.Shape.ToArray(), which allocates a new array on every backward call and partially defeats the zero-allocation goal. Consider caching the shape or using a stackalloc pattern if Shape is span-compatible.

Additionally, line 808 uses a collection expression ([...]) while line 796 uses .ToArray() — another instance of the mixed syntax noted elsewhere in this file.

-        var dPreNorm = TensorAllocator.Rent<T>(dOutput.Shape.ToArray());
+        var dPreNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]);

(Using explicit dimensions avoids the .ToArray() allocation since batchSize, seqLen, and _emaDimension are already known.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs` around lines 796 - 808, The
allocation dPreNorm uses TensorAllocator.Rent<T>(dOutput.Shape.ToArray()) which
creates an unnecessary array each backward call; replace that call with an
explicit dimensions array using the known locals (e.g. new[] { batchSize,
seqLen, _emaDimension }) so it becomes TensorAllocator.Rent<T>(new[] {
batchSize, seqLen, _emaDimension }); also make the syntax consistent with the
later allocation for dEmaInput3D by using the same explicit new[] { ... } form
(or stackalloc if Shape is already a Span<int>) instead of mixing .ToArray() and
collection-expression syntax.

462-517: 🧹 Nitpick | 🔵 Trivial

Inconsistent syntax between collection expressions and array initializers.

CEMAKernelForward uses C# 12 collection expressions ([...]) while CEMAForward uses new[] {...} syntax. Both work, but consistency improves readability.

-        var statesReal = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension });
-        var statesImag = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _emaDimension });
-        var outputPreNorm = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _emaDimension });
+        var statesReal = TensorAllocator.Rent<T>([batchSize, seqLen + 1, _emaDimension]);
+        var statesImag = TensorAllocator.Rent<T>([batchSize, seqLen + 1, _emaDimension]);
+        var outputPreNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]);

Same applies to lines 516-517:

-        var outputNorm = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _emaDimension });
-        var stdInv = TensorAllocator.Rent<T>(new[] { batchSize, seqLen });
+        var outputNorm = TensorAllocator.Rent<T>([batchSize, seqLen, _emaDimension]);
+        var stdInv = TensorAllocator.Rent<T>([batchSize, seqLen]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs` around lines 462 - 517, The
code mixes C# 12 collection expressions and legacy array initializers (e.g.,
CEMAKernelForward uses [...] while CEMAForward and the allocations for
statesReal, statesImag, outputPreNorm, outputNorm, stdInv use new[] { ... }), so
make them consistent: pick the project's preferred style (prefer C# 12
collection expressions if the project targets that language version) and update
all tensor allocation and array/collection initializers in MegalodonLayer.cs
(references: statesReal, statesImag, outputPreNorm, outputNorm, stdInv) and the
methods CEMAKernelForward / CEMAForward to use the same syntax throughout.
Ensure the changed syntax compiles under the repository's LangVersion and run
the unit/build to verify.
src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs (1)

452-474: ⚠️ Potential issue | 🟠 Major

Memory leak in ResetState: Rented tensors should be returned before nulling.

When ResetState() is called, the fields _lastKVGate, _lastKVGateRaw, and _lastStates (which were allocated via TensorAllocator.Rent) are set to null without being returned to the pool. This leaks the rented memory.

🔧 Proposed fix to return rented tensors
 public override void ResetState()
 {
     _lastInput = null;
     _lastOutput = null;
     _lastQuery = null;
     _lastKey = null;
     _lastValue = null;
+    if (_lastKVGate != null)
+        TensorAllocator.Return(_lastKVGate);
     _lastKVGate = null;
+    if (_lastKVGateRaw != null)
+        TensorAllocator.Return(_lastKVGateRaw);
     _lastKVGateRaw = null;
     _lastOutputGate = null;
     _lastOutputGateRaw = null;
+    if (_lastStates != null)
+        TensorAllocator.Return(_lastStates);
     _lastStates = null;
     _lastRecurrenceOutput = null;
     _originalInputShape = null;
     // ... gradient fields unchanged
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs` around lines 452 -
474, The ResetState method currently nulls out rented tensors causing leaks;
before setting fields to null, check each rented buffer (at least _lastKVGate,
_lastKVGateRaw, and _lastStates) and return them to the allocator (e.g.,
TensorAllocator.Return or the project's equivalent) if they are non-null, then
set the fields to null; apply the same pattern for any other fields that hold
rented tensors (inspect _lastInput/_lastOutput/_lastQuery/_lastKey/_lastValue
etc.) to ensure all rented resources are returned before nulling.
♻️ Duplicate comments (4)
src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)

407-408: ⚠️ Potential issue | 🔴 Critical

Blocking: initialize rented recurrence state before first timestep read.

This is a blocking correctness issue. Line 427 reads state[...] before any write in this method, and allStates[..., 0, ...] is never initialized. If TensorAllocator.Rent<T> returns non-zero memory, forward becomes nondeterministic.

💡 Suggested fix
         var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension });
         var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension });
+        state.Fill(NumOps.Zero);
+        for (int bi = 0; bi < batchSize; bi++)
+            for (int hi = 0; hi < _numHeads; hi++)
+                for (int di = 0; di < _headDimension; di++)
+                    for (int dj = 0; dj < _headDimension; dj++)
+                        allStates[new[] { bi, 0, hi, di, dj }] = NumOps.Zero;

As per coding guidelines, "Every PR must contain production-ready code" and "All methods have complete, production-ready implementations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 407 - 408,
The rented tensors state and allStates must be initialized before any read:
after creating state with TensorAllocator.Rent<T>(...) and allStates with
TensorAllocator.Rent<T>(...), explicitly zero (or set to the correct initial
recurrence values) every element of state and set allStates[..., 0, ...] = state
(or the chosen initial values) so the first timestep read is deterministic;
locate the allocations around the state and allStates variables in the
Forward/Process method and add initialization logic that fills the buffers with
zeros or the appropriate initial hidden/state values before any indexed reads
occur.
src/NeuralNetworks/Layers/SSM/S5Layer.cs (1)

507-512: ⚠️ Potential issue | 🔴 Critical

Blocking: zero the rented recurrent state before the first scan step.

hReal/hImag are read at Line 523 and Line 524 before any write, and the [*, 0, *] slice of allHiddenReal/allHiddenImag is later consumed as the base state at Line 846 and Line 847. After switching from new Tensor<T> to TensorAllocator.Rent<T>, the scan can now depend on stale pooled contents.

Suggested fix
 var hReal = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension });
 var hImag = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension });
+hReal.Fill(NumOps.Zero);
+hImag.Fill(NumOps.Zero);
 var allHiddenReal = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _stateDimension });
 var allHiddenImag = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _stateDimension });
+for (int bi = 0; bi < batchSize; bi++)
+    for (int n = 0; n < _stateDimension; n++)
+    {
+        allHiddenReal[new[] { bi, 0, n }] = NumOps.Zero;
+        allHiddenImag[new[] { bi, 0, n }] = NumOps.Zero;
+    }

As per coding guidelines, "Every PR must contain production-ready code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs` around lines 507 - 512, The rented
tensors hReal, hImag and the buffers allHiddenReal/allHiddenImag returned by
TensorAllocator.Rent<T> can contain stale pooled data; before the first scan
step (before any reads at the scan start and before storing/consuming the
[*,0,*] slice) explicitly zero-fill hReal and hImag and zero the
[batch,0,stateDim] slice of allHiddenReal/allHiddenImag (or zero the entire
allocated buffers) so the scan and subsequent consumption use a deterministic
zero initial state; locate the allocations of hReal/hImag and
allHiddenReal/allHiddenImag and insert a clear/Fill(0) call immediately after
each Rent<T> and before the scan/first-read points.
src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs (1)

312-312: ⚠️ Potential issue | 🟠 Major

Still zero the masked half of _lastAttnWeights.

Only the causal half is written later, so this swap now depends on TensorAllocator.Rent<T> returning zeroed memory for tk > tq. Forward output is fine, but the cached _lastAttnWeights is no longer guaranteed to have zeros in masked positions.

Suggested fix
-        var allAttnWeights = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen });
+        var allAttnWeights = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen });
+        allAttnWeights.Fill(NumOps.Zero);

This verifies whether Rent clears buffers and shows the causal write pattern that leaves the upper triangle untouched:

Verification script
#!/bin/bash
set -euo pipefail

echo "== TensorAllocator implementation =="
fd -i 'TensorAllocator*.cs' . -x sed -n '1,240p' {}

echo
echo "== MLA causal attention writes =="
sed -n '307,373p' src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs` at line 312, The
rented buffer allAttnWeights from TensorAllocator.Rent<T> in
MultiLatentAttentionLayer currently relies on Rent returning zeros for masked
positions, but only the causal (lower-triangle) entries are written so
_lastAttnWeights may contain garbage in the masked half; fix by explicitly
zeroing the masked/upper-triangle region (or the entire tensor) immediately
after creating allAttnWeights (and before swapping into _lastAttnWeights) so
that entries where tk > tq are guaranteed zeroed regardless of TensorAllocator
behavior; locate the allocation site of allAttnWeights and ensure an explicit
zero-fill of the appropriate indices (or whole buffer) before further use.
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)

314-341: ⚠️ Potential issue | 🟠 Major

Clone() still shares workspace buffers — unresolved from previous review.

The workspace is created in InitializeParameters() but RWKV7Block<T> does not override Clone(). Since LayerBase<T>.Clone() uses MemberwiseClone(), cloned instances will share the same Workspace reference, causing buffer corruption when both instances run forward passes.

Factor the workspace initialization into a helper method and call it from both the constructor and a new Clone() override:

🔧 Proposed fix
+ private void InitializeWorkspace()
+ {
+     Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12);
+     // TimeMixing timestep buffers
+     Workspace.DeclareTimestep(TsRInput, _modelDimension);
+     Workspace.DeclareTimestep(TsKInput, _modelDimension);
+     Workspace.DeclareTimestep(TsVInput, _modelDimension);
+     Workspace.DeclareTimestep(TsAInput, _modelDimension);
+     Workspace.DeclareTimestep(TsBInput, _modelDimension);
+     Workspace.DeclareTimestep(TsWkvOut, _modelDimension);
+     Workspace.DeclareTimestep(TsYt, _modelDimension);
+     // ChannelMixing timestep buffers
+     Workspace.DeclareTimestep(TsCmRInput, _modelDimension);
+     Workspace.DeclareTimestep(TsCmKInput, _modelDimension);
+     Workspace.DeclareTimestep(TsCmKSiLU, _ffnDimension);
+     // TimeMixing sequence buffers
+     Workspace.DeclareSequence(SqAllR, _modelDimension);
+     Workspace.DeclareSequence(SqAllK, _modelDimension);
+     Workspace.DeclareSequence(SqAllV, _modelDimension);
+     Workspace.DeclareSequence(SqAllA, _modelDimension);
+     Workspace.DeclareSequence(SqAllB, _modelDimension);
+     Workspace.DeclareSequence(SqAllWkv, _modelDimension);
+     Workspace.DeclareSequence(SqAllWkvPre, _modelDimension);
+     Workspace.DeclareSequence(SqAllWkvGated, _modelDimension);
+     // ChannelMixing sequence buffers
+     Workspace.DeclareSequence(SqCmAllRGate, _modelDimension);
+     Workspace.DeclareSequence(SqCmAllVProj, _modelDimension);
+     Workspace.DeclareSequence(SqCmAllSiLU, _ffnDimension);
+     Workspace.DeclareSequence(SqCmAllKProj, _ffnDimension);
+ }
+
+ public override LayerBase<T> Clone()
+ {
+     var clone = (RWKV7Block<T>)MemberwiseClone();
+     clone.InitializeWorkspace();
+     return clone;
+ }

  private void InitializeParameters()
  {
      // ... existing parameter init ...
-     // Arena workspace: pre-allocate forward pass buffers for zero-allocation hot path
-     Workspace = new LayerWorkspace<T>(timestepCount: 10, sequenceCount: 12);
-     // ... all DeclareTimestep/DeclareSequence calls ...
+     InitializeWorkspace();
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` around lines 314 - 341,
RWKV7Block<T> currently creates a LayerWorkspace<T> in InitializeParameters()
but does not override Clone(), so LayerBase<T>.Clone()'s MemberwiseClone()
leaves cloned instances sharing the same Workspace; fix by adding a private
helper (e.g., InitializeWorkspace()) that performs the Workspace = new
LayerWorkspace<T>(...) and DeclareTimestep/DeclareSequence calls (the same logic
currently in InitializeParameters), call that helper from InitializeParameters()
and the class constructor, and override Clone() in RWKV7Block<T> to call
base.Clone(), cast to RWKV7Block<T>, then invoke InitializeWorkspace() on the
cloned instance to replace the shared Workspace with a fresh one before
returning the clone; ensure you reference the Workspace field and methods
TsRInput/TsKInput/.../SqCmAllKProj so the same buffers are declared for the
clone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f7a8a0db-8133-48a9-93fe-1246db4c6a6d

📥 Commits

Reviewing files that changed from the base of the PR and between be4cd7c and 1e3a8bd.

📒 Files selected for processing (54)
  • src/NeuralNetworks/Layers/ALiBiPositionalBiasLayer.cs
  • src/NeuralNetworks/Layers/AttentionLayer.cs
  • src/NeuralNetworks/Layers/BatchEnsembleLayer.cs
  • src/NeuralNetworks/Layers/BidirectionalLayer.cs
  • src/NeuralNetworks/Layers/CapsuleLayer.cs
  • src/NeuralNetworks/Layers/ConditionalRandomFieldLayer.cs
  • src/NeuralNetworks/Layers/ConvLSTMLayer.cs
  • src/NeuralNetworks/Layers/DigitCapsuleLayer.cs
  • src/NeuralNetworks/Layers/DirectionalGraphLayer.cs
  • src/NeuralNetworks/Layers/EdgeConditionalConvolutionalLayer.cs
  • src/NeuralNetworks/Layers/FeatureTransformerLayer.cs
  • src/NeuralNetworks/Layers/GRULayer.cs
  • src/NeuralNetworks/Layers/GraphAttentionLayer.cs
  • src/NeuralNetworks/Layers/GraphConvolutionalLayer.cs
  • src/NeuralNetworks/Layers/GraphIsomorphismLayer.cs
  • src/NeuralNetworks/Layers/GraphTransformerLayer.cs
  • src/NeuralNetworks/Layers/HeterogeneousGraphLayer.cs
  • src/NeuralNetworks/Layers/HyperbolicLinearLayer.cs
  • src/NeuralNetworks/Layers/InteractingLayer.cs
  • src/NeuralNetworks/Layers/IntersampleAttentionLayer.cs
  • src/NeuralNetworks/Layers/MemoryWriteLayer.cs
  • src/NeuralNetworks/Layers/MessagePassingLayer.cs
  • src/NeuralNetworks/Layers/ObliviousDecisionTreeLayer.cs
  • src/NeuralNetworks/Layers/PiecewiseLinearEncodingLayer.cs
  • src/NeuralNetworks/Layers/PrincipalNeighbourhoodAggregationLayer.cs
  • src/NeuralNetworks/Layers/QuantumLayer.cs
  • src/NeuralNetworks/Layers/RepParameterizationLayer.cs
  • src/NeuralNetworks/Layers/SSM/ABCLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs
  • src/NeuralNetworks/Layers/SSM/HedgehogLayer.cs
  • src/NeuralNetworks/Layers/SSM/HybridBlockScheduler.cs
  • src/NeuralNetworks/Layers/SSM/HyenaLayer.cs
  • src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/LonghornLayer.cs
  • src/NeuralNetworks/Layers/SSM/MegalodonLayer.cs
  • src/NeuralNetworks/Layers/SSM/MesaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs
  • src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
  • src/NeuralNetworks/Layers/SSM/RebasedLayer.cs
  • src/NeuralNetworks/Layers/SSM/RetNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/RodimusLayer.cs
  • src/NeuralNetworks/Layers/SSM/S4DLayer.cs
  • src/NeuralNetworks/Layers/SSM/S5Layer.cs
  • src/NeuralNetworks/Layers/SSM/TTTLayer.cs
  • src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs
  • src/NeuralNetworks/Layers/SoftTreeLayer.cs
  • src/NeuralNetworks/Layers/SpatialTransformerLayer.cs
  • src/NeuralNetworks/Layers/SpikingLayer.cs

…dation, docs

Critical correctness fixes:
- Reverted 21 recurrent state buffer allocations back to new Tensor<T>()
  for guaranteed zero initialization (ExtendedLSTM, HGRN, MEGA, MinLSTM,
  MixtureOfMemories, MultiLatentAttention, RealGatedLinearRecurrence,
  Rebased gradient accumulators, S5, TransNormerLLM)
- ForwardArena.Reset: avoid concurrent modification by copying Keys first
- ForwardArena.RentUninitialized: no longer clears on grow path
- ShapeKey: supports arbitrary rank (was limited to 4 dims)
- LayerWorkspace: constructor and DeclareTimestep/Sequence validate bounds
- RWKV7Block: moved BeginForward to main Forward() so ChannelMixing also works
- RWKV7Block: removed unused TsYt buffer, cleaned up wkvOutput null check
- LayerBase: fixed duplicate <summary> XML tag
- Benchmark: includes BeginForward in workspace measurement

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 02:23
@vercel

vercel bot commented Apr 5, 2026

Deployment failed with the following error:

Resource is limited - try again in 1 day (more than 100, code: "api-deployments-free-per-day").

Learn More: https://vercel.com/franklins-projects-02a0b5a0?upgradeToPro=build-rate-limit

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 80 out of 80 changed files in this pull request and generated 19 comments.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (6)
src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)

527-530: ⚠️ Potential issue | 🟠 Major

Remove remaining raw allocation in SiLU derivative hot path.

new Tensor<T>(...) here reintroduces avoidable allocation churn. Keep this path pooled to stay consistent with the PR’s zero-allocation/perf goal.

♻️ Suggested fix
 private Tensor<T> ComputeSiLUDerivative(Tensor<T> x)
 {
     var sig = Engine.Sigmoid(x);
     var oneMinusSig = Engine.ScalarMinusTensor(NumOps.One, sig);
     var xTimesOneMinusSig = Engine.TensorMultiply(x, oneMinusSig);
-    var ones_t = new Tensor<T>(xTimesOneMinusSig.Shape.ToArray()); ones_t.Fill(NumOps.One);
-    var onePlusXSig = Engine.TensorAdd(ones_t, xTimesOneMinusSig);
+    var ones = TensorAllocator.Rent<T>(xTimesOneMinusSig.Shape.ToArray());
+    ones.Fill(NumOps.One);
+    var onePlusXSig = Engine.TensorAdd(ones, xTimesOneMinusSig);
     return Engine.TensorMultiply(sig, onePlusXSig);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 527 - 530,
Replace the raw allocation new Tensor<T>(...) used to create ones_t with the
tensor-pooling API so the SiLU derivative hot path stays zero-allocation:
instead of "var ones_t = new Tensor<T>(xTimesOneMinusSig.Shape.ToArray());
ones_t.Fill(NumOps.One);" acquire a pooled tensor with the same shape (using the
Engine method that returns a pooled tensor for a given shape or a "like" tensor,
e.g. Engine.TensorLike/Engine.CreatePooledTensor/Engine.AcquireTensorWithShape
of xTimesOneMinusSig), call Fill(NumOps.One) on that pooled tensor, use it in
the existing Engine.TensorAdd to produce onePlusXSig, and ensure you follow the
layer's tensor lifetime conventions (release/Dispose the pooled tensor if
required) to avoid reintroducing allocation churn.
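Both this fix and the later `Engine.TensorAddScalar` rewrite hinge on the same closed form for the SiLU derivative: d/dx [x·σ(x)] = σ(x)·(1 + x·(1 − σ(x))). A quick numeric check of that identity (an illustrative Python sketch, not the repo's C# code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_derivative(x):
    # Closed form used by the layers: sig * (1 + x * (1 - sig))
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Compare against a central finite difference at a few sample points
for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    h = 1e-6
    numeric = (silu(x + h) - silu(x - h)) / (2 * h)
    assert abs(silu_derivative(x) - numeric) < 1e-5
```

Because the formula only combines `sig`, `x`, and scalar constants, the ones-tensor can be eliminated entirely rather than pooled, which is what the follow-up commit does.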
src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs (1)

288-288: ⚠️ Potential issue | 🔴 Critical

Blocking: zero-initialize rented recurrent state h.

Line 318 reads h before any write in timestep 0. With TensorAllocator.Rent, that can pull stale values and corrupt recurrence determinism.

🐛 Suggested fix
-        var h = TensorAllocator.Rent<T>(new[] { batchSize, _recurrenceDimension });
+        var h = TensorAllocator.Rent<T>(new[] { batchSize, _recurrenceDimension });
+        h.Fill(NumOps.Zero);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs` at line 288,
The rented recurrent state h (allocated with TensorAllocator.Rent<T>(...)) can
contain stale memory and must be zero-initialized before use in
RealGatedLinearRecurrenceLayer; change the allocation site that creates h to
zero it immediately after Rent (e.g., call the allocator's zeroing variant or
invoke a Clear/Fill(0) on h) so timestep 0 doesn't read uninitialized values and
recurrence determinism is preserved.
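The hazard behind this class of findings is easy to reproduce with a toy pool: a rent call that recycles buffers without clearing them hands stale state to the next forward pass. A minimal Python sketch (the `TensorPool` here is hypothetical, standing in for `TensorAllocator`):

```python
class TensorPool:
    """Toy pool: rent() reuses a returned buffer without clearing it."""
    def __init__(self):
        self._free = []

    def rent(self, size):
        return self._free.pop() if self._free else [0.0] * size

    def return_buffer(self, buf):
        self._free.append(buf)

pool = TensorPool()
h = pool.rent(4)
h[0] = 42.0           # simulate a previous forward pass writing state
pool.return_buffer(h)

h = pool.rent(4)      # second rent hands back the dirty buffer
assert h[0] == 42.0   # stale value survives: the recurrence would read garbage

# The fix mirrors the review suggestion: zero-fill immediately after renting
for i in range(len(h)):
    h[i] = 0.0
assert all(v == 0.0 for v in h)
```

This is why the PR reverted 21 recurrent state buffers to `new Tensor<T>()` (guaranteed zero-init) and why the remaining rented accumulators need an explicit `Fill(NumOps.Zero)`.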
src/NeuralNetworks/Layers/SSM/HGRNLayer.cs (1)

320-320: ⚠️ Potential issue | 🔴 Critical

Blocking: initialize rented hidden state to zero before recurrence.

Line 335 consumes h as previous state on timestep 0. With pooled memory, that can be stale unless explicitly zeroed.

🐛 Suggested fix
-        var h = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension });
+        var h = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension });
+        h.Fill(NumOps.Zero);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs` at line 320, The rented tensor h
(created via TensorAllocator.Rent<T>) in HGRNLayer.cs may contain stale pooled
memory and must be zero-initialized before being used as the previous hidden
state on timestep 0; after calling TensorAllocator.Rent<T>(new[] { batchSize,
_modelDimension }) set all elements of h to zero (e.g., a zero-fill or
Clear/Fill method appropriate to the tensor type) immediately after allocation
and before the recurrence loop, so the initial hidden state is deterministic.
src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs (1)

302-302: ⚠️ Potential issue | 🔴 Critical

Blocking: clear h_t after rent before additive accumulation.

Line 346 accumulates into h_t (+= semantics). Without zero-init, pooled residue contaminates outputs.

🐛 Suggested fix
-            var h_t = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension });
+            var h_t = TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension });
+            h_t.Fill(NumOps.Zero);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs` at line 302, The rented
tensor h_t from TensorAllocator.Rent<T>(...) must be zero-initialized before any
additive accumulation (you later do += into h_t), so immediately after var h_t =
TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }); clear its
contents to default/zero; do this by calling the allocator/ tensor clear helper
(e.g., TensorAllocator.Clear(h_t) or h_t.Fill(default(T))) or, if the tensor
exposes a buffer, Array.Clear(h_t.Buffer, 0, h_t.Length) so pooled residue
cannot contaminate later += operations.
src/NeuralNetworks/Layers/SSM/S5Layer.cs (1)

757-793: ⚠️ Potential issue | 🟠 Major

Do not make these gradient accumulators depend on allocator clearing.

dU, dhReal, and dhImag are accumulation buffers. The first writes at Lines 817, 826, 828, and 904 are additive, so this change is only safe if TensorAllocator.Rent<T>() is guaranteed to return cleared memory. Zero them locally after Rent() or keep raw allocation here.

Minimal safe change
 var dU = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _modelDimension });
+dU.Fill(NumOps.Zero);
 ...
 var dhReal = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension });
 var dhImag = TensorAllocator.Rent<T>(new[] { batchSize, _stateDimension });
+dhReal.Fill(NumOps.Zero);
+dhImag.Fill(NumOps.Zero);

As per coding guidelines, "Every PR must contain production-ready code."

Run this to confirm what TensorAllocator.Rent<T>() guarantees today:

#!/bin/bash
set -euo pipefail

allocator_file="$(fd -i 'TensorAllocator.cs' src | head -n1)"
printf 'TensorAllocator file: %s\n\n' "$allocator_file"
rg -n -C4 'class TensorAllocator|Rent<|Clear|Fill|Zero' "$allocator_file"

printf '\nAccumulator sites in S5Layer:\n\n'
sed -n '750,910p' src/NeuralNetworks/Layers/SSM/S5Layer.cs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs` around lines 757 - 793, The buffers
dU, dhReal, and dhImag are used as accumulation buffers but are obtained via
TensorAllocator.Rent<T>(), which may return non-zero memory; zero them
immediately after renting to avoid relying on allocator clearing. Locate the
Rent calls for dU, dhReal, and dhImag in S5Layer (references: dU, dhReal,
dhImag, TensorAllocator.Rent<T>()) and call a local zeroing operation (e.g.,
fill/clear the rented tensor memory to 0) before any additive writes occur, or
switch to a raw/new allocation that guarantees zeroed contents.
src/Memory/LayerWorkspace.cs (1)

32-33: ⚠️ Potential issue | 🔴 Critical

Blocking: validate sizes and slot indices before the fast path.

A fresh BeginForward(0, 0) returns immediately because _lastBatchSize and _lastSeqLen also start at 0, so nothing gets allocated and the next accessor fails later with InvalidOperationException. Timestep() and Sequence() also surface raw IndexOutOfRangeException for bad slots instead of a stable contract.

Minimal guardrails
-    private int _lastBatchSize;
-    private int _lastSeqLen;
+    private int _lastBatchSize = -1;
+    private int _lastSeqLen = -1;
...
     public void BeginForward(int batchSize, int seqLen = 1)
     {
+        if (batchSize <= 0)
+            throw new ArgumentOutOfRangeException(nameof(batchSize));
+        if (seqLen <= 0)
+            throw new ArgumentOutOfRangeException(nameof(seqLen));
         if (batchSize == _lastBatchSize && seqLen == _lastSeqLen)
             return;
...
-    public Tensor<T> Timestep(int index) => _timestepBuffers[index]
+    public Tensor<T> Timestep(int index)
+    {
+        if ((uint)index >= (uint)_timestepBuffers.Length)
+            throw new ArgumentOutOfRangeException(nameof(index));
+        return _timestepBuffers[index]
         ?? throw new InvalidOperationException($"Timestep buffer {index} not allocated. Call BeginForward first.");
+    }
...
-    public Tensor<T> Sequence(int index) => _sequenceBuffers[index]
+    public Tensor<T> Sequence(int index)
+    {
+        if ((uint)index >= (uint)_sequenceBuffers.Length)
+            throw new ArgumentOutOfRangeException(nameof(index));
+        return _sequenceBuffers[index]
         ?? throw new InvalidOperationException($"Sequence buffer {index} not allocated. Call BeginForward first.");
+    }

As per coding guidelines, "Every PR must contain production-ready code" and production-ready code must not ship with "missing validation of external inputs".

Also applies to: 75-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Memory/LayerWorkspace.cs` around lines 32 - 33, BeginForward currently
fast-paths when batch/seq equal _lastBatchSize/_lastSeqLen which allows
BeginForward(0,0) to return without allocations and later accessors throw; add
explicit validation at the start of BeginForward(batchSize, seqLen) to reject
non-positive sizes or to allocate/initialize state when either argument differs
from _lastBatchSize/_lastSeqLen (update _lastBatchSize/_lastSeqLen only after
successful allocation). Also add bounds checks in Timestep(int slot) and
Sequence(int slot) to verify slot is within [0, _lastSeqLen) (or the correct
per-batch/per-seq range) and throw a stable InvalidOperationException with a
clear message if invalid instead of letting IndexOutOfRangeException surface;
reference these symbols: BeginForward, _lastBatchSize, _lastSeqLen, Timestep,
Sequence, and the accessor that reads slots.
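The guardrails the comment asks for amount to three checks: reject non-positive sizes, seed the memoized dimensions with a sentinel that can never match a valid call, and bounds-check slot indices before the null check. A Python analogue of the suggested shape (method and field names mirror the review, not the actual C# file):

```python
class LayerWorkspace:
    """Sketch of the suggested BeginForward/Timestep guardrails."""
    def __init__(self, timestep_slots):
        self._timestep_buffers = [None] * timestep_slots
        self._last_batch_size = -1   # sentinel: -1 means "never allocated",
        self._last_seq_len = -1      # so begin_forward(0, 0) can't fast-path

    def begin_forward(self, batch_size, seq_len=1):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if seq_len <= 0:
            raise ValueError("seq_len must be positive")
        if batch_size == self._last_batch_size and seq_len == self._last_seq_len:
            return  # fast path: buffers already sized for these dimensions
        self._timestep_buffers = [
            [0.0] * batch_size for _ in self._timestep_buffers
        ]
        # Update memoized sizes only after successful allocation
        self._last_batch_size = batch_size
        self._last_seq_len = seq_len

    def timestep(self, index):
        if not 0 <= index < len(self._timestep_buffers):
            raise IndexError(f"timestep slot {index} out of range")
        buf = self._timestep_buffers[index]
        if buf is None:
            raise RuntimeError(
                f"Timestep buffer {index} not allocated. Call begin_forward first.")
        return buf
```

With the sentinel at -1, the first real call always allocates, and invalid slots fail with a stable, descriptive error instead of a raw out-of-range exception.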
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs`:
- Around line 72-75: Update the stale XML doc comment in
ArenaAllocationBenchmarks.cs that describes the Baseline allocation count:
recalc using TimestepsPerForward (which is 32) and change the text to state
"Creates 8 sequence + 7×32 timestep tensors = 232 allocations per forward pass"
(or equivalent wording and formula showing 8 + 7×TimestepsPerForward = 232) so
the comment matches the actual benchmark parameters; locate the comment near the
ArenaAllocationBenchmarks class/method where TimestepsPerForward is referenced
and replace the incorrect 904 value and formula.

In `@src/Memory/ForwardArena.cs`:
- Around line 14-15: Change the public API surface by making the
allocator/internal memory types non-public: change the ForwardArena<T> class and
the ShapeKey type from public to internal (and any other occurrences at the
referenced location) so they are not exposed to consumers; ensure any
constructors, methods, or nested types remain internal, and if external tests or
assemblies need access, grant it via InternalsVisibleTo rather than keeping
those types public.
- Around line 137-159: The ShapeKey constructor currently stores the incoming
mutable array reference (_dims = shape) which allows external mutation to break
dictionary behavior; change the constructor in ShapeKey to make a defensive copy
of the provided int[] (e.g., allocate a new int[shape.Length] and copy elements
or use shape.Clone/Array.Copy) and assign that copy to _dims, and also validate
shape is not null; leave GetHashCode(), Equals(object), and Equals(ShapeKey)
as-is so hashing and equality operate on the immutable internal copy.
- Around line 72-77: The Reset method allocates by creating a temporary List
from _cursors.Keys; instead iterate the dictionary's key collection directly to
avoid the allocation. Modify Reset (method Reset) to loop over _cursors.Keys
(the Dictionary<ShapeKey, ...>.KeyCollection) and set _cursors[key] = 0 for each
key, ensuring you do not add/remove keys (only update values) so the key
collection iteration remains valid and allocation-free.

In `@src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs`:
- Around line 536-540: The current SiLU-derivative code allocates a full-size
ones tensor via ones_t and onePlusXSig; instead compute sig * (1 + x*(1-sig))
algebraically as sig + x * sig * (1 - sig) to avoid the allocation: keep using
sig and oneMinusSig (from Engine.ScalarMinusTensor(NumOps.One, sig)), compute
temp = sig * oneMinusSig (using Engine.TensorMultiply), then xTimesTemp = x *
temp, and finally return Engine.TensorAdd(sig, xTimesTemp); update the code
around variables sig, oneMinusSig, x, TensorMultiply and TensorAdd to follow
this allocation-free sequence.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 671-673: ApplyGroupNorm currently allocates a new shape array each
call via input.Shape.ToArray(), reintroducing GC pressure in the per-token
TimeMixingForward loop; change ApplyGroupNorm to accept a precomputed/timestep
workspace shape or a pre-rented output Tensor from a caller-managed timestep
buffer (or expose a cached shape array) so it calls TensorAllocator.Rent with
that cached shape instead of input.Shape.ToArray(), and update callers (e.g.,
TimeMixingForward) to pass the workspace buffer or precomputed shape; keep
references: ApplyGroupNorm, TimeMixingForward, TensorAllocator.Rent, and
input.Shape.ToArray().
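The ShapeKey defensive-copy finding is the classic mutable-dictionary-key bug: if the key stores the caller's array by reference, a later mutation changes its hash and strands the entry. A Python sketch combining the copy with the FNV-1a hash the PR summary describes (this `ShapeKey` is a hypothetical analogue of the C# struct):

```python
class ShapeKey:
    """Value-style key over tensor dims; copies the input so callers
    can't mutate it after it lands in a dictionary."""
    FNV_OFFSET = 2166136261   # 32-bit FNV-1a offset basis
    FNV_PRIME = 16777619      # 32-bit FNV-1a prime

    def __init__(self, dims):
        if dims is None:
            raise ValueError("dims must not be None")
        self._dims = tuple(dims)  # defensive copy: later edits to dims are invisible

    def __hash__(self):
        h = self.FNV_OFFSET
        for d in self._dims:
            h = ((h ^ (d & 0xFFFFFFFF)) * self.FNV_PRIME) & 0xFFFFFFFF
        return h

    def __eq__(self, other):
        return isinstance(other, ShapeKey) and self._dims == other._dims

dims = [2, 3, 4]
slabs = {ShapeKey(dims): "slab"}
dims[0] = 99                                  # caller mutates the original array
assert slabs[ShapeKey([2, 3, 4])] == "slab"   # lookup still works
```

Without the `tuple(dims)` copy, the mutation would change the stored key's hash in a language where the key held the list itself, and the slab would become unreachable in the dictionary.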

---

Duplicate comments:
In `@src/Memory/LayerWorkspace.cs`:
- Around line 32-33: BeginForward currently fast-paths when batch/seq equal
_lastBatchSize/_lastSeqLen which allows BeginForward(0,0) to return without
allocations and later accessors throw; add explicit validation at the start of
BeginForward(batchSize, seqLen) to reject non-positive sizes or to
allocate/initialize state when either argument differs from
_lastBatchSize/_lastSeqLen (update _lastBatchSize/_lastSeqLen only after
successful allocation). Also add bounds checks in Timestep(int slot) and
Sequence(int slot) to verify slot is within [0, _lastSeqLen) (or the correct
per-batch/per-seq range) and throw a stable InvalidOperationException with a
clear message if invalid instead of letting IndexOutOfRangeException surface;
reference these symbols: BeginForward, _lastBatchSize, _lastSeqLen, Timestep,
Sequence, and the accessor that reads slots.

In `@src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs`:
- Line 302: The rented tensor h_t from TensorAllocator.Rent<T>(...) must be
zero-initialized before any additive accumulation (you later do += into h_t), so
immediately after var h_t = TensorAllocator.Rent<T>(new[] { batchSize,
_modelDimension }); clear its contents to default/zero; do this by calling the
allocator/tensor clear helper (e.g., TensorAllocator.Clear(h_t) or
h_t.Fill(default(T))) or, if the tensor exposes a buffer,
Array.Clear(h_t.Buffer, 0, h_t.Length) so pooled residue cannot contaminate
later += operations.

In `@src/NeuralNetworks/Layers/SSM/HGRNLayer.cs`:
- Line 320: The rented tensor h (created via TensorAllocator.Rent<T>) in
HGRNLayer.cs may contain stale pooled memory and must be zero-initialized before
being used as the previous hidden state on timestep 0; after calling
TensorAllocator.Rent<T>(new[] { batchSize, _modelDimension }) set all elements
of h to zero (e.g., a zero-fill or Clear/Fill method appropriate to the tensor
type) immediately after allocation and before the recurrence loop, so the
initial hidden state is deterministic.

In `@src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs`:
- Line 288: The rented recurrent state h (allocated with
TensorAllocator.Rent<T>(...)) can contain stale memory and must be
zero-initialized before use in RealGatedLinearRecurrenceLayer; change the
allocation site that creates h to zero it immediately after Rent (e.g., call the
allocator's zeroing variant or invoke a Clear/Fill(0) on h) so timestep 0
doesn't read uninitialized values and recurrence determinism is preserved.

In `@src/NeuralNetworks/Layers/SSM/S5Layer.cs`:
- Around line 757-793: The buffers dU, dhReal, and dhImag are used as
accumulation buffers but are obtained via TensorAllocator.Rent<T>(), which may
return non-zero memory; zero them immediately after renting to avoid relying on
allocator clearing. Locate the Rent calls for dU, dhReal, and dhImag in S5Layer
(references: dU, dhReal, dhImag, TensorAllocator.Rent<T>()) and call a local
zeroing operation (e.g., fill/clear the rented tensor memory to 0) before any
additive writes occur, or switch to a raw/new allocation that guarantees zeroed
contents.

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 527-530: Replace the raw allocation new Tensor<T>(...) used to
create ones_t with the tensor-pooling API so the SiLU derivative hot path stays
zero-allocation: instead of "var ones_t = new
Tensor<T>(xTimesOneMinusSig.Shape.ToArray()); ones_t.Fill(NumOps.One);" acquire
a pooled tensor with the same shape (using the Engine method that returns a
pooled tensor for a given shape or a "like" tensor, e.g.
Engine.TensorLike/Engine.CreatePooledTensor/Engine.AcquireTensorWithShape of
xTimesOneMinusSig), call Fill(NumOps.One) on that pooled tensor, use it in the
existing Engine.TensorAdd to produce onePlusXSig, and ensure you follow the
layer's tensor lifetime conventions (release/Dispose the pooled tensor if
required) to avoid reintroducing allocation churn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 27f7164a-d448-401f-8e50-caeb21a66af7

📥 Commits

Reviewing files that changed from the base of the PR and between 1e3a8bd and 554b23f.

📒 Files selected for processing (15)
  • AiDotNetBenchmarkTests/BenchmarkTests/ArenaAllocationBenchmarks.cs
  • src/Memory/ForwardArena.cs
  • src/Memory/LayerWorkspace.cs
  • src/NeuralNetworks/Layers/LayerBase.cs
  • src/NeuralNetworks/Layers/SSM/ExtendedLSTMLayer.cs
  • src/NeuralNetworks/Layers/SSM/HGRNLayer.cs
  • src/NeuralNetworks/Layers/SSM/MEGALayer.cs
  • src/NeuralNetworks/Layers/SSM/MinLSTMLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs
  • src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
  • src/NeuralNetworks/Layers/SSM/RealGatedLinearRecurrenceLayer.cs
  • src/NeuralNetworks/Layers/SSM/RebasedLayer.cs
  • src/NeuralNetworks/Layers/SSM/S5Layer.cs
  • src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs

…leanup

SiLU derivative optimization (18 files):
- Replaced ones_t tensor allocation + Fill(One) + TensorAdd pattern with
  Engine.TensorAddScalar(x, NumOps.One) — zero allocation for 1+x computation
- ABCLayer, GatedDeltaNet, GatedDeltaProduct, GatedSlotAttention, HGRN2,
  Hedgehog, Hyena, Kimi, LogLinear, MesaNet, MixtureOfMamba/Memories,
  MultiLatentAttention, Rebased, RetNet, Rodimus, TransNormerLLM, TTT

ForwardArena fixes:
- Reset() uses pre-cached keys array instead of allocating List each call
- ShapeKey defensive-copies the shape array to prevent mutation
- Made ForwardArena internal (implementation detail, not public API)

Other fixes:
- RWKV7Block: removed unused TsYt constant
- DiffusionConvLayer: removed duplicate using directive
- S6Scan: eliminated wasted Rent in initialState branch (clone instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
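The Reset fix in this commit swaps a per-call `new List(Keys)` snapshot for a cached key array that only grows when a new shape is first seen. A Python analogue of that amortization (names echo the PR, not the actual implementation):

```python
class ForwardArena:
    """Sketch of the Reset fix: cache keys once, reuse them every reset."""
    def __init__(self):
        self._cursors = {}
        self._cached_keys = []   # grows only when a new shape key appears

    def touch(self, key):
        # Bump the cursor for this shape, registering the key on first use
        if key not in self._cursors:
            self._cursors[key] = 0
            self._cached_keys.append(key)  # amortized: one append per new shape
        self._cursors[key] += 1
        return self._cursors[key]

    def reset(self):
        # Value updates only; iterating the cached list allocates no snapshot,
        # and since no keys are added or removed the iteration stays valid
        for key in self._cached_keys:
            self._cursors[key] = 0

arena = ForwardArena()
arena.touch((2, 3)); arena.touch((2, 3)); arena.touch((4,))
arena.reset()
assert all(v == 0 for v in arena._cursors.values())
```

Because forward passes reuse a small, stable set of shapes, the cached array quickly stops growing and `reset()` becomes allocation-free on the steady-state path.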

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)

485-499: ⚠️ Potential issue | 🔴 Critical

Blocking: remove the unused CreateOnesLike helper.

The SiLU rewrite at Lines 488-490 makes Lines 494-499 dead code. Remove it before merge.

🧹 Suggested cleanup
-    private Tensor<T> CreateOnesLike(Tensor<T> template)
-    {
-        var ones = new Tensor<T>(template.Shape.ToArray());
-        ones.Fill(NumOps.One);
-        return ones;
-    }

As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 485 - 499, Remove
the dead CreateOnesLike helper: delete the private method
CreateOnesLike(Tensor<T> template) since ComputeSiLUDerivative now computes SiLU
without it; search for any remaining references to CreateOnesLike and remove or
replace them, and run build/tests to ensure no other code depended on it (also
remove any now-unused imports or fields that existed solely for that helper).
src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs (1)

339-366: ⚠️ Potential issue | 🟠 Major

This forward path still allocates inside LightningAttentionForward.

Lines 340-366 remove some intermediates, but Forward() still reaches new Tensor<T>() for state and allStates at Lines 407-408 on every call. That keeps this layer off the zero-allocation path this PR is targeting.

♻️ Suggested follow-up
-        var state = new Tensor<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension });
-        var allStates = new Tensor<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension });
+        var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension });
+        var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension });
+        state.Fill(NumOps.Zero);
+        for (int bi = 0; bi < batchSize; bi++)
+            for (int hi = 0; hi < _numHeads; hi++)
+                for (int di = 0; di < _headDimension; di++)
+                    for (int dj = 0; dj < _headDimension; dj++)
+                        allStates[new[] { bi, 0, hi, di, dj }] = NumOps.Zero;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs` around lines 339 - 366,
The forward path still allocates inside LightningAttentionForward (Forward calls
new Tensor<T> for state and allStates), so change LightningAttentionForward to
avoid internal allocations by adding parameters (e.g., out or preallocated
Tensor parameters like stateOut, allStatesOut) or accept rented buffers and
write into them; update Forward to rent those tensors via
TensorAllocator.Rent<T>(...) and pass them into LightningAttentionForward (or
provide a reusable per-batch buffer on the layer) so no new Tensor<T>() is
created inside LightningAttentionForward; ensure _last... fields are set to the
rented/result tensors and release/return buffers consistently.
src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs (1)

388-404: ⚠️ Potential issue | 🔴 Critical

Blocking: drop the leftover CreateOnesLike helper.

The scalar-based SiLU rewrite made Lines 399-404 unreachable. Remove the helper in the same change.

🧹 Suggested cleanup
-    private Tensor<T> CreateOnesLike(Tensor<T> template)
-    {
-        var ones = new Tensor<T>(template.Shape.ToArray());
-        ones.Fill(NumOps.One);
-        return ones;
-    }

As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs` around lines 388 - 404, Remove
the now-unused CreateOnesLike helper method: delete the private Tensor<T>
CreateOnesLike(Tensor<T> template) method (the unused symbol CreateOnesLike)
from the file so no dead/unreachable code remains after the scalar-based SiLU
rewrite in ComputeSiLUDerivative; also search for and remove any remaining
references to CreateOnesLike to ensure no callers remain.
src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs (1)

442-458: ⚠️ Potential issue | 🔴 Critical

Blocking: remove the unused CreateOnesLike helper.

After the scalar-based rewrite at Lines 447-449, the private helper at Lines 453-458 is dead code. Keep the cleanup in the same refactor.

🧹 Suggested cleanup
-    private Tensor<T> CreateOnesLike(Tensor<T> template)
-    {
-        var ones = new Tensor<T>(template.Shape.ToArray());
-        ones.Fill(NumOps.One);
-        return ones;
-    }

As per coding guidelines, "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs` around lines 442 -
458, Remove the dead helper CreateOnesLike by deleting the private method
CreateOnesLike(Tensor<T> template) and any references to it; the
ComputeSiLUDerivative method now uses scalar ops and no longer needs this
helper, so ensure there are no remaining calls to CreateOnesLike and run
tests/compile to confirm cleanup.
src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs (1)

607-621: ⚠️ Potential issue | 🔴 Critical

Blocking: remove the orphaned CreateOnesLike helper.

The SiLU refactor no longer calls this private method, so the file now ships dead code from the change. Production-ready code should delete the orphaned helper instead of leaving refactoring leftovers behind.

🧹 Suggested cleanup
-    private Tensor<T> CreateOnesLike(Tensor<T> template)
-    {
-        var ones = new Tensor<T>(template.Shape.ToArray());
-        ones.Fill(NumOps.One);
-        return ones;
-    }
-
     #region Parameter Management

As per coding guidelines, "Production Readiness (CRITICAL - Flag as BLOCKING)" includes "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs` around lines 607 -
621, Remove the unused helper CreateOnesLike from the file to eliminate dead
code introduced during the SiLU refactor: delete the private method
CreateOnesLike(Tensor<T> template) and any references (none expected) so only
ComputeSiLUDerivative and active helpers remain; ensure no compilation or style
issues remain after removal and run tests/build to verify.
src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs (1)

465-482: ⚠️ Potential issue | 🔴 Critical

Blocking: delete the unused CreateOnesLike helper.

After switching ComputeSiLUDerivative() to scalar/tensor ops, this private method is dead code. Production-ready code should remove the stale helper instead of leaving refactoring debris in the layer.

🧹 Suggested cleanup
-    private Tensor<T> CreateOnesLike(Tensor<T> template)
-    {
-        var ones = new Tensor<T>(template.Shape.ToArray());
-        ones.Fill(NumOps.One);
-        return ones;
-    }
-
     private Tensor<T> DepthwiseConv1DBackward(

As per coding guidelines, "Production Readiness (CRITICAL - Flag as BLOCKING)" includes "Dead code: Commented-out code blocks, unreachable code paths, unused variables/parameters that suggest incomplete refactoring".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs` around lines 465 - 482,
Remove the dead helper CreateOnesLike (private Tensor<T>
CreateOnesLike(Tensor<T> template)) from the file since ComputeSiLUDerivative
now uses scalar/tensor ops and there are no callers; delete the entire method
(including the local ones variable and Fill call) and ensure no other code
references CreateOnesLike remain.
♻️ Duplicate comments (6)
src/NeuralNetworks/Layers/SSM/RodimusLayer.cs (1)

386-388: ⚠️ Potential issue | 🔴 Critical

Initialize state and the t=0 snapshot after renting.

Line 446 reads state before any write, and Lines 472-477 only save allStates at t + 1. With pooled buffers, both the recurrence base state and _lastStates[..., 0, ...] become stale.

🔧 Suggested fix
         var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension });
         var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension });
         var selectionWeightsCache = TensorAllocator.Rent<T>(new[] { batchSize, seqLen, _numHeads, _headDimension });
+        state.Fill(NumOps.Zero);
+        for (int bi = 0; bi < batchSize; bi++)
+            for (int hi = 0; hi < _numHeads; hi++)
+                for (int di = 0; di < _headDimension; di++)
+                    for (int ki = 0; ki < _headDimension; ki++)
+                        allStates[new[] { bi, 0, hi, di, ki }] = NumOps.Zero;

Also applies to: 446-477

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs` around lines 386 - 388, The
rented buffers state, allStates and selectionWeightsCache are used before being
initialized; after calling TensorAllocator.Rent<T>(...) you must initialize
state (e.g. zero or the proper initial recurrence state) and immediately write
the t=0 snapshot into allStates at index 0 so _lastStates/.../0/... isn't left
stale; update the code around TensorAllocator.Rent usage (symbols: state,
allStates, selectionWeightsCache, TensorAllocator.Rent) to set state to the
correct initial values and copy/assign that state into allStates[..., 0, ...]
before the time-loop and before any reads.
src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs (1)

296-327: ⚠️ Potential issue | 🔴 Critical

Zero-initialize expertStates before the scan.

Lines 314-327 read expertStates[..., t, si], but after Line 296 switched that buffer to TensorAllocator.Rent<T>(...) many expert/timestep cells no longer start at zero. Newly activated experts can now pick up stale pooled state.

🔧 Suggested fix
-        var expertStates = TensorAllocator.Rent<T>(new[] { batchSize, _numExperts, seqLen + 1, _stateDimension });
+        var expertStates = TensorAllocator.Rent<T>(new[] { batchSize, _numExperts, seqLen + 1, _stateDimension });
+        expertStates.Fill(NumOps.Zero);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs` around lines 296 - 327,
After renting the buffer with TensorAllocator.Rent<T> for expertStates in
MixtureOfMambaLayer, zero-initialize the entire tensor before the scan loop to
avoid stale pooled values being read; e.g., immediately after expertStates =
TensorAllocator.Rent<T>(...) clear all entries to NumOps.Zero (via a provided
TensorAllocator.Clear/Fill utility or a tight nested loop over batchSize,
_numExperts, seqLen+1, _stateDimension) so reads like expertStates[new[] { bi,
expertIdx, t, si }] see zeros for uninitialized experts.
src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs (1)

324-326: ⚠️ Potential issue | 🔴 Critical

Zero state and seed the t=0 snapshot.

Line 351 reads state before any write, and Lines 376-381 only populate allStates at t + 1. On a rented buffer that makes both the recurrence base state and _lastStates[..., 0, ...] nondeterministic.

🔧 Suggested fix
         var state = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, _headDimension, _headDimension });
         var allStates = TensorAllocator.Rent<T>(new[] { batchSize, seqLen + 1, _numHeads, _headDimension, _headDimension });
+        state.Fill(NumOps.Zero);
+        for (int bi = 0; bi < batchSize; bi++)
+            for (int hi = 0; hi < _numHeads; hi++)
+                for (int di = 0; di < _headDimension; di++)
+                    for (int vi = 0; vi < _headDimension; vi++)
+                        allStates[new[] { bi, 0, hi, di, vi }] = NumOps.Zero;

Also applies to: 351-381

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs` around lines 324 - 326, The
rented buffer "state" and the snapshot array "allStates" must be initialized to
deterministic values and the t=0 snapshot seeded so subsequent reads (e.g.,
where "state" is read before being written and where "_lastStates[..., 0, ...]"
is expected) are well-defined; modify the allocation code that uses
TensorAllocator.Rent<T>(...) to clear or fill "state" with zeros (or default(T))
immediately after rental and copy that initial zeroed "state" into allStates at
index t=0 (and into _lastStates slice for t=0) before entering the recurrence
loop so the recurrence base and stored _lastStates[..., 0, ...] are
deterministic. Ensure you reference the variables "state", "allStates",
"_lastStates", and the use of TensorAllocator.Rent<T>" when making the change.
src/Memory/ForwardArena.cs (1)

136-136: ⚠️ Potential issue | 🟠 Major

Keep ShapeKey off the public surface.

This is allocator plumbing, not a facade-facing type. Leaving it public makes low-level memory internals part of the supported API for no consumer benefit.

🔒 Suggested change
-public readonly struct ShapeKey : IEquatable<ShapeKey>
+internal readonly struct ShapeKey : IEquatable<ShapeKey>

As per coding guidelines, "Users should ONLY interact with AiModelBuilder.cs and AiModelResult.cs" and "Helper classes, utilities, and implementation details should be internal."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/Memory/ForwardArena.cs` at line 136, The ShapeKey type is exposed
publicly but is an allocator/internal implementation detail; change its
accessibility to internal by updating the declaration of ShapeKey (the struct
named ShapeKey in ForwardArena) from public to internal (i.e., make it "internal
readonly struct ShapeKey : IEquatable<ShapeKey>") and then rebuild and fix any
compilation issues (if external tests or other assemblies need access, use
InternalsVisibleTo rather than making ShapeKey public); ensure no public API
signatures rely on ShapeKey so the public surface stays unchanged.
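For context, the shape-keyed slab pattern this comment targets looks roughly like the sketch below. This is not the repo's actual ForwardArena code — field names are illustrative, and folding each dimension as a whole int (rather than byte-by-byte FNV-1a) is a simplification — but it shows the internal, value-type, IEquatable shape the review asks for:

```csharp
using System;

// Sketch: value-type shape key with an FNV-1a hash over the dimensions.
// Declaring it internal keeps allocator plumbing off the public API surface.
internal readonly struct ShapeKey : IEquatable<ShapeKey>
{
    private readonly int[] _dims;
    private readonly int _hash;

    public ShapeKey(int[] dims)
    {
        _dims = dims;
        unchecked
        {
            uint h = 2166136261;  // FNV-1a 32-bit offset basis
            foreach (int d in dims)
            {
                h ^= (uint)d;
                h *= 16777619;    // FNV-1a 32-bit prime
            }
            _hash = (int)h;
        }
    }

    public bool Equals(ShapeKey other)
    {
        if (_dims.Length != other._dims.Length) return false;
        for (int i = 0; i < _dims.Length; i++)
            if (_dims[i] != other._dims[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => obj is ShapeKey k && Equals(k);
    public override int GetHashCode() => _hash;
}
```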
src/NeuralNetworks/Layers/SSM/RetNetLayer.cs (1)

381-382: ⚠️ Potential issue | 🟠 Major

Zero the causal-mask gap in retentionScores.

Only the j <= i region is written below. With a rented tensor, the cached j > i entries keep stale values, and _lastRetentionScores then exposes that garbage to backward passes.

🧹 Suggested fix
         var retentionScores = TensorAllocator.Rent<T>(new[] { batchSize, _numHeads, seqLen, seqLen });
+        for (int bi = 0; bi < batchSize; bi++)
+            for (int h = 0; h < _numHeads; h++)
+                for (int i = 0; i < seqLen; i++)
+                    for (int j = i + 1; j < seqLen; j++)
+                        retentionScores[new[] { bi, h, i, j }] = NumOps.Zero;
         _lastRetentionScores = retentionScores;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RetNetLayer.cs` around lines 381 - 382, The
rented tensor retentionScores (from TensorAllocator.Rent<T>(...)) contains stale
data in entries where j>i because only the causal lower triangle is written
later; zero out the causal-mask gap before storing it in _lastRetentionScores.
Locate the allocation around retentionScores and either initialize the full
tensor to zero or explicitly zero the strict upper-triangle (indices where j>i
for dimensions batchSize, _numHeads, seqLen, seqLen) so _lastRetentionScores
does not expose stale values to backward passes.
src/NeuralNetworks/Layers/SSM/RWKV7Block.cs (1)

673-673: ⚠️ Potential issue | 🟠 Major

input.Shape.ToArray() still allocates inside per-token normalization.

ApplyGroupNorm() runs once per token in TimeMixingForward(), and GroupNormBackward() repeats the same pattern during training. The tensor bodies come from TensorAllocator, but these fresh shape arrays keep the path from being allocation-free. Pass a caller-owned output/workspace buffer instead of renting from input.Shape.ToArray() inside the methods.

Also applies to: 926-926

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs` at line 673, ApplyGroupNorm and
GroupNormBackward allocate a fresh shape array per token via
TensorAllocator.Rent<T>(input.Shape.ToArray()) called inside TimeMixingForward;
change these methods to accept a caller-owned output/workspace Tensor (or Span)
parameter so the caller (TimeMixingForward) rents a buffer once and passes it
in, eliminating the input.Shape.ToArray() allocation. Update the signatures of
ApplyGroupNorm and GroupNormBackward to take the output/workspace buffer and
modify their call sites in TimeMixingForward (and the training path) to reuse
the pre-rented buffer instead of calling Rent with input.Shape.ToArray(); ensure
both occurrences (the forward call and the backward call) are updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Memory/ForwardArena.cs`:
- Around line 141-149: The ShapeKey constructor always clones the input array
causing allocations on every lookup (used by Rent() and RentUninitialized());
change the design so lookup keys don't clone: introduce a second constructor or
a static factory (e.g., ShapeKey.FromLookup(int[] shape) or ShapeKey(bool copy,
int[] shape)) and use it in Rent()/RentUninitialized() to create a
non-allocating key, while keeping the existing cloning behavior only when
creating a key that will be stored in the dictionary (e.g., when inserting a
brand-new slab); update dictionary insert paths to clone into the stored
ShapeKey and keep lookup paths non-copying to avoid per-lookup allocations.
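One way to realize the split is a pair of factories, sketched below with hypothetical names (ForLookup/ForStorage are illustrative, not the repo's API):

```csharp
internal readonly struct ShapeKey
{
    private readonly int[] _dims;

    private ShapeKey(int[] dims) => _dims = dims;

    // Hot path (Rent/RentUninitialized): wrap the caller's array without
    // cloning. Safe because the key only lives for the dictionary lookup.
    public static ShapeKey ForLookup(int[] shape) => new ShapeKey(shape);

    // Insert path (new slab): clone, because the stored key outlives the call
    // and must not observe later mutations of the caller's array.
    public static ShapeKey ForStorage(int[] shape) => new ShapeKey((int[])shape.Clone());
}
```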

In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs`:
- Around line 346-347: The rented buffers state and allStates are uninitialized
and must be zeroed to avoid corrupting the first S*k update and the saved
history; after calling TensorAllocator.Rent<T>(...) for state and allStates in
GatedDeltaNetLayer (the two vars named state and allStates), immediately
clear/zero their contents (e.g., set all elements to default(T)/0) before any
reads or before the S * k path uses state and before recording allStates[..., 0,
...]; ensure you use the tensor/allocator API to zero the entire buffers so the
initial fast-weight matrix and initial snapshot are truly zeros.

In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs`:
- Around line 437-438: The rented recurrent buffers 'state' and 'allStates' must
be zero-initialized immediately after allocation to avoid reading stale pooled
memory (they serve as S_{t-1} on first step and allStates[...,0,...] as the
cached initial state); after calling TensorAllocator.Rent<T>(...) for these
variables in GatedDeltaProductLayer.cs, set their contents to zero (either by
using a zeroing allocator API like RentZeroed<T> if available or by calling the
tensor clear/fill-to-zero method on 'state' and 'allStates') so the recurrence
starts from a guaranteed zero state.

In `@src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs`:
- Around line 283-288: The rented tensors returned by TensorAllocator.Rent<T>
(state, allStates, _lastKVGate, _lastKVGateRaw) must be zero-initialized because
state is read on the first timestep and allStates[:,0,...] holds the cached S0;
after switching from new Tensor<T> they may contain stale pool data. Fix by
immediately zeroing each rented tensor right after the Rent<T> call (e.g., call
the tensor's clear/zero-fill method or otherwise fill with default(T)) so
forward outputs and cached values are deterministic before any reads in the
methods that use state, allStates, _lastKVGate, and _lastKVGateRaw.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Around line 785-789: The initialization of x_prev in RWKV7Block.cs when t == 0
must be zero-initialized to avoid gradients depending on stale pooled data;
replace the TensorAllocator.Rent<T>(...) call for x_prev with a guaranteed-zero
allocation (e.g., new Tensor<T>(new[] { batchSize, _modelDimension }) or call a
clear/zero method on the rented tensor) so the first-token path that reads
x_prev sees zeros; update the x_prev allocation in the t == 0 branch (symbol:
x_prev, TensorAllocator.Rent<T>, Tensor<T>, t == 0) accordingly and ensure any
rented tensor is zeroed before use.

In `@src/NeuralNetworks/Layers/SSM/S6Scan.cs`:
- Line 243: The rented accumulators dh, d_B_t, and d_C_t are left uninitialized
and get stale data from the pool; before the first Add in the backward pass you
must zero them after renting (or use a rent-zero helper) so accumulation is
correct. Locate the Rent<T> calls that create dh (and the similar rents for
d_B_t and d_C_t around the referenced areas) and immediately clear their buffers
(e.g., fill with zeros or call an allocator RentZeroed equivalent) before any
Add/accumulate operations in the backward methods so the gradients are pure sums
of current contributions.
- Around line 101-103: Validate that initialState (when provided) has shape
[batch, innerDim, stateDim] before cloning; if initialState is null or has
incorrect shape, initialize h to a correctly shaped zero/neutral tensor of shape
determined from batch/innerDim/stateDim. Always copy h into the first time-slice
of allHiddenStates (allHiddenStates[:,0,...]) unconditionally so the rented
buffer cannot retain stale pool data, and ensure the same validation/assignment
logic is applied to the alternate path handled near the usage of h (the code
around creation/copy into h and the subsequent writes to allHiddenStates).
- Around line 212-213: The recomputation seed uses a fresh buffer h_recomp
instead of the recorded initial state, so change the allocation/assignment in
SequentialScanForward(): allocate or rent h_recomp with the same shape but then
copy/clone the recorded initial hidden state (hiddenStates at time index 0 /
hiddenStates[:,0,...]) into h_recomp before assigning recomputedStates[0];
ensure shapes match and preserve device/dtype semantics when cloning to avoid
computing gradients from a zeroed initial state.
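Several of these comments mention a "RentZeroed equivalent". If the allocator lacks one, a minimal sketch of such a helper — assuming TensorAllocator.Rent<T>, Fill, and NumOps.Zero behave as used elsewhere in this PR — would be:

```csharp
internal static class TensorAllocatorZeroed
{
    // Rent from the pool, then clear: pooled buffers may carry stale data
    // from a previous tenant, so recurrence seeds and accumulators must be
    // zeroed before the first read or Add.
    public static Tensor<T> RentZeroed<T>(int[] shape)
    {
        var tensor = TensorAllocator.Rent<T>(shape);
        tensor.Fill(NumOps.Zero);
        return tensor;
    }
}
```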

---

Outside diff comments:
In `@src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs`:
- Around line 465-482: Remove the dead helper CreateOnesLike (private Tensor<T>
CreateOnesLike(Tensor<T> template)) from the file since ComputeSiLUDerivative
now uses scalar/tensor ops and there are no callers; delete the entire method
(including the local ones variable and Fill call) and ensure no other code
references CreateOnesLike remain.

In `@src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs`:
- Around line 607-621: Remove the unused helper CreateOnesLike from the file to
eliminate dead code introduced during the SiLU refactor: delete the private
method CreateOnesLike(Tensor<T> template) and any references (none expected) so
only ComputeSiLUDerivative and active helpers remain; ensure no compilation or
style issues remain after removal and run tests/build to verify.

In `@src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs`:
- Around line 442-458: Remove the dead helper CreateOnesLike by deleting the
private method CreateOnesLike(Tensor<T> template) and any references to it; the
ComputeSiLUDerivative method now uses scalar ops and no longer needs this
helper, so ensure there are no remaining calls to CreateOnesLike and run
tests/compile to confirm cleanup.

In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs`:
- Around line 388-404: Remove the now-unused CreateOnesLike helper method:
delete the private Tensor<T> CreateOnesLike(Tensor<T> template) method (the
unused symbol CreateOnesLike) from the file so no dead/unreachable code remains
after the scalar-based SiLU rewrite in ComputeSiLUDerivative; also search for
and remove any remaining references to CreateOnesLike to ensure no callers
remain.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 485-499: Remove the dead CreateOnesLike helper: delete the private
method CreateOnesLike(Tensor<T> template) since ComputeSiLUDerivative now
computes SiLU without it; search for any remaining references to CreateOnesLike
and remove or replace them, and run build/tests to ensure no other code depended
on it (also remove any now-unused imports or fields that existed solely for that
helper).

In `@src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs`:
- Around line 339-366: The forward path still allocates inside
LightningAttentionForward (Forward calls new Tensor<T> for state and allStates),
so change LightningAttentionForward to avoid internal allocations by adding
parameters (e.g., out or preallocated Tensor parameters like stateOut,
allStatesOut) or accept rented buffers and write into them; update Forward to
rent those tensors via TensorAllocator.Rent<T>(...) and pass them into
LightningAttentionForward (or provide a reusable per-batch buffer on the layer)
so no new Tensor<T>() is created inside LightningAttentionForward; ensure
_last... fields are set to the rented/result tensors and release/return buffers
consistently.

---

Duplicate comments:
In `@src/Memory/ForwardArena.cs`:
- Line 136: The ShapeKey type is exposed publicly but is an allocator/internal
implementation detail; change its accessibility to internal by updating the
declaration of ShapeKey (the struct named ShapeKey in ForwardArena) from public
to internal (i.e., make it "internal readonly struct ShapeKey :
IEquatable<ShapeKey>") and then rebuild and fix any compilation issues (if
external tests or other assemblies need access, use InternalsVisibleTo rather
than making ShapeKey public); ensure no public API signatures rely on ShapeKey
so the public surface stays unchanged.

In `@src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs`:
- Around line 324-326: The rented buffer "state" and the snapshot array
"allStates" must be initialized to deterministic values and the t=0 snapshot
seeded so subsequent reads (e.g., where "state" is read before being written and
where "_lastStates[..., 0, ...]" is expected) are well-defined; modify the
allocation code that uses TensorAllocator.Rent<T>(...) to clear or fill "state"
with zeros (or default(T)) immediately after rental and copy that initial zeroed
"state" into allStates at index t=0 (and into _lastStates slice for t=0) before
entering the recurrence loop so the recurrence base and stored _lastStates[...,
0, ...] are deterministic. Ensure you reference the variables "state",
"allStates", "_lastStates", and the use of TensorAllocator.Rent<T>" when making
the change.

In `@src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs`:
- Around line 296-327: After renting the buffer with TensorAllocator.Rent<T> for
expertStates in MixtureOfMambaLayer, zero-initialize the entire tensor before
the scan loop to avoid stale pooled values being read; e.g., immediately after
expertStates = TensorAllocator.Rent<T>(...) clear all entries to NumOps.Zero
(via a provided TensorAllocator.Clear/Fill utility or a tight nested loop over
batchSize, _numExperts, seqLen+1, _stateDimension) so reads like
expertStates[new[] { bi, expertIdx, t, si }] see zeros for uninitialized
experts.

In `@src/NeuralNetworks/Layers/SSM/RetNetLayer.cs`:
- Around line 381-382: The rented tensor retentionScores (from
TensorAllocator.Rent<T>(...)) contains stale data in entries where j>i because
only the causal lower triangle is written later; zero out the causal-mask gap
before storing it in _lastRetentionScores. Locate the allocation around
retentionScores and either initialize the full tensor to zero or explicitly zero
the strict upper-triangle (indices where j>i for dimensions batchSize,
_numHeads, seqLen, seqLen) so _lastRetentionScores does not expose stale values
to backward passes.

In `@src/NeuralNetworks/Layers/SSM/RodimusLayer.cs`:
- Around line 386-388: The rented buffers state, allStates and
selectionWeightsCache are used before being initialized; after calling
TensorAllocator.Rent<T>(...) you must initialize state (e.g. zero or the proper
initial recurrence state) and immediately write the t=0 snapshot into allStates
at index 0 so _lastStates/.../0/... isn't left stale; update the code around
TensorAllocator.Rent usage (symbols: state, allStates, selectionWeightsCache,
TensorAllocator.Rent) to set state to the correct initial values and copy/assign
that state into allStates[..., 0, ...] before the time-loop and before any
reads.

In `@src/NeuralNetworks/Layers/SSM/RWKV7Block.cs`:
- Line 673: ApplyGroupNorm and GroupNormBackward allocate a fresh shape array
per token via TensorAllocator.Rent<T>(input.Shape.ToArray()) called inside
TimeMixingForward; change these methods to accept a caller-owned
output/workspace Tensor (or Span) parameter so the caller (TimeMixingForward)
rents a buffer once and passes it in, eliminating the input.Shape.ToArray()
allocation. Update the signatures of ApplyGroupNorm and GroupNormBackward to
take the output/workspace buffer and modify their call sites in
TimeMixingForward (and the training path) to reuse the pre-rented buffer instead
of calling Rent with input.Shape.ToArray(); ensure both occurrences (the forward
call and the backward call) are updated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: d8e06a5a-7942-4786-8e52-a90541475a01

📥 Commits

Reviewing files that changed from the base of the PR and between 554b23f and d8673da.

📒 Files selected for processing (22)
  • src/Memory/ForwardArena.cs
  • src/NeuralNetworks/Layers/DiffusionConvLayer.cs
  • src/NeuralNetworks/Layers/SSM/ABCLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedDeltaProductLayer.cs
  • src/NeuralNetworks/Layers/SSM/GatedSlotAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/HGRN2Layer.cs
  • src/NeuralNetworks/Layers/SSM/HedgehogLayer.cs
  • src/NeuralNetworks/Layers/SSM/HyenaLayer.cs
  • src/NeuralNetworks/Layers/SSM/KimiLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/LogLinearAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/MesaNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMambaLayer.cs
  • src/NeuralNetworks/Layers/SSM/MixtureOfMemoriesLayer.cs
  • src/NeuralNetworks/Layers/SSM/MultiLatentAttentionLayer.cs
  • src/NeuralNetworks/Layers/SSM/RWKV7Block.cs
  • src/NeuralNetworks/Layers/SSM/RebasedLayer.cs
  • src/NeuralNetworks/Layers/SSM/RetNetLayer.cs
  • src/NeuralNetworks/Layers/SSM/RodimusLayer.cs
  • src/NeuralNetworks/Layers/SSM/S6Scan.cs
  • src/NeuralNetworks/Layers/SSM/TTTLayer.cs
  • src/NeuralNetworks/Layers/SSM/TransNormerLLMLayer.cs

- ShapeKey: only defensive-copy in EnsureCapacity (storage), not in Rent
  (hot-path lookup). Eliminates per-Rent allocation from Clone().
- GatedDeltaNetLayer, GatedDeltaProductLayer, KimiLinearAttentionLayer:
  reverted state/allStates buffer allocations back to new Tensor<T>()
  for guaranteed zero initialization of recurrent state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 05:08
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 80 out of 80 changed files in this pull request and generated 1 comment.



Comment on lines +428 to +436
// Cache intermediate values for backward — zero allocation via workspace
var allR = Ws.Sequence(SqAllR);
var allK = Ws.Sequence(SqAllK);
var allV = Ws.Sequence(SqAllV);
var allA = Ws.Sequence(SqAllA);
var allB = Ws.Sequence(SqAllB);
var allWkv = Ws.Sequence(SqAllWkv);
var allWkvPreGate = Ws.Sequence(SqAllWkvPre);
var allWkvGated = Ws.Sequence(SqAllWkvGated);

Copilot AI Apr 5, 2026


allWkvPreGate is declared from Ws.Sequence(SqAllWkvPre) but never used. With TreatWarningsAsErrors enabled, this will fail the build (unused local). Either remove the local (and consider removing the SqAllWkvPre workspace declaration if it's truly unused) or actually populate/use this buffer as intended (e.g., cache pre-gate or pre-norm WKV values).

@ooples ooples merged commit ff7e902 into master Apr 5, 2026
28 of 50 checks passed
@ooples ooples deleted the perf/tensor-allocator-rent-1014 branch April 5, 2026 21:33
ooples added a commit that referenced this pull request Apr 5, 2026
…terface segregation

Resolved 3 merge conflicts between PR #1079 (hardcoded double conversion) and
master (interface segregation #1077, arena allocator #1083). Kept master's
InterfaceGuard/ParamModel caching and PR's SIMD-accelerated operations.

Fixed pre-existing NumOps.Tanh(T) scalar calls that don't exist — Tanh on
NumOps is span-only. Added ScalarTanh/ScalarSigmoid to MetaLearnerBase using
NumOps.Exp primitives (matching IcaDecomposition pattern). Fixed NTMAlgorithm
inner classes with same formula.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ooples added a commit that referenced this pull request Apr 5, 2026
…terface segregation

Resolved 3 merge conflicts between PR #1079 (hardcoded double conversion) and
master (interface segregation #1077, arena allocator #1083). Kept master's
InterfaceGuard/ParamModel caching and PR's SIMD-accelerated operations.

Fixed NumOps.Tanh(T) calls — Tanh on NumOps is span-only. Added ScalarTanh and
ScalarSigmoid to MetaLearnerBase using NumOps.Exp primitives. Fixed NTMAlgorithm
inner classes with same formula. Replaced manual Softmax with Engine.Softmax.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

perf: migrate raw tensor allocations to TensorAllocator.Rent and IEngine acceleration

2 participants