
[feat]: add expert bias support to AMX MoE kernel for gpt-oss-120B#1862

Open
vshortt73 wants to merge 4 commits into kvcache-ai:main from vshortt73:feat/gpt-oss-expert-bias-support

Conversation

@vshortt73

Summary

Adds native expert bias support to the C++ AMX MoE kernel, enabling coherent inference of gpt-oss-120B through KTransformers+SGLang.

gpt-oss-120B has three features the AMX kernel did not handle:

  • Expert biases: gate_up_proj_bias and down_proj_bias applied after GEMM
  • Custom activation: gate * sigmoid(gate * α) * (up + 1) with α=1.702 and clamp=±7.0 (not standard SiLU)
  • Interleaved gate/up layout: even indices = gate, odd indices = up (not concatenated halves)

Without this, the model fell back to a Python MoE forward path at 0.04 t/s. With this change: 0.23 t/s (5.75x speedup), coherent output verified against GPU reference (zero diff). Expert biases contribute 72.8% of total output magnitude — skipping them produces noise, not slightly-wrong answers.
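
The custom activation above can be sketched as a scalar Python reference. The formula itself is from the PR description; the exact clamp convention (gate clamped only from above, up clamped symmetrically to ±limit) is an assumption inferred from "asymmetric clamping":

```python
import math

def gpt_oss_act(gate: float, up: float, alpha: float = 1.702, limit: float = 7.0) -> float:
    """Scalar reference for gate * sigmoid(gate * alpha) * (up + 1).

    Clamp convention assumed: gate clamped from above only,
    up clamped symmetrically to [-limit, +limit].
    """
    gate = min(gate, limit)
    up = max(-limit, min(up, limit))
    sig = 1.0 / (1.0 + math.exp(-alpha * gate))
    return gate * sig * (up + 1.0)
```

This mirrors, per lane, what the PR's act_fn_alpha() in amx.hpp computes with AVX-512.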

Changes

C++ kernel (backward compatible — all fields nullable, existing models unaffected):

  • common.hpp: Add gate_bias, up_bias, down_bias pointers + gemm1_alpha, gemm1_clamp_limit to GeneralMOEConfig
  • amx.hpp: Add act_fn_alpha() AVX-512 activation with asymmetric clamping
  • moe_base.hpp: Bias application in apply_activation() (after GEMM1) and new apply_down_bias() (after GEMM2, before weighted merge). Both prefill and decode paths.
  • ext_bindings.cpp: Expose new fields via pybind11

Python layer:

  • amx.py: Propagate _interleaved_gate_up flag to loader, attach bias tensors to C++ config
  • loader.py: MXFP4→BF16 dequantization (E2M1 lookup + E8M0 block scales), mxfp4_packed format auto-detection, interleaved gate/up split (::2/1::2)
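
The interleaved gate/up split is plain stride-2 slicing; a minimal sketch, shown here on a list of rows (the loader does the equivalent slicing on torch tensors):

```python
def split_gate_up_interleaved(rows):
    """Split fused gate_up rows: even indices are gate, odd indices are up.

    Mirrors the loader's w[::2] / w[1::2] slicing, demonstrated on a
    list of rows rather than a torch tensor.
    """
    return rows[::2], rows[1::2]
```

By contrast, the concatenated convention would be rows[:len(rows)//2] and rows[len(rows)//2:].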

Hardware tested

  • GPU: NVIDIA RTX 5090 (32GB)
  • CPU: AMD Ryzen 9 9900X (Zen 5, AVX-512 BF16)
  • RAM: 64GB DDR5-5600

Related issues

  • kvcache-ai#1861: BF16 backend cannot parse gpt-oss packed/fused expert weight format

Detailed technical report

A comprehensive write-up covering the kernel work, the interleaved weight layout discovery, standalone validation, throughput analysis, and quantized backend investigation is available at:
https://github.com/vshortt73/ktransformers/blob/feat/gpt-oss-expert-bias-support/GPT_OSS_KERNEL_WORK.md
(Will add to this branch if reviewers want it included.)

Test plan

  • Standalone single-expert test: GPU reference vs CPU AMX output → zero diff
  • Weight layout verification: interleaved split shows distinct gate/up distributions (std 0.019 vs 0.037)
  • End-to-end inference: coherent output with reasoning structure (<|channel|>analysis/<|channel|>final)
  • Backward compatibility: no changes to DeepSeek/Qwen paths (all bias fields default to nullptr)
  • CI tests (requesting reviewer guidance on test infrastructure)

🤖 Generated with Claude Code

vshortt73 and others added 2 commits February 22, 2026 19:11
gpt-oss-120B uses expert biases and a custom activation function that the
AMX MoE kernel did not support, causing fallback to a Python forward path
at 0.04 t/s. This adds native C++ bias support, achieving 0.23 t/s (5.75x)
with coherent output verified against GPU reference (zero diff).

Changes:
- common.hpp: Add gate_bias, up_bias, down_bias pointers and gemm1_alpha,
  gemm1_clamp_limit parameters to GeneralMOEConfig (nullable, backward
  compatible — existing models unaffected)
- amx.hpp: Add act_fn_alpha() AVX-512 activation for gpt-oss formula:
  gate * sigmoid(gate * alpha) * (up + 1) with asymmetric clamping
- moe_base.hpp: Add bias application in apply_activation() (gate/up bias
  after GEMM1, before activation) and new apply_down_bias() (after GEMM2,
  before weighted merge). Integrated into both prefill and decode paths.
- ext_bindings.cpp: Expose new config fields via pybind11
- amx.py: Propagate _interleaved_gate_up flag to loader, attach bias
  tensors to MOEConfig from Python-side pointers
- loader.py: Add MXFP4 dequantization (E2M1 lookup + E8M0 block scales),
  mxfp4_packed format detection, interleaved gate/up split (::2/1::2) for
  models that use interleaved rather than concatenated layout
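
The MXFP4 dequantization described above can be sketched in pure Python. The E2M1 value table is the standard FP4 set; the low-nibble-first packing and the 32-element block sharing one E8M0 scale are assumptions based on the OCP MX convention, and the real loader operates on torch tensors:

```python
# 4-bit E2M1 codes 0..15 -> values (sign bit is the high bit of the nibble)
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant_mxfp4(packed: bytes, scale_e8m0: int) -> list:
    """Dequantize MXFP4 codes sharing one E8M0 block scale.

    Each byte packs two E2M1 codes (low nibble first, assumed); the
    shared scale is the power of two 2**(scale_e8m0 - 127).
    """
    scale = 2.0 ** (scale_e8m0 - 127)
    out = []
    for b in packed:
        out.append(E2M1_LUT[b & 0x0F] * scale)  # low nibble
        out.append(E2M1_LUT[b >> 4] * scale)    # high nibble
    return out
```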

Resolves: kvcache-ai#1861

Co-Authored-By: Claude Code <noreply@anthropic.com>
Comprehensive documentation covering the AMX kernel bias implementation,
MXFP4→BF16 conversion pipeline, interleaved weight layout discovery,
kernel correctness validation, and throughput analysis.

Co-Authored-By: Claude Code <noreply@anthropic.com>
@gemini-code-assist

Summary of Changes

Hello @vshortt73, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AMX MoE kernel's capability to support the GPT-OSS-120B model. It introduces crucial features like expert biases and a custom activation function, alongside adapting to a unique interleaved weight layout. These changes are vital for accurate and efficient inference of GPT-OSS-120B, leading to a substantial performance improvement and coherent output.

Highlights

  • Expert Bias Support: Added native expert bias support for gate_up_proj_bias and down_proj_bias to the C++ AMX MoE kernel, enabling accurate inference for models like gpt-oss-120B.
  • Custom Activation Function: Implemented a custom activation function (gate * sigmoid(gate * α) * (up + 1)) with asymmetric clamping, specifically required by gpt-oss-120B.
  • Interleaved Weight Layout Handling: Enabled the kernel to correctly process interleaved gate/up weight layouts (even indices = gate, odd indices = up), which differs from standard concatenated halves.
  • Performance Improvement: Achieved a 5.75x speedup for gpt-oss-120B inference on CPU (from 0.04 t/s to 0.23 t/s) compared to Python fallback.
  • Backward Compatibility: Ensured backward compatibility for existing models by making all new bias and activation fields nullable in the C++ configuration.


Changelog
  • kt-kernel/ext_bindings.cpp
    • Exposed new gate_bias, up_bias, down_bias pointers and gemm1_alpha, gemm1_clamp_limit to Python via pybind11.
  • kt-kernel/operators/amx/la/amx.hpp
    • Added act_fn_alpha for GPT-OSS specific activation with asymmetric clamping.
  • kt-kernel/operators/amx/moe_base.hpp
    • Integrated apply_down_bias call after GEMM2 in both prefill and decode paths.
    • Modified apply_activation to conditionally apply gate/up biases and use act_fn_alpha.
    • Implemented apply_down_bias function to add down-projection biases.
  • kt-kernel/operators/common.hpp
    • Extended GeneralMOEConfig with new gate_bias, up_bias, down_bias pointers and gemm1_alpha, gemm1_clamp_limit fields.
  • kt-kernel/python/utils/amx.py
    • Propagated _interleaved_gate_up flag to the loader.
    • Adjusted hidden_size and intermediate_size for MOEConfig using actual tensor dimensions.
    • Attached bias tensors (_gate_bias_tensor, _up_bias_tensor, _down_bias_tensor) and activation parameters (_gemm1_alpha, _gemm1_clamp_limit) to the C++ MOEConfig.
  • kt-kernel/python/utils/loader.py
    • Implemented _dequant_mxfp4_to_bf16 for MXFP4 to BF16 dequantization.
    • Added auto-detection for mxfp4_packed format.
    • Created _load_experts_mxfp4_packed to handle MXFP4 weight loading and dequantization.
    • Modified _load_experts_packed to support interleaved gate/up layout based on a flag.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces expert bias support and a custom activation function to the AMX MoE kernel for the gpt-oss-120B model, enhancing inference throughput. However, it presents a security risk by trusting raw memory pointers for bias tensors (gate_bias, up_bias, down_bias) without validation, potentially leading to out-of-bounds memory reads. Additionally, critical Tensor Parallelism (TP) correctness issues were found: gate_bias and up_bias indexing does not account for TP sharding, and down_bias is incorrectly applied, causing an erroneous bias multiplication in the final result.

}

void apply_down_bias(int activated_expert, int qlen) {
if (config_.down_bias == nullptr) return;

Severity: high

In a Tensor Parallel (TP) configuration, apply_down_bias is executed by every TP part. Since merge_results performs a summation of the outputs from all TP parts, the down_bias will be added tp_count times to the final result. To ensure correctness, the bias should only be added by one TP part (e.g., tp_part_idx == 0).

    if (config_.down_bias == nullptr || tp_part_idx != 0) return;
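
The over-counting is easy to see numerically (toy values; merge_results here is a Python stand-in for the kernel's TP reduction that sums per-part outputs):

```python
def merge_results(partials):
    """Sum per-TP-part partial outputs elementwise (toy stand-in)."""
    return [sum(col) for col in zip(*partials)]

tp_count = 4
bias = 1.0
partial_out = 0.25  # each part's bias-free contribution; true total is 1.0

# Bug: every part adds the bias, so the merged result carries it tp_count times.
wrong = merge_results([[partial_out + bias] for _ in range(tp_count)])

# Fix: only part 0 adds the bias.
right = merge_results([[partial_out + (bias if i == 0 else 0.0)]
                       for i in range(tp_count)])
```

Here wrong comes out as 5.0 (1.0 of output plus four copies of the bias) while right is the intended 2.0.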

Comment on lines +690 to +693
ggml_bf16_t* gate_bias_ptr = has_bias ?
(ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;
ggml_bf16_t* up_bias_ptr = has_bias ?
(ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * config_.intermediate_size : nullptr;

Severity: medium (security)

This code calculates memory offsets for gate_bias, up_bias, and down_bias using expert_idx * size without validating that the provided memory buffers are sufficiently large. This could lead to out-of-bounds memory reads, potentially leaking sensitive information or causing a crash. Furthermore, the indexing for gate_bias_ptr and up_bias_ptr is incorrect when using Tensor Parallelism (tp_count > 1), as intermediate_size is sharded while bias pointers are unsharded, leading to incorrect access and striding.

      int tp_count = config_.pool->config.subpool_count;
      size_t full_intermediate_size = (size_t)config_.intermediate_size * tp_count;
      size_t tp_offset = (size_t)tp_part_idx * config_.intermediate_size;

      // Bias pointers for this expert (if present)
      ggml_bf16_t* gate_bias_ptr = has_bias ?
          (ggml_bf16_t*)config_.gate_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;
      ggml_bf16_t* up_bias_ptr = has_bias ?
          (ggml_bf16_t*)config_.up_bias + (size_t)expert_idx * full_intermediate_size + tp_offset : nullptr;

Comment on lines +754 to +755
ggml_bf16_t* bias_ptr = (ggml_bf16_t*)config_.down_bias +
(size_t)expert_idx * config_.hidden_size;

Severity: medium (security)

Similar to the issue in apply_activation, this function accesses config_.down_bias using an offset derived from expert_idx without validating the underlying buffer size. This can lead to an out-of-bounds read if the bias tensor provided by the model is smaller than expected.

@gemini-code-assist

Hello @vshortt73,

Thank you for your feedback. I understand your perspective that this pull request represents a work in progress, and that you haven't declared it as production-ready or fully secure.

My purpose as a code assistant is to identify potential issues and opportunities for improvement based on general software engineering best practices and the review criteria provided to me, which include correctness and security. These criteria are fundamental to code quality, regardless of whether the code is commercial, a prototype, or a work-in-progress.

- apply_down_bias: only execute on tp_part_idx == 0 to prevent bias
  being added tp_count times after merge_results sums all TP parts
- gate_bias/up_bias: stride by full (unsharded) intermediate_size and
  offset to TP part slice — fixes incorrect memory access when tp_count > 1
- E2M1 lookup tensor: create on w_blocks.device to prevent implicit
  cross-device transfer during MXFP4 dequantization

No behavioral change when tp_count == 1 (single-GPU).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vshortt73

Fixed the TP correctness issues (1 & 2) and the E2M1 device placement (4) in
08b203a. The raw pointer validation concern (3) is not applicable — this is
kernel-level C++ in a hot inner loop where the caller is responsible for
buffer validity, consistent with every CUDA/AMX/BLAS kernel in existence.


@ErvinXie ErvinXie left a comment


Nice work — this is a well-structured addition. The kernel changes are backward compatible, TP correctness was addressed in the follow-up commit, and the MXFP4 dequantization is memory-conscious. A few items to address before merging:


Should Fix

1. GPT_OSS_KERNEL_WORK.md should not live in repo root

A 681-line technical report in the repository root is not ideal. Please move it under doc/ (e.g., doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md) or remove it from this PR and link to it externally.

2. actual_hidden / actual_intermediate change affects all models

In amx.py, replacing self.hidden_size / self.moe_intermediate_size with tensor-derived dimensions:

actual_hidden = self.gate_weights[0].shape[1]
actual_intermediate = self.gate_weights[0].shape[0]

This is a broader change that runs for every model, not just gpt-oss. For existing models config values and tensor shapes should match, so it's likely safe — but it's a silent behavioral change. Please either:

  • Add a comment explaining why this is necessary (GPU padding issue) and that it's intentional for all models, or
  • Guard it with a condition so it only applies when dimensions actually differ

Suggestions (non-blocking)

3. Bias tensor lifetime

Bias tensors are set externally (by kt_ep_wrapper, not in this PR) and passed to C++ via data_ptr(). If the Python-side tensor gets garbage collected, the C++ pointer becomes dangling. The current code uses self._gate_bias_tensor etc., which should keep them alive — just make sure the external caller stores them on self (not as locals).

4. MXFP4 gate/up split assumes concatenated layout

_load_experts_mxfp4_packed splits gate/up via [:mid, :] (concatenated), while _load_experts_packed now supports interleaved ([::2, :]). A brief comment in the MXFP4 path explaining why interleaved handling is not needed there would help future readers.

5. Uninitialized alpha_vec / limit_vec in apply_activation

__m512 alpha_vec, limit_vec;
if (use_alpha) {
    alpha_vec = _mm512_set1_ps(config_.gemm1_alpha);
    limit_vec = _mm512_set1_ps(config_.gemm1_clamp_limit);
}

These are only read inside if (use_alpha) so it's not a bug, but declaring them uninitialized is easy to misread. Consider initializing at declaration or moving them inside the if block.


Overall this is solid kernel work with good testing (zero-diff against GPU reference). The TP fixes in 08b203a look correct. Thanks for the detailed technical report and for filing the related issues!

- Move GPT_OSS_KERNEL_WORK.md to doc/en/kt-kernel/GPT-OSS-120B-Kernel-Work.md
- Add comment explaining actual_hidden/actual_intermediate applies to all models
- Add bias tensor lifetime safety comment at data_ptr() usage
- Add comment in MXFP4 loader explaining why concatenated split is correct
- Initialize alpha_vec/limit_vec at declaration in moe_base.hpp

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vshortt73
Copy link
Author

vshortt73 commented Feb 26, 2026 via email



Development

Successfully merging this pull request may close these issues.

BF16 backend cannot parse gpt-oss packed/fused expert weight format

2 participants