PSiLU (Parametric Sigmoid Linear Unit): A Characterization Report (DRAFT)
Transparency Note: Mathematical derivations and symbolic analysis were assisted by large language models. Empirical experiments, data collection, and implementation were conducted directly by the author. Readers are encouraged to verify derivations and report any discrepancies via the repository issue tracker.
We introduce PSiLU (Parametric Sigmoid Linear Unit), a continuously differentiable activation function extending the Swish/SiLU family via two learnable parameters:

$$f(x; \beta, \delta) = x \cdot \sigma(\beta x + \delta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$
This report provides a mathematical characterization of PSiLU, including domain, range, derivatives, gradient bounds, fixed points, and behavioral regimes. Empirical analysis identifies a stable subspace ("manifold") within the $(\beta, \delta)$ parameter space suitable for diverse, stable initialization.
Modern neural networks predominantly employ fixed, homogeneous activation functions (e.g., ReLU, GELU, Swish). While effective, this design imposes a structural prior where every neuron must adapt its weights to work within the same nonlinearity. Recent research has explored increasing activation flexibility via discrete selection (MoE-style) or unbounded approximation (splines-KAN, PAU), though these introduce routing complexity or computational cost.
PSiLU is designed as a continuous extension of the smooth gated activation family. By introducing two learnable parameters per unit, PSiLU allows each neuron to adapt its activation shape while maintaining differentiability and hardware efficiency. This report characterizes the mathematical properties and empirical behavior of PSiLU without asserting universal superiority.
This report provides:
- A formal description of PSiLU and its relationship to existing activations.
- Mathematical characterization: domain, range, derivatives, gradient bounds, fixed points, monotonicity.
- Behavioral taxonomy in the $(\beta, \delta)$ parameter space.
- Approximation capacity analysis for ReLU, SiLU, and GELU.
- Empirical manifold sweep methodology for identifying stable initialization regions.
- Signal propagation analysis: gain factors and variance dynamics in MLPs and ResNets.
- Descriptive statistics on parameter dynamics and preliminary task performance.
All results are presented as observed. The question of how to encourage full utilization of PSiLU's behavioral diversity remains open.
For input $x \in \mathbb{R}$ and parameters $\beta, \delta \in \mathbb{R}$, PSiLU is defined as

$$f(x; \beta, \delta) = x \cdot \sigma(\beta x + \delta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$

This formulation factors the shift $\delta$ out of the gate argument: for $\beta \neq 0$, $\beta x + \delta = \beta\,(x + \delta/\beta)$, so $\beta$ controls the gate steepness while $\delta$ controls its horizontal offset. A minimal reference implementation:
```python
import torch
import torch.nn as nn


class PSiLU(nn.Module):
    """
    Parametric Sigmoid Linear Unit.

    Formula: f(x; β, δ) = x · σ(βx + δ)
    Parameters β and δ are provided externally (e.g., learned per-channel).
    """
    def forward(self, x: torch.Tensor, beta: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(beta * x + delta)
```
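A usage sketch for the minimal module above; with $\beta = 1$ and $\delta = 0$ it reproduces SiLU exactly (shapes and values are illustrative):

```python
import torch

act = PSiLU()
x = torch.randn(8, 64)          # batch of 8, 64 features
beta = torch.ones(64)           # per-feature steepness
delta = torch.zeros(64)         # per-feature shift

y = act(x, beta, delta)         # with beta=1, delta=0 this equals SiLU(x)
assert torch.allclose(y, torch.nn.functional.silu(x))
```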
A more elaborate implementation:
```python
from typing import Optional, Tuple

import numpy as np
import torch
import torch.nn as nn


def _stratified_init(param: nn.Parameter, low: float, high: float):
    """Fill `param` with a shuffled, evenly spaced grid over [low, high]."""
    n = param.numel()
    grid = torch.linspace(low, high, n)
    idx = torch.randperm(n)
    with torch.no_grad():
        param.copy_(grid[idx])


class PSiLU(nn.Module):
    """
    f(x; β, δ)    = x · σ(β·x + δ)   [self-gated, default]
    f(x, v; β, δ) = x · σ(β·v + δ)   [external gate]
    """
    def __init__(
        self,
        channels: int,
        beta_range: Tuple[float, float] = (-1.0, 5.0),
        delta_range: Tuple[float, float] = (-5.0, 5.0),
        init: str = "stratified",  # "stratified" | "normal" | "uniform" | "silu"
    ):
        super().__init__()
        self.channels = channels
        self.beta_range = beta_range
        self.delta_range = delta_range
        self.init_method = init
        self.beta = nn.Parameter(torch.empty(channels))
        self.delta = nn.Parameter(torch.empty(channels))
        self._init_params()

    def _init_params(self):
        m = self.init_method
        if m == "stratified":
            _stratified_init(self.beta, *self.beta_range)
            _stratified_init(self.delta, *self.delta_range)
        elif m == "normal":
            nn.init.normal_(self.beta, mean=1.0, std=1.0)
            nn.init.normal_(self.delta, mean=0.0, std=1.0)
        elif m == "uniform":
            nn.init.uniform_(self.beta, *self.beta_range)
            nn.init.uniform_(self.delta, *self.delta_range)
        elif m == "silu":
            nn.init.constant_(self.beta, 1.0)
            nn.init.constant_(self.delta, 0.0)
        else:
            raise ValueError(f"Unknown init: {m}")

    def forward(
        self,
        x: torch.Tensor,
        gate_input: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        v = gate_input if gate_input is not None else x
        if v.dim() >= 2:
            # Broadcast per-channel parameters over the last (channel) dimension.
            shape = (1,) * (v.dim() - 1) + (self.channels,)
            beta = self.beta.view(*shape)
            delta = self.delta.view(*shape)
        else:
            beta, delta = self.beta, self.delta
        return x * torch.sigmoid(beta * v + delta)

    @torch.no_grad()
    def get_params(self) -> Tuple[np.ndarray, np.ndarray]:
        return (
            self.beta.detach().cpu().numpy().copy(),
            self.delta.detach().cpu().numpy().copy(),
        )

    def extra_repr(self):
        return f"channels={self.channels}, init={self.init_method}"
```

PSiLU's parameterization relates to existing activation functions as follows:

| Activation | PSiLU Parameterization | Notes |
|---|---|---|
| SiLU / Swish | $\beta = 1$, $\delta = 0$ | Exact recovery |
| Linear (slope $\sigma(\delta)$) | $\beta = 0$ | Exact recovery |
| ReLU | $\beta \to +\infty$, $\delta = 0$ | Limit case (sigmoid → step) |
| GELU | $\beta \approx 1.702$, $\delta = 0$ | Approximate |
| Inverse SiLU | $\beta = -1$, $\delta = 0$ | Exact; non-monotonic regime |
| Zero function | $\delta \to -\infty$ | Limit case (gate saturates to 0) |
Core properties:

| Property | Statement |
|---|---|
| Domain | $x \in \mathbb{R}$ for all $(\beta, \delta) \in \mathbb{R}^2$ |
| Range ($\beta > 0$) | $[\min_x f(x),\ +\infty)$; the minimum is finite and attained at some $x^{*} < 0$ |
| Range ($\beta = 0$) | $\mathbb{R}$; $f(x) = \sigma(\delta)\, x$ is linear |
| Range ($\beta < 0$) | $(-\infty,\ \max_x f(x)]$; the maximum is finite and attained at some $x^{*} > 0$ |
| Gate factor | $\sigma(\beta x + \delta) \in (0, 1)$ for all finite inputs |
| Asymptotic ($\beta > 0$, $x \to +\infty$) | $f(x) \to x$ |
| Asymptotic ($\beta > 0$, $x \to -\infty$) | $f(x) \to 0^-$ |
| Asymptotic ($\beta < 0$, $x \to +\infty$) | $f(x) \to 0^+$ |
| Asymptotic ($\beta < 0$, $x \to -\infty$) | $f(x) \to x$ |
Unlike ReLU (range $[0,\infty)$) or tanh (range $(-1,1)$), PSiLU's range depends on the sign of $\beta$ and the value of $\delta$: the output is unbounded on one side and has a bounded dip on the other. Since the gate satisfies $0 < \sigma(\beta x + \delta) < 1$ for all finite inputs, the output magnitude never exceeds the input magnitude: $|f(x)| < |x|$ for $x \neq 0$.
- Continuity: $f$ is continuous for all $x \in \mathbb{R}$ and all $(\beta, \delta) \in \mathbb{R}^2$.
- Differentiability: $f$ is differentiable to all orders ($C^\infty$) for all $x \in \mathbb{R}$.
- Contrast with ReLU: ReLU is not differentiable at $x = 0$. PSiLU has no such kink; the gate smoothly interpolates around any threshold.
Let $s = \sigma(\beta x + \delta)$. The first derivative is

$$f'(x) = s + \beta x\, s (1 - s).$$

This decomposes into the gate value $s$ plus a correction term $\beta x\, s(1-s)$ that captures the gate's sensitivity to the input. The second derivative (curvature) is:

$$f''(x) = \beta\, s (1 - s)\,\bigl[\,2 + \beta x (1 - 2s)\,\bigr].$$

Setting $f''(x) = 0$ locates the inflection points of the activation.
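A quick sanity check of these closed-form derivatives against autograd (the helper names and test values below are illustrative):

```python
import torch

def psilu(x, beta, delta):
    return x * torch.sigmoid(beta * x + delta)

def analytic_derivs(x, beta, delta):
    s = torch.sigmoid(beta * x + delta)
    f1 = s + beta * x * s * (1 - s)                          # f'(x)  = s + βx·s(1-s)
    f2 = beta * s * (1 - s) * (2 + beta * x * (1 - 2 * s))   # f''(x) = βs(1-s)[2 + βx(1-2s)]
    return f1, f2

x = torch.linspace(-4.0, 4.0, 9, requires_grad=True)
beta, delta = torch.tensor(1.5), torch.tensor(-0.5)

f = psilu(x, beta, delta)
g1 = torch.autograd.grad(f.sum(), x, create_graph=True)[0]   # autograd f'
g2 = torch.autograd.grad(g1.sum(), x)[0]                     # autograd f''

a1, a2 = analytic_derivs(x, beta, delta)
print(torch.allclose(g1, a1, atol=1e-5), torch.allclose(g2, a2, atol=1e-5))
```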
We analyze bounds on $f'(x)$ over $x \in \mathbb{R}$ for fixed $(\beta, \delta)$.

Asymptotic analysis ($\beta > 0$):

- As $x \to +\infty$: $s \to 1$ and $\beta x\, s(1-s) \to 0$, so $f'(x) \to 1$.
- As $x \to -\infty$: $s \to 0$ and $\beta x\, s(1-s) \to 0$, so $f'(x) \to 0$.

Global minimum of $f'$: for $\beta > 0$, the derivative dips slightly below zero at a single negative $x^{*}$ before recovering toward 1; the dip is small (on the order of $-0.1$).

Numerical evaluation (representative values):
| $\beta$ | $x^{*}$ | $f'(x^{*})$ |
|---|---|---|
| 1.0 | -1.278 | -0.0854 |
| 2.0 | -0.760 | -0.101 |
| 5.0 | -0.400 | -0.068 |
Summary bounds:

| Condition | Gradient Range | Practical Implication |
|---|---|---|
| $\beta > 0$, $\delta \approx 0$ | $f'(x) \in [f'(x^{*}), 1)$ where $f'(x^{*}) \approx -0.09$ | Mild negative dip; stable in practice (matches SiLU) |
| $\beta = 0$ | $f'(x) = \sigma(\delta)$ for all $x$ | Constant gradient; no saturation, no vanishing |
| $\beta < 0$, $\delta \ll 0$ | $\max_x f'(x)$ can exceed 1.5 | Potential gradient amplification; monitor during training |
| Large $|\delta|$ (saturated gate) | $f'(x) \approx \sigma(\delta)$; near 0 for $\delta \ll 0$ | Risk of vanishing gradients or dead units |
Fixed points. Solving $f(x) = x$ gives $x\,\sigma(\beta x + \delta) = x$, i.e. $x = 0$ or $\sigma(\beta x + \delta) = 1$. Since the sigmoid never reaches 1 at finite arguments, $x = 0$ is the only exact fixed point; for $\beta > 0$, $f(x) \to x$ asymptotically as $x \to +\infty$, so large positive inputs are approximately preserved.
Monotonicity. PSiLU is not monotone in general.
| Condition | Monotone? | Why |
|---|---|---|
| $\beta = 0$ | Yes, strictly | Linear with positive slope $\sigma(\delta)$ |
| $\beta > 0$ (moderate) | No | Slight negative dip where the gate is partially closed ($\min_x f'(x) \approx -0.1$) |
| $\beta \gg 1$ | Approximately yes | Approaches ReLU |
| $\beta < 0$ | No | Gate inverts; output rises then decays back toward 0 |
PSiLU's two-parameter formulation enables it to approximate several widely used activation functions through appropriate choice of $(\beta, \delta)$.
| Target Activation | Optimized $(\beta, \delta)$ | Approximation Loss (MSE) | Notes |
|---|---|---|---|
| SiLU / Swish | $(1, 0)$ | 0 | Exact recovery by definition |
| GELU | $\approx (1.7, 0)$ | | Close approximation |
| ReLU | Large $\beta$, $\delta \approx 0$ | | Sharp gating regime; residual error from sigmoid smoothness |
| Linear (slope 1) | $\delta \to +\infty$ | | Limit case; approaches identity as gate saturates to 1 |
| Zero function | $\delta \to -\infty$ | | Limit case; gate saturates to 0 |
Observation: PSiLU recovers SiLU exactly and approximates GELU and ReLU with low error using only two scalar parameters. This suggests the smooth gated family is sufficiently expressive to capture key behaviors of common activations within the tested domain.
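A sketch of one way to reproduce such fits: least-squares optimization of $(\beta, \delta)$ against a target activation over a bounded input range. The input domain, optimizer, and step counts below are assumptions, not the report's exact protocol.

```python
import torch
import torch.nn.functional as F

def fit_psilu(target_fn, steps=2000, lr=0.05):
    """Fit (beta, delta) so that x·σ(βx+δ) matches target_fn on x ∈ [-4, 4]."""
    x = torch.linspace(-4.0, 4.0, 801)
    y = target_fn(x)
    beta = torch.tensor(1.0, requires_grad=True)
    delta = torch.tensor(0.0, requires_grad=True)
    opt = torch.optim.Adam([beta, delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = x * torch.sigmoid(beta * x + delta)
        loss = F.mse_loss(pred, y)
        loss.backward()
        opt.step()
    return beta.item(), delta.item(), loss.item()

for name, fn in [("GELU", F.gelu), ("ReLU", F.relu)]:
    b, d, mse = fit_psilu(fn)
    print(f"{name}: beta≈{b:.3f}, delta≈{d:.3f}, MSE≈{mse:.2e}")
```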
Visual analysis of the forward function and its derivatives reveals distinct behavioral regimes:
| Regime | Parameters | Forward Shape | 1st Derivative | 2nd Derivative | Interpretation |
|---|---|---|---|---|---|
| SiLU | $\beta \approx 1$, $\delta \approx 0$ | Smooth rectification | Peak slightly above 1 | Smooth curvature | Standard baseline |
| Hard Gate | $\beta \gg 1$ | Step-like (ReLU-ish) | Sharp spike (Dirac-like) | High-frequency oscillation | Feature selection |
| Linear | $\beta = 0$ | Straight line (slope $\sigma(\delta)$) | Constant | Zero | Identity pass-through |
| Always-On | $\delta \gg 0$ | Near-linear (slope $\approx 1$) | Constant $\approx 1$ | Near zero | Signal preservation |
| Always-Off | $\delta \ll 0$ | Flat ($\approx 0$) | Near zero | Near zero | Channel pruning |
| Inverse | $\beta < 0$ | Non-monotonic (dip) | Negative region | High curvature | Inhibitory dynamics |
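As a quick numerical check of the taxonomy, the snippet below evaluates each regime at representative parameter values; the specific $(\beta, \delta)$ choices are illustrative.

```python
import numpy as np

def psilu(x, beta, delta):
    return x / (1.0 + np.exp(-(beta * x + delta)))

regimes = {
    "SiLU":       (1.0,  0.0),
    "Hard Gate":  (10.0, 0.0),
    "Linear":     (0.0,  0.0),
    "Always-On":  (1.0,  4.0),
    "Always-Off": (1.0, -4.0),
    "Inverse":    (-1.0, 0.0),
}

x = np.linspace(-3.0, 3.0, 601)
for name, (b, d) in regimes.items():
    y = psilu(x, b, d)
    dy = np.gradient(y, x)                      # numerical first derivative
    print(f"{name:10s} min f={y.min():+.3f}  max f={y.max():+.3f}  "
          f"min f'={dy.min():+.3f}  max f'={dy.max():+.3f}")
```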
Heatmaps over the $(\beta, \delta)$ plane summarize forward-pass statistics:
- Rectification Ratio: High $\beta$ + low $\delta$ $\rightarrow$ high rectification (ReLU-like); low $\beta$ $\rightarrow$ symmetric/linear.
- Linearity Index: High $\delta$ ("Always-On") creates linear regions ($R^2 \approx 1$); low $\delta$ creates non-linear regions.
- Gate Saturation: Low $\beta$ + low $\delta$ leads to saturation (output $\approx 0$).
- Negative Output: ReLU-like behavior (no negative output) occurs at high $\beta$ and low $\delta$; standard SiLU/GELU reside in the "has negative output" zone.
Analysis of gradient statistics across the $(\beta, \delta)$ grid reveals:

- Min Gradient (Dead/Dying Risk):
  - Red zones (near 0) occur at extreme $\delta$ values (far from 0).
  - Implication: channels with large $|\delta|$ saturate, leading to vanishing gradients ("dead neurons").
- Max Gradient (Explosion Risk):
  - Green zones (> 1.5) occur primarily when $\beta < 0$ and $\delta < -2$.
  - Implication: "Inverse" configurations with negative shifts are dangerous and can cause gradient explosion. High $\beta$ alone is less risky than negative $\beta$.
- Mean Gradient:
  - Generally stable ($\approx 0.4$ to $0.8$) in the central region ($\beta \in [0, 3]$, $\delta \in [-1, 1]$).
- Fraction of Negative Gradients:
  - Non-zero only when $\beta < 0$ or $\delta$ is extreme.
  - Implication: standard positive-$\beta$ configurations are effectively monotone over the sampled range (the negative-gradient fraction is negligible), supporting stable optimization.
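The per-configuration statistics behind these heatmaps can be reproduced with a small helper; this is a simplified, single-point version of the full sweep script in the appendix, and the input range is an assumption.

```python
import numpy as np

def grad_stats(beta, delta, x_range=(-3.0, 3.0), n=301):
    """Min/max/mean of f'(x) and fraction of negative f' for one (β, δ) pair."""
    x = np.linspace(*x_range, n)
    s = 1.0 / (1.0 + np.exp(-(beta * x + delta)))
    g = s + beta * x * s * (1.0 - s)            # f'(x) = s + βx·s(1-s)
    return {
        "min": g.min(),
        "max": g.max(),
        "mean": g.mean(),
        "frac_negative": float((g < 0).mean()),
    }

print(grad_stats(1.0, 0.0))     # SiLU-like: mild negative dip, max slightly above 1
print(grad_stats(-1.0, -3.0))   # inverse gate with negative shift: larger max gradient
```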
The empirical gain factor $g$ measures how PSiLU rescales the magnitude of a unit-variance Gaussian input per layer; values below 1 indicate signal shrinkage, and $1/g^2$ gives the weight-variance compensation that would restore unit scale.
Visual Gain Surface:

- Low Gain Valley: a broad valley of low gain ($g < 0.4$) exists for $\beta < 0$ and $\delta < 0$.
- High Gain Ridge: gain $> 1.0$ is only achieved in specific regions (e.g., $\beta \approx -1$, $\delta \approx 3$), and these often coincide with instability risks.
- Standard Region: for $\beta \in [1, 2]$, $\delta = 0$, gain is consistently $\approx 0.56$ to $0.60$.
Numerical Gain Table:
| $\beta$ | $\delta$ | Gain $g$ | $1/g^2$ | Regime |
|---|---|---|---|---|
| -1.00 | 0.00 | 0.5615 | 3.17 | Inverse gate |
| 0.00 | 0.00 | 0.5000 | 4.00 | Linear |
| 1.00 | 0.00 | 0.5609 | 3.18 | SiLU (baseline) |
| 1.78 | 0.00 | 0.5857 | 2.91 | GELU-like |
| 10.00 | 0.00 | 0.5864 | 2.91 | Hard gate |
| 1.00 | 3.00 | 0.8939 | 1.25 | Always-on (high gain) |
| 1.00 | -3.00 | 0.1642 | 37.10 | Always-off (low gain) |
Observations:
- PSiLU is inherently variance-shrinking for standard configurations ($g \approx 0.56$).
- Standard He/Xavier initialization may lead to signal collapse without compensation.
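A Monte Carlo sketch for reproducing the gain values; treating $g$ as the output-to-input standard-deviation ratio under $x \sim \mathcal{N}(0,1)$ is an assumption consistent with the tabulated numbers.

```python
import torch

@torch.no_grad()
def empirical_gain(beta, delta, n=1_000_000, seed=0):
    """Estimate g = std[f(x)] / std[x] for x ~ N(0, 1)."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(n, generator=gen)
    y = x * torch.sigmoid(beta * x + delta)
    g = (y.std() / x.std()).item()
    return g, 1.0 / g ** 2                     # gain and suggested variance compensation

for beta, delta in [(1.0, 0.0), (1.78, 0.0), (1.0, 3.0), (1.0, -3.0)]:
    g, comp = empirical_gain(beta, delta)
    print(f"beta={beta:5.2f} delta={delta:5.2f}  g≈{g:.3f}  1/g²≈{comp:.2f}")
```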
Empirical analysis compares initialization strategies in shallow MLPs:

| Strategy | Diversity Score | Median Var Ratio | Visual Characteristic |
|---|---|---|---|
| Manifold/Stratified | 0.212 | 0.350 | Wide spread of $(\beta, \delta)$ values |
| Normal | 0.190 | 0.329 | Clustered around mean |
| Uniform | 0.175 | 0.351 | Random spread; may hit unstable zones |
| Static SiLU | 0.000 | 0.313 | All channels identical |
Observations:
- Static initialization (all channels = SiLU) leads to negligible parameter drift; the model stays stuck in the "SiLU basin".
- Stratified/Manifold initialization preserves parameter diversity throughout training, suggesting that heterogeneous activation behaviors are maintained rather than collapsing to a single shape.
- Stratified PSiLU: starts with ~27% dead channels (due to aggressive $\delta$ sampling); recovers to ~10% after training.
- ReLU: starts with ~27% dead channels; recovers to ~7.5%.
- Static SiLU: Starts with ~10% dead channels. Stays at 0% (stable but less flexible).
- Insight: Stratified PSiLU allows the network to dynamically prune and activate channels during training, acting as a form of learnable capacity allocation.
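A sketch of how the dead-channel fraction might be tracked, assuming the elaborate PSiLU module above; the criterion (mean gate below a small threshold on a reference batch) and the threshold value are assumptions, not the report's exact definition.

```python
import torch

@torch.no_grad()
def dead_channel_fraction(act: "PSiLU", x: torch.Tensor, thresh: float = 0.05) -> float:
    """Fraction of channels whose mean gate activation stays below `thresh`."""
    gate = torch.sigmoid(act.beta * x + act.delta)          # broadcast over channel (last) dim
    mean_gate = gate.mean(dim=tuple(range(x.dim() - 1)))    # reduce all dims except channels
    return float((mean_gate < thresh).float().mean())

act = PSiLU(channels=128, init="stratified")
x = torch.randn(1024, 128)
print(f"dead fraction ≈ {dead_channel_fraction(act, x):.2%}")
```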
3.8.1 Plain MLP (No Residual Connections)
| Activation | Final Variance | Grad Norm (final) | Final Loss |
|---|---|---|---|
| ReLU | 0.0002 | 0.000125 | 0.148 | |
| PSiLU-stratified | 0.0000 | 0.000013 | 0.160 | |
| PSiLU-normal | 0.0000 | 0.000001 | 0.179 | |
| PSiLU-silu-init | 0.0000 | ~0 | 0.205 | |
| SiLU / GELU | 0.0000 | ~0 | 0.175–0.211 |
Plain MLP (no residuals): all activations suffer exponential variance decay in plain MLPs deeper than about 10 layers. PSiLU mitigates the decay slightly by including high-gain channels, but remains susceptible to vanishing signals without normalization.
3.8.2 Residual MLP (With $y = x + f(x)$)

| Activation | Final Variance (Layer 20) | Variance Growth Factor |
|---|---|---|
| ReLU | 803,650 | ~2.0× per layer |
| PSiLU-silu-init | 408,607 | ~1.8× per layer |
| SiLU | 281,306 | ~1.7× per layer |
| GELU | 419,049 | ~1.8× per layer |
| PSiLU-stratified | 32,974 | ~1.4× per layer |
| PSiLU-normal | 30,712 | ~1.4× per layer |
Residual connections cause variance explosion: $\mathrm{Var}(y) \approx \mathrm{Var}(x) + \mathrm{Var}(f(x))$.
- ReLU explodes fastest (all neurons active).
- PSiLU-stratified explodes slowest because it includes "Always-off" channels where $f(x) \approx 0$, making those residual blocks act closer to identity maps.
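A minimal sketch of the residual variance-tracking experiment, assuming the elaborate PSiLU module above; width, depth, batch size, and initialization are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 20, 256
layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
acts = nn.ModuleList([PSiLU(width, init="stratified") for _ in range(depth)])

x = torch.randn(512, width)
with torch.no_grad():
    for i, (lin, act) in enumerate(zip(layers, acts), start=1):
        x = x + act(lin(x))                     # residual block: y = x + f(Wx)
        print(f"layer {i:2d}: variance = {x.var().item():.1f}")
```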
We perform a high-resolution grid sweep over the $(\beta, \delta)$ plane, accepting a coordinate only if it satisfies all of the following constraints:
- Gradient Stability: $\min_x f'(x) > -0.5$ and $\max_x f'(x) < 2.2$.
- Active Gating: $\mathrm{std}_x[\sigma(\cdot)] > 0.08$ (ensures the gate actually switches).
- Dynamic Range: $\mathrm{mean}_x[\sigma(\cdot)] \in (0.05, 0.95)$ (avoids saturation).
Total valid coordinates: 163441
Observations:
- The valid region is non-convex and excludes extreme negative $\beta$ (explosion risk) and large $|\delta|$ (saturation).
- SiLU and GELU reference points lie within the manifold, confirming they are stable starting points.
- The manifold includes diverse regimes: linear, hard gating, and shifted thresholds.
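Given the psilu_manifold.npz file saved by the sweep script in the appendix, one possible way to draw per-channel initializations from the valid region is shown below; the uniform sampling scheme is an assumption, not necessarily the report's stratified procedure.

```python
import numpy as np
import torch

data = np.load("psilu_manifold.npz")
betas, deltas = data["betas"], data["deltas"]      # all valid (β, δ) grid coordinates

def sample_manifold_init(channels: int, seed: int = 0):
    """Draw per-channel (β, δ) pairs uniformly from the valid manifold coordinates."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(betas), size=channels, replace=True)
    return torch.as_tensor(betas[idx]), torch.as_tensor(deltas[idx])

beta_init, delta_init = sample_manifold_init(channels=128)
act = PSiLU(channels=128, init="silu")             # start from SiLU, then overwrite
with torch.no_grad():
    act.beta.copy_(beta_init)
    act.delta.copy_(delta_init)
```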
Note: These results are presented for illustrative purposes only. Tasks are synthetic and shallow; no claims of generalizability or superiority are made.
| Activation | Final Loss | Final Grad Norm | Final Residual Norm |
|---|---|---|---|
| ReLU | 0.148 | 0.288 (unstable) | 7.85 (high) |
| PSiLU-stratified | 0.160 | 0.128 (stable) | 4.38 |
| PSiLU-silu-init | 0.205 | 0.095 | 3.29 (lowest) |
| SiLU | 0.211 | 0.094 | 3.17 |
PSiLU variants produce more stable gradient norms than ReLU (less spiky), at the cost of slightly higher final loss on this simple task. The trade-off may favor PSiLU in deeper, more complex networks where gradient stability is critical.
| Activation | Final Loss | Final Mean $\beta$ | Final Mean $\delta$ |
|---|---|---|---|
| PSiLU-stratified | 4.1616 | 2.002 | 0.006 |
| PSiLU-silu-init | 4.1607 | 1.004 | 0.007 |
| ReLU / SiLU / GELU | ~4.161 | NaN | NaN |
Findings:
- All activations converge to nearly identical loss in this small-scale setting.
- PSiLU-silu-init parameters remain fixed near $(1, 0)$, confirming the "SiLU basin" stability.
- PSiLU-stratified parameters drift to $\beta \approx 2.0$, suggesting the model finds value in sharper gating for attention/FFN layers.
- Static Init (SiLU): Parameters show negligible drift. The model does not learn to deviate from the initial SiLU shape.
- Stratified Init: Parameters remain diverse throughout training. Channels specialize: some become linear ($\beta \approx 0$), some become hard gates ($\beta \gg 1$), and some shift ($\delta \neq 0$).
Note: These results are presented for illustrative purposes only. Tasks are synthetic and shallow; no claims of generalizability or superiority are made.
Setup: 2-layer MLP (width=128) trained on a synthetic regression task.


| Activation | Validation Loss | Final Grad Norm | Dead Neuron % |
|---|---|---|---|
| PSiLU (Manifold) | 0.0146 | 0.0120 | 2.98% |
| ReLU | 0.4526 | 0.0040 | N/A |
| SiLU | 0.7914 | 0.0037 | N/A |
Parameter Dynamics (PSiLU):
- Mean Absolute Beta Drift: 0.0273
- Mean Absolute Delta Drift: 0.0314
- Observation: Parameters barely moved, suggesting the initialization manifold itself provided a favorable starting point.
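The drift statistics can be computed from get_params() snapshots taken before and after training; a minimal sketch (snapshot handling is an assumption):

```python
import numpy as np

def parameter_drift(before, after):
    """Mean absolute per-channel change in (β, δ) between two get_params() snapshots."""
    (b0, d0), (b1, d1) = before, after
    return {
        "mean_abs_beta_drift": float(np.mean(np.abs(b1 - b0))),
        "mean_abs_delta_drift": float(np.mean(np.abs(d1 - d0))),
    }

# Example usage around a training loop:
# before = act.get_params()
# ... train ...
# after = act.get_params()
# print(parameter_drift(before, after))
```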
Setup: MLP on 10-class synthetic data with low-dimensional features, 100 epochs.
| Activation | Validation Loss | Final Grad Norm | Dead Neuron % |
|---|---|---|---|
| PSiLU (Manifold) | 0.0971 | 0.00017 | 13.47% |
| ReLU | 0.3254 | 0.00033 | N/A |
| SiLU | 0.2628 | 0.00169 | N/A |
Parameter Dynamics (PSiLU):
- Mean Absolute Beta Drift: 0.0193
- Mean Absolute Delta Drift: 0.0195
- Dead Neuron Change: +5.38% (Initial 8.09% → Final 13.47%)
- Observation: The increase in dead neurons suggests adaptive capacity allocation, where the network prunes unnecessary channels via $\delta$ shifts.
- Regression: PSiLU exhibited higher gradient norms than ReLU/SiLU, correlating with lower loss.
- Classification: PSiLU exhibited lower gradient norms than ReLU/SiLU but achieved lower loss, suggesting more efficient gradient flow (less wasted gradient).
PSiLU offers a minimal, flexible extension of the smooth gated activation family, accepting dynamic parameters $\beta$ (gate steepness) and $\delta$ (gate shift) that can be provided externally or learned per channel.
The empirical manifold sweep methodology identifies a subspace of the $(\beta, \delta)$ plane that combines gradient stability with diverse gating behavior, providing a principled region for initialization.
We invite further investigation into:
- Mechanisms that promote meaningful adaptation in learnable activations (e.g., specialized regularization, meta-learning).
- The relationship between PSiLU's output statistics and layer-wise normalization requirements.
- Scalability to large-scale architectures and real-world benchmarks.
- Theoretical analysis of the optimization landscape for the $(\beta, \delta)$ parameters.
This report serves as a technical description to facilitate reproducibility and discussion. All observations are preliminary; no claims of superiority or universal applicability are made.
PSiLU sits at the intersection of several research directions focused on activation flexibility, signal propagation, and functional diversity.
| Category | Related Work | PSiLU Position |
|---|---|---|
| Gated Activations | Swish (Ramachandran et al., 2017), GELU (Hendrycks & Gimpel, 2023) | PSiLU extends the Swish/SiLU family with per-channel learnable steepness $\beta$ and shift $\delta$. |
| Parametric Activations | PReLU (He et al., 2015), ACON (Ma et al., 2021) | PSiLU adds both steepness and shift parameters, enabling richer shape adaptation including non-monotonic regimes. |
| Activation Diversity Strategies | Stochastic Activations (Lomeli et al., 2025) | PSiLU employs diverse parameter initialization to encourage functional heterogeneity. While Stochastic Activations use runtime randomness to select between discrete functions (e.g., SiLU/ReLU), PSiLU achieves diversity through continuous parameter space exploration. |
| Adaptive Function Search | KAN (Liu et al., 2025), PAU (Molina et al., 2020) | PSiLU bounds the search to the smooth gated family, prioritizing efficiency and stability over unbounded expressivity. |
Contextual Notes:
- Vs. Stochastic Activations: Lomeli et al. (2025) propose randomly selecting between non-linear functions during training to circumvent optimization issues (e.g., dead ReLU neurons) and increase generation diversity. PSiLU shares the goal of introducing functional diversity but achieves this via deterministic, learnable continuous parameters rather than runtime stochastic selection.
- Vs. PReLU: While PReLU learns a slope for negative inputs, PSiLU learns the gating shape itself, allowing channels to transition between linear, gated, and suppressed regimes.
- Vs. KAN/PAU: Kolmogorov-Arnold Networks and Padé Activation Units explore broader function spaces (splines, rational functions). PSiLU restricts the search space to smooth gated linear units to maintain hardware efficiency and optimization stability.
This document is provided as a technical resource for the community. If the characterization, empirical findings, or implementation details presented here contribute to your research or development, we welcome attribution via the citation below. We also welcome feedback, corrections, and collaboration regarding further characterization of PSiLU.
BibTeX Citation:

```bibtex
@misc{psilu2026,
title = {PSiLU: Parametric Sigmoid Linear Unit: A Characterization},
author = {[GP Bayu]},
year = {2026},
howpublished = {\url{https://github.com/gbyuvd/PSiLU}},
note = {Technical Report Draft, Version 1.1}
}
```

References:

```bibtex
@misc{ramachandran2017searchingactivationfunctions,
title={Searching for Activation Functions},
author={Prajit Ramachandran and Barret Zoph and Quoc V. Le},
year={2017},
eprint={1710.05941},
archivePrefix={arXiv},
primaryClass={cs.NE},
url={https://arxiv.org/abs/1710.05941},
}
@misc{hendrycks2023gaussianerrorlinearunits,
title={Gaussian Error Linear Units (GELUs)},
author={Dan Hendrycks and Kevin Gimpel},
year={2023},
eprint={1606.08415},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1606.08415},
}
@misc{he2015delvingdeeprectifierssurpassing,
title={Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
author={Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun},
year={2015},
eprint={1502.01852},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/1502.01852},
}
@misc{ma2021activatenotlearningcustomized,
title={Activate or Not: Learning Customized Activation},
author={Ningning Ma and Xiangyu Zhang and Ming Liu and Jian Sun},
year={2021},
eprint={2009.04759},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2009.04759},
}
@misc{lomeli2025stochasticactivations,
title={Stochastic activations},
author={Maria Lomeli and Matthijs Douze and Gergely Szilvasy and Loic Cabannes and Jade Copet and Sainbayar Sukhbaatar and Jason Weston and Gabriel Synnaeve and Pierre-Emmanuel Mazaré and Hervé Jégou},
year={2025},
eprint={2509.22358},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2509.22358},
}
@misc{molina2020padeactivationunitsendtoend,
title={Pad\'e Activation Units: End-to-end Learning of Flexible Activation Functions in Deep Networks},
author={Alejandro Molina and Patrick Schramowski and Kristian Kersting},
year={2020},
eprint={1907.06732},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1907.06732},
}
@misc{liu2025kankolmogorovarnoldnetworks,
title={KAN: Kolmogorov-Arnold Networks},
author={Ziming Liu and Yixuan Wang and Sachin Vaidya and Fabian Ruehle and James Halverson and Marin Soljačić and Thomas Y. Hou and Max Tegmark},
year={2025},
eprint={2404.19756},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2404.19756},
}
```

Appendix: manifold sweep script.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter


def empirical_psilu_sweep(res_beta=500, res_delta=500, res_x=300):
    """
    High-resolution sweep with refined constraints to identify a stable, diverse (β, δ) subspace.
    Returns valid beta/delta coordinates and the boolean validity mask.
    """
    print(f"Starting Refined High-Res Sweep ({res_beta}x{res_delta})...")

    # 1. Setup grid
    beta_vals = np.linspace(-2.0, 8.0, res_beta, dtype=np.float32)
    delta_vals = np.linspace(-6.0, 6.0, res_delta, dtype=np.float32)
    B, D = np.meshgrid(beta_vals, delta_vals)
    x_vals = np.linspace(-3.0, 3.0, res_x, dtype=np.float32)

    # 2. Pre-allocate 2D metric arrays
    max_grad = np.zeros_like(B)
    min_grad = np.zeros_like(B)
    gate_std = np.zeros_like(B)
    mean_gate = np.zeros_like(B)

    # 3. Chunked processing (memory-efficient)
    chunk_size = 50
    for start_row in range(0, res_delta, chunk_size):
        end_row = min(start_row + chunk_size, res_delta)
        b_chunk = B[start_row:end_row, :, np.newaxis]
        d_chunk = D[start_row:end_row, :, np.newaxis]
        x_slice = x_vals[np.newaxis, np.newaxis, :]

        z = b_chunk * x_slice + d_chunk
        sig_z = 1.0 / (1.0 + np.exp(-z))
        g = sig_z + x_slice * b_chunk * sig_z * (1.0 - sig_z)   # f'(x) = s + βx·s(1-s)

        max_grad[start_row:end_row, :] = np.max(g, axis=2)
        min_grad[start_row:end_row, :] = np.min(g, axis=2)
        gate_std[start_row:end_row, :] = np.std(sig_z, axis=2)
        mean_gate[start_row:end_row, :] = np.mean(sig_z, axis=2)

    # 4. Refined constraints
    safe_flow = (max_grad < 2.2) & (min_grad > -0.5)            # Gradient stability
    active_gating = gate_std > 0.08                             # Gate switches across x
    dynamic_range = (mean_gate > 0.05) & (mean_gate < 0.95)     # Avoid saturation
    validity_mask = safe_flow & active_gating & dynamic_range

    # Apply smoothing for a clean visual contour
    validity_mask_smoothed = gaussian_filter(validity_mask.astype(np.float32), sigma=1.5) > 0.5

    # 5. Visualization (optional)
    fig, axes = plt.subplots(1, 4, figsize=(26, 6))
    extent = [-2, 8, -6, 6]

    def add_overlay(ax):
        ax.contour(B, D, validity_mask_smoothed, levels=[0.5], colors='white', linewidths=1.2)

    im1 = axes[0].imshow(min_grad, extent=extent, origin='lower', aspect='auto', cmap='RdYlGn')
    add_overlay(axes[0]); axes[0].set_title("Min Gradient Field"); plt.colorbar(im1, ax=axes[0])

    im2 = axes[1].imshow(gate_std, extent=extent, origin='lower', aspect='auto', cmap='plasma')
    add_overlay(axes[1]); axes[1].set_title("Gate Std Dev"); plt.colorbar(im2, ax=axes[1])

    im3 = axes[2].imshow(mean_gate, extent=extent, origin='lower', aspect='auto', cmap='coolwarm')
    add_overlay(axes[2]); axes[2].set_title("Mean Gate (Bias)"); plt.colorbar(im3, ax=axes[2])

    axes[3].imshow(validity_mask_smoothed, extent=extent, origin='lower', aspect='auto', cmap='gray')
    axes[3].scatter([1.0], [0.0], color='cyan', marker='*', s=200, label="SiLU", edgecolors='black')
    axes[3].scatter([1.702], [0.0], color='lime', marker='*', s=200, label="GeLU", edgecolors='black')
    axes[3].set_title("FINAL SAMPLING MANIFOLD"); axes[3].legend()

    for ax in axes:
        ax.set_xlabel("Beta (β)"); ax.set_ylabel("Delta (δ)")
    plt.tight_layout(); plt.show()

    return B[validity_mask_smoothed], D[validity_mask_smoothed], validity_mask_smoothed


if __name__ == "__main__":
    valid_b, valid_d, mask = empirical_psilu_sweep()
    np.savez("psilu_manifold.npz", betas=valid_b, deltas=valid_d)
    print(f"Manifold analysis complete. Total valid coordinates: {len(valid_b)}")
```

This report is a technical description intended to facilitate reproducibility and discussion. All mathematical derivations, empirical observations, and code snippets are provided as-is. No claims of superiority, universal applicability, or production readiness are made. Further validation across diverse architectures, datasets, and training regimes is encouraged.
