Virtual Parameter Synthesis (VPS) experiments for improving the internal mathematical and reasoning capabilities of LLMs
The complete set of tests and experiments is available in this Colab notebook: https://colab.research.google.com/drive/1WpErVF1rGpzTJi_vm1doaG3N1_aBT1eS.
This VPS configuration is also used in our Idef-mathematics A6 model, which is built on top of Qwen.
In internal evaluations, A6 achieved a +4–5% improvement in reasoning accuracy on IMO-inspired geometric problems compared to the previous A5 model.
🔗 Learn more about our Geometry Intelligence system here:
https://idef-mathematics.com/
Virtual Parameter Synthesis (VPS) is an inference-time technique that augments frozen transformer linear layers with dynamic low-rank perturbations. Unlike fine-tuning or standard LoRA, VPS constructs its low-rank factors on-the-fly from activation statistics and optional gradient signals, enabling test-time adaptation without persistent parameter updates. This document provides a rigorous analysis of the mathematical foundations, architectural design, and mechanisms by which VPS may improve reasoning performance.
For a standard linear layer with weight matrix $W \in \mathbb{R}^{d_{out} \times d_{in}}$ and input batch $x \in \mathbb{R}^{N \times d_{in}}$, VPS computes

$$y = xW^\top + \gamma\,(xA)B^\top$$

where:

- $A \in \mathbb{R}^{d_{in} \times r}$ and $B \in \mathbb{R}^{d_{out} \times r}$ are dynamically constructed low-rank factors
- $\gamma \in [0, 1]$ is a scaling coefficient (adaptive or fixed)
- $r \ll \min(d_{in}, d_{out})$ is the perturbation rank
The factors are derived from the frozen weights via

$$A = W^\top V, \qquad B = W U$$

where $U \in \mathbb{R}^{d_{in} \times r}$ and $V \in \mathbb{R}^{d_{out} \times r}$ are selector matrices produced by the builder subsystem described below (one column per selected dimension).
Substituting the definitions, the full output becomes

$$y = xW^\top + \gamma\,(xW^\top V)(WU)^\top$$

This is equivalent to applying an effective weight matrix

$$W_{\text{eff}} = W + \gamma\, W U V^\top W$$

Key observation: the perturbation $\delta W = \gamma\, W U V^\top W$ is built entirely from the frozen weights themselves, rerouted through the selectors $U$ and $V$; no new parameters are learned or stored.

When $U$ and $V$ are one-hot column selectors (as in the SK builder), $WU$ extracts the columns of $W$ at the selected input indices and $V^\top W$ extracts the rows of $W$ at the selected output indices, so each rank-1 term is an outer product of an existing column and row of $W$.

This selectively amplifies the interaction between specific input features (selected by $U$) and specific output features (selected by $V$), sharpening the weight matrix along the directions most relevant to the current input.
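Below is a minimal sketch of the core computation for a single wrapped layer, assuming the selectors $U, V$ have already been produced by a builder (the function name `vps_forward` is illustrative; the actual wrapper is `VPSLinear` in `vps_linear.py`):

```python
import torch

def vps_forward(x: torch.Tensor, W: torch.Tensor, U: torch.Tensor, V: torch.Tensor,
                gamma: float = 0.5) -> torch.Tensor:
    """Core VPS computation for one linear layer (sketch; bias omitted).

    x: (N, d_in) input, W: (d_out, d_in) frozen weight,
    U: (d_in, r) and V: (d_out, r) selector matrices from the builder.
    """
    A = W.T @ V                      # (d_in, r)   A = W^T V
    B = W @ U                        # (d_out, r)  B = W U
    delta = (x @ A) @ B.T            # (N, d_out)  low-rank perturbation
    return x @ W.T + gamma * delta   # y = x W^T + gamma * delta
```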
The builder subsystem determines which input/output dimensions receive perturbation. Three strategies are implemented: SK (activation-guided top-$k$ selection), SC (ridge-coupled refinement of the SK selection), and a hybrid that switches between the two.
SK builder algorithm:

- Compute input activation scores: $s_{in}^{(j)} = \frac{1}{N}\sum_{i=1}^N |x_{ij}|$
- Compute output activation scores: $s_{out}^{(k)} = \frac{1}{N}\sum_{i=1}^N |[xW^\top]_{ik}|$
- Select the top-$k$ indices for each: $\mathcal{I}_{in} = \text{top-}k(s_{in})$, $\mathcal{I}_{out} = \text{top-}k(s_{out})$
- Construct one-hot selector matrices: $U_{j,c} = \mathbf{1}[j = \mathcal{I}_{in}^{(c)}]$, $V_{k,c} = \mathbf{1}[k = \mathcal{I}_{out}^{(c)}]$
Mathematical rationale: By selecting high-activation dimensions, VPS focuses its perturbation budget on the feature subspace most relevant to the current input. This is a form of activation-guided sparsity that avoids wasting capacity on dormant dimensions.
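A compact sketch of this selection step, assuming a PyTorch implementation (the helper name `sk_build` is hypothetical; the real logic lives in `builders.py`):

```python
import torch

def sk_build(x: torch.Tensor, W: torch.Tensor, k: int):
    """Activation-guided top-k selector construction (SK builder sketch).

    x: (N, d_in) input activations, W: (d_out, d_in) frozen weight.
    Returns one-hot selectors U (d_in, k) and V (d_out, k).
    """
    d_out, d_in = W.shape
    s_in = x.abs().mean(dim=0)            # (d_in,)  mean |x| per input dimension
    s_out = (x @ W.T).abs().mean(dim=0)   # (d_out,) mean |xW^T| per output dimension
    idx_in = torch.topk(s_in, k).indices
    idx_out = torch.topk(s_out, k).indices
    cols = torch.arange(k, device=W.device)
    U = torch.zeros(d_in, k, dtype=W.dtype, device=W.device)
    V = torch.zeros(d_out, k, dtype=W.dtype, device=W.device)
    U[idx_in, cols] = 1.0                 # one column per selected input index
    V[idx_out, cols] = 1.0                # one column per selected output index
    return U, V
```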
The SC builder refines the SK selection by solving a ridge regression that couples input activations to output activations:

$$\min_{T \in \mathbb{R}^{r \times r}} \;\|X_A T - Y\|_F^2 + \alpha \|T\|_F^2$$

Algorithm (after obtaining the SK selections):

- Extract compact views: $X_A \in \mathbb{R}^{N \times r}$ (activations at the selected input indices) and $Y \in \mathbb{R}^{N \times r}$ (outputs at the selected output indices)
- Solve the ridge system: $(X_A^\top X_A + \alpha I)\,T = X_A^\top Y$
- Mix the columns of $V$: $\tilde{V} = V T^\top$, then column-normalize $\tilde{V}$
Mathematical rationale: This solves a local least-squares problem that finds a linear coupling $T$ between the selected input activations and the selected output activations, so that the mixed selector $\tilde{V}$ emphasizes output directions that are actually predictable from the chosen inputs. The ridge regularization $\alpha I$ keeps the system well-conditioned even when $X_A^\top X_A$ is nearly singular (small batches or strongly correlated features).
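A sketch of the ridge step, assuming PyTorch tensors (the helper name `sc_mix` is hypothetical):

```python
import torch

def sc_mix(X_A: torch.Tensor, Y: torch.Tensor, V: torch.Tensor, alpha: float = 1e-3):
    """Ridge-coupled column mixing (SC builder sketch).

    X_A: (N, r) activations at selected input indices,
    Y:   (N, r) outputs at selected output indices,
    V:   (d_out, r) one-hot output selector from the SK step.
    """
    r = X_A.shape[1]
    eye = torch.eye(r, dtype=X_A.dtype, device=X_A.device)
    T = torch.linalg.solve(X_A.T @ X_A + alpha * eye, X_A.T @ Y)   # (r, r) coupling matrix
    V_mixed = V @ T.T                                              # mix columns of V
    return V_mixed / V_mixed.norm(dim=0, keepdim=True).clamp_min(1e-8)
```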
The hybrid builder uses SC when gradient information is available (indicating an optimization context) and otherwise falls back to SK, providing adaptive complexity based on the available signals.
To prevent the low-rank perturbation from destabilizing the forward pass, each rank-1 component is clipped so that

$$\|A_{:,i}\|_2 \,\|B_{:,i}\|_2 \le \tau, \qquad i = 1, \dots, r$$

This ensures that no single rank-1 term can dominate the base computation.

Mathematical rationale: The spectral norm of the perturbation is bounded by the sum of its rank-1 terms: $\|\gamma B A^\top\|_2 \le \gamma \sum_{i=1}^r \|A_{:,i}\|_2 \,\|B_{:,i}\|_2$. By bounding each rank-1 term, we obtain $\|\gamma B A^\top\|_2 \le \gamma\, r\, \tau$, which keeps the perturbation small relative to the frozen weights.
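A minimal sketch of the clipping step (the symmetric rescaling of both factors is an illustrative choice; `spectral_clip` is a hypothetical name):

```python
import torch

def spectral_clip(A: torch.Tensor, B: torch.Tensor, tau: float = 0.8):
    """Clip each rank-1 component so that ||A_i|| * ||B_i|| <= tau (sketch).

    A: (d_in, r), B: (d_out, r). Both factors are rescaled symmetrically.
    """
    norm_prod = A.norm(dim=0) * B.norm(dim=0)                       # (r,) per-component norm products
    scale = torch.sqrt((tau / norm_prod.clamp_min(1e-12)).clamp(max=1.0))
    return A * scale, B * scale                                     # product becomes min(norm_prod, tau)
```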
The policy module computes a batch-level "energy" statistic:

$$E = \frac{1}{N\,d_{in}} \sum_{i=1}^{N} \sum_{j=1}^{d_{in}} |x_{ij}|$$

where $x \in \mathbb{R}^{N \times d_{in}}$ is the current input batch. A saturating nonlinearity maps the energy $E$ to a scale $\sigma \in [0, 1]$:

- Low energy (near-zero activations) → $\sigma \approx 0$ → minimal perturbation
- High energy (strongly activated) → $\sigma \approx 1$ → full perturbation
Token-level entropy $H_t = -\sum_v p_t(v)\,\log p_t(v)$, computed from the model's next-token distribution, is fed to the policy as an additional uncertainty signal.
Rationale: High entropy indicates model uncertainty. In uncertain states, increasing the perturbation may help the model escape local optima in its implicit reasoning trajectory. This is analogous to simulated annealing where temperature is raised in regions of high uncertainty.
Given bounds $[\gamma_{\min}, \gamma_{\max}]$ (and analogous bounds for the rank), the policy interpolates with the scale $\sigma$:

$$\gamma = \gamma_{\min} + \sigma\,(\gamma_{\max} - \gamma_{\min})$$
This provides smooth, input-dependent hyperparameter scheduling without discrete mode switching.
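An illustrative sketch of this scheduling, assuming a tanh squashing of the energy and a simple entropy boost (both specific choices are assumptions for the example, not the documented formulas):

```python
import math
import torch

def adaptive_gamma(x: torch.Tensor, entropy: float,
                   gamma_min: float = 0.0, gamma_max: float = 0.5) -> float:
    """Energy/entropy-driven gamma scheduling (sketch).

    x: (N, d_in) current input batch; entropy: mean token-level entropy in nats.
    """
    energy = x.abs().mean().item()                     # batch-level activation energy
    sigma = math.tanh(energy)                          # squash to [0, 1)
    sigma = min(1.0, sigma * (1.0 + 0.1 * entropy))    # raise perturbation under uncertainty
    return gamma_min + sigma * (gamma_max - gamma_min)
```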
The system maintains a small memory (a short history of curvature pairs $(s_i, y_i)$ gathered across recent inference steps), where:

- $s_i$ = step direction from previous iterations
- $y_i$ = gradient difference between consecutive iterations

The standard L-BFGS two-loop recursion is applied to precondition the delta:

$$\tilde{\delta} = H_m\,\delta$$

where $H_m$ is the implicit inverse-Hessian approximation built from the stored $(s_i, y_i)$ pairs.
This is an unusual application of L-BFGS—typically used in optimization, here repurposed for inference-time conditioning. The intuition is that the curvature information accumulated across inference steps captures second-order structure of the loss landscape. Preconditioning the perturbation by this approximation may align it with directions of high curvature (rapid change), potentially amplifying informative gradient directions.
Caveat: In the provided code, this is applied with a small scale factor, so its influence on the forward pass is deliberately limited.
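For reference, a self-contained sketch of the standard two-loop recursion applied to a flattened perturbation vector (the surrounding bookkeeping in `ephemeral_lbfgs.py` is omitted):

```python
import torch

def two_loop_precondition(delta: torch.Tensor, s_hist, y_hist) -> torch.Tensor:
    """Apply the L-BFGS two-loop recursion to a perturbation vector (sketch).

    delta: flattened perturbation; s_hist / y_hist: lists of curvature pairs,
    ordered oldest to newest.
    """
    q = delta.clone()
    stack = []
    for s, y in reversed(list(zip(s_hist, y_hist))):      # newest pair first
        rho = 1.0 / y.dot(s).clamp_min(1e-12)
        alpha = rho * s.dot(q)
        q = q - alpha * y
        stack.append((rho, s, y, alpha))
    if s_hist:                                            # initial scaling H0 = (s^T y / y^T y) I
        s, y = s_hist[-1], y_hist[-1]
        q = q * (s.dot(y) / y.dot(y).clamp_min(1e-12))
    for rho, s, y, alpha in reversed(stack):              # oldest pair first
        beta = rho * y.dot(q)
        q = q + (alpha - beta) * s
    return q
```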
The CompositeVerifier computes a weighted loss across multiple objectives:
| Component | Description | Weight (default) |
|---|---|---|
| `numeric` | Numerical agreement of the final answer | 1.0 |
| `units` | Dimensional consistency check via `pint` | 0.5 |
| `self_consistency` | Variance across multiple samples | 0.3 |
| `algebraic` | Structural match via `sympy` simplification | 0.2 |
Total loss: $\mathcal{L}_{\text{verify}} = \sum_k w_k\,\mathcal{L}_k$, the weighted sum of the component losses above.
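A minimal sketch of that weighted combination (the dictionary-based interface is an assumption; the actual class is `CompositeVerifier` in `composite_verifier.py`):

```python
# Default weights follow the table above.
DEFAULT_WEIGHTS = {"numeric": 1.0, "units": 0.5, "self_consistency": 0.3, "algebraic": 0.2}

def composite_loss(component_losses: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum over whichever verifier components produced a loss."""
    return sum(weights[name] * loss for name, loss in component_losses.items())
```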
The verification loss provides a signal for the policy:
```python
improved = (L_current < L_previous)
policy.update_outcome(improved, L_previous - L_current)
```
Recent improvement history influences future scaling decisions, implementing a form of meta-learning across inference iterations.
A short cross-entropy surrogate is computed over a brief span of tokens:

$$\mathcal{L}_{\text{CE}} = -\sum_t \log p_\theta(y_t \mid y_{<t})$$

Backpropagating this loss populates the `grad_h` buffers, which the SC builder can then use to construct more informed selectors. This creates a feedback path from the verification objective to the perturbation construction.
Query and Key projection layers in attention blocks are paired:
```python
q._peer = k; k._peer = q
q.is_Q = True; k.is_K = True
```

In self-attention, the paired layers coordinate their perturbation construction so that query and key projections are perturbed consistently rather than independently.
This is motivated by the observation that effective attention requires alignment between query and key representations—random independent perturbations would likely degrade this alignment.
Unlike static weights, VPS modulates computation based on the current input. For reasoning tasks, different problems activate different feature subspaces. VPS's activation-guided selection focuses perturbation on the currently-relevant subspace, potentially:
- Amplifying task-specific signal pathways
- Suppressing irrelevant dimensions that might introduce noise
The dynamic nature of the perturbation means that each input effectively sees a slightly different "effective model." This provides an implicit ensemble effect without multiple forward passes, which may improve robustness on out-of-distribution reasoning patterns.
The verification loop implements a form of search in output space:
- Generate candidate
- Evaluate via multi-objective verifier
- Update perturbation policy based on feedback
- Generate refined candidate
This resembles beam search or MCTS in that it explores multiple trajectories, but operates through perturbation of the forward pass rather than explicit tree expansion.
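An illustrative outer loop for this search (all names here are hypothetical except `policy.update_outcome`, which appears above; the actual entry point is `scripts/infer_vps.py`):

```python
def refine(model, prompt, verifier, policy, n_iters: int = 4):
    """Verification-guided iterative refinement (sketch)."""
    best_answer, best_loss, prev_loss = None, float("inf"), None
    for _ in range(n_iters):
        answer = model.generate(prompt)            # forward pass runs through VPS-patched layers
        loss = verifier.score(prompt, answer)      # multi-objective verification loss
        if prev_loss is not None:
            policy.update_outcome(loss < prev_loss, prev_loss - loss)   # feedback to the policy
        if loss < best_loss:
            best_answer, best_loss = answer, loss
        prev_loss = loss
    return best_answer
```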
The ephemeral L-BFGS component captures local curvature information. In optimization, preconditioning by the inverse Hessian converts gradient descent into Newton's method, which has superior convergence in convex regions. The heuristic application here may help the perturbation align with "informative" directions in the representation space.
```
                    VPS Architecture

  Input x ─┬─► SK/SC Builder: construct selectors U, V
           │        │
           │        ▼
           │   A = Wᵀ V,   B = W U
           │        │
           │        ▼
           │   Spectral clip:  ‖A_i‖ · ‖B_i‖ ≤ τ
           │        │
           │        ▼
           │   δ = (x A) Bᵀ
           │        │
           ▼        ▼
      y = W x  +  γ · δ
              │
              ▼
  Adaptive Policy (γ, r, order)  ◄──  token entropy, improvement history, activation energy
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `rank` | int | 2 | Base rank of low-rank perturbation |
| `topk` | int | 32 | Number of features selected by SK builder |
| `gamma` | float | 0.5 | Perturbation scaling coefficient |
| `tau` | float | 0.8 | Spectral norm clip threshold |
| `builder` | str | "hybrid" | Builder type: "sk", "sc", or "hybrid" |
| `order` | int | 1 | Order of delta expansion (1 or 2) |
| `qk_coupling` | bool | True | Enable Q/K pairing in attention |
| `lbfgs_enabled` | bool | True | Enable ephemeral L-BFGS |
| `adaptive_rank` | bool | True | Enable energy-based rank adaptation |
| `adaptive_gamma` | bool | True | Enable energy-based gamma adaptation |
| `alpha` | float | 1e-3 | Ridge regularization for SC builder |
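For reference, a sketch of how these defaults could be grouped into the configuration dataclass (field names follow the table above; the actual class lives in `config.py` and may differ):

```python
from dataclasses import dataclass

@dataclass
class VPSConfig:
    """Configuration sketch mirroring the parameter table above."""
    rank: int = 2                # base rank of low-rank perturbation
    topk: int = 32               # number of features selected by SK builder
    gamma: float = 0.5           # perturbation scaling coefficient
    tau: float = 0.8             # spectral norm clip threshold
    builder: str = "hybrid"      # "sk", "sc", or "hybrid"
    order: int = 1               # order of delta expansion (1 or 2)
    qk_coupling: bool = True     # enable Q/K pairing in attention
    lbfgs_enabled: bool = True   # enable ephemeral L-BFGS
    adaptive_rank: bool = True   # energy-based rank adaptation
    adaptive_gamma: bool = True  # energy-based gamma adaptation
    alpha: float = 1e-3          # ridge regularization for SC builder
```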
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
- Nocedal, J., & Wright, S. J. (2006). Numerical Optimization. Springer. (L-BFGS)
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
```
vps/
├── vpscore/
│   ├── vps_linear.py          # Core VPSLinear wrapper
│   ├── builders.py            # SK, SC, Hybrid builders
│   ├── policy.py              # Adaptive policy system
│   ├── math_utils.py          # Entropy, spectral utilities
│   ├── ephemeral_lbfgs.py     # Curvature memory
│   ├── hooks.py               # Activation/gradient capture
│   ├── patch_hf.py            # HuggingFace model patching
│   ├── config.py              # Configuration dataclass
│   └── verifiers/
│       └── composite_verifier.py  # Multi-objective verification
└── scripts/
    ├── infer_vps.py           # Inference with iterative refinement
    └── run_ablations.py       # Ablation study harness
```
Document prepared for technical review. All mathematical formulations derived directly from source code analysis.