1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
*.npy
*.png
!figures/*.png
*.pt
*.ods
*.json
229 changes: 228 additions & 1 deletion README.md
@@ -1 +1,228 @@
# ml-model-extraction
# Stealing the Properties of Black-Box Deep Learning Models

[![View Paper](https://img.shields.io/badge/View-Paper-blue?style=for-the-badge)](https://drive.google.com/file/d/1Ya0daEysYh4WrCZDu6l9oqQLs_eGaD8R/view?usp=sharing)

*A research toolkit for inferring hidden structure from restricted (black-box) model access.*

This repository investigates how much internal information can be recovered from machine-learning services that expose only query responses. We study both theoretical and empirical avenues for reconstructing projection matrices, hidden dimensions, and intermediate representations. The findings inform mitigations for API-protected models and complement recent analyses of production model theft [Carlini et al., 2024](#ref-carlini) and logits leakage [Finlayson et al., 2024](#ref-finlayson).


> 🏆 The headline result is the recovery of hidden dimensions of black-box autoencoder architectures (see [Singular value spectrum analysis](#singular-value-spectrum-analysis)).

---

## Overview
- Works in strict black-box regimes with only input/output access.
- Targets projection matrices, hidden dimensions, and post-normalization activations.
- Provides reproducible experiments, evaluation utilities, and visualization assets.
- Complements the [accompanying paper](https://drive.google.com/file/d/1Ya0daEysYh4WrCZDu6l9oqQLs_eGaD8R/view?usp=sharing).

## Core Contributions
- Implements two **SVD-driven recovery pipelines**:
- **Algebraic extraction** recovers projection matrices up to a nonsingular transformation.
- **Geometric extraction** recovers normalized projection matrices up to an orthogonal factor.
- Supplies tooling for spectrum analysis, hidden dimension estimation, data generation, and evaluation.
- Introduces an algorithm for reconstructing hidden representations via **Givens rotations** [<a href="#ref-shalit">Shalit & Chechik, 2013</a>], the **orthogonal <a href="#ref-wikipedia-procrustes">Procrustes</a> objective**, and **Adam** optimization.

---

## Methodology
### Algebraic extraction

<table>
<tr>
<td width="55%" valign="top">

Recovers an approximation of the projection matrix $\tilde{W}$.

Assumes access to logits and an exploitable end-of-network bottleneck where the final dimension exceeds the penultimate one.

Constructs a logits matrix $Q$ from API responses, with each column containing one logit vector.

Inspects the singular spectrum to estimate the hidden dimension $h$; a pronounced drop reveals a low-rank structure. A truncated decomposition yields a projection matrix up to a nonsingular transformation $G$.

</td>
<td width="45%" valign="top" align="center">
$\,$

$Q = U \Sigma V^{\top}$

$W H^{\top} = U \Sigma V^{\top}$

$W H^{\top} V = U \Sigma$

$G = H^{\top} V$

$\tilde{W} = W G$

$G \in \mathbb{R}^{h \times h}$

</td>
</tr>
</table>






### Geometric extraction
Normalizing layers (RMSNorm / LayerNorm) project activations onto a sphere. The resulting ellipsoidal constraints define a linear system whose solution recovers the folded projection matrix up to orthogonality:

$$
\tilde{W}_\gamma = W_\gamma O^{\top},
$$

where $W_\gamma$ folds the projection matrix with normalization weights $\gamma$ and $O$ is an orthogonal matrix.

### Representation recovery
Targets post-ReLU, normalized hidden states. Combines Givens-parameterized orthogonal matrices, an orthogonal Procrustes alignment step, and Adam to align recovered states with the ground-truth structure under sparsity and normalization constraints. The optimization variable $\theta$ encodes the Givens rotation angles. $X$ is a known matrix used during optimization, derived in Section 3.3 of the paper:

$$
X = Q^\top \cdot W_{\gamma}^{\dagger \top} = \bar{H} \cdot O^\top
$$

#### Pseudocode: Factorization algorithm
Input: $X \in \mathbb{R}^{n \times h}$, epochs $E$, learning rate $lr$, alternating steps $T$

Output: $\hat{H}$, $\hat{O}$

```python
θ = init_theta(h)

# Stage 1: gradient-based optimization of the Givens angles
for epoch in range(E):
    Ô = build_O_from_givens(θ)
    loss = objective(X, Ô)
    θ = adam_step(θ, grad(loss, θ))
    if diverged(loss):
        break

Ô = build_O_from_givens(θ)
Ĥ = X @ Ô

# Stage 2: alternating clamp/normalize and Procrustes refinement
for _ in range(T):
    Ĥ = clamp_and_normalize(Ĥ)
    Ô = procrustes(X, Ĥ)
    Ĥ = X @ Ô

Ĥ = clamp_and_normalize(Ĥ)
return Ĥ, Ô
```
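The two core helpers in the pseudocode can be sketched concretely. These are hypothetical implementations matching the pseudocode's names, not the repository's actual signatures: `build_O_from_givens` composes $h(h-1)/2$ planar rotations [Shalit &amp; Chechik, 2013] into an orthogonal matrix, and `procrustes` solves the orthogonal Procrustes problem via the SVD of $X^\top \hat{H}$.

```python
import numpy as np

def build_O_from_givens(theta, h):
    """Orthogonal h x h matrix as a product of h(h-1)/2 Givens rotations."""
    O = np.eye(h)
    k = 0
    for i in range(h):
        for j in range(i + 1, h):
            c, s = np.cos(theta[k]), np.sin(theta[k])
            G = np.eye(h)
            G[i, i], G[j, j] = c, c
            G[i, j], G[j, i] = -s, s
            O = O @ G                       # apply rotation in the (i, j) plane
            k += 1
    return O

def procrustes(X, H_hat):
    """argmin over orthogonal O of ||X O - H_hat||_F, solved as O = U V^T."""
    U, _, Vt = np.linalg.svd(X.T @ H_hat)
    return U @ Vt

# Small demo: a Givens-parameterized O is orthogonal, and Procrustes
# recovers the rotation that maps X onto X @ O.
h = 4
theta = np.linspace(0.1, 1.0, h * (h - 1) // 2)
O = build_O_from_givens(theta, h)
X = np.random.default_rng(2).standard_normal((10, h))
O_rec = procrustes(X, X @ O)
```

The Procrustes step is what makes the alternating loop cheap: each iteration needs only one $h \times h$ SVD.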

Loss function:
$$
\mathcal{L} = \lambda_{1} \mathcal{L}_1 + \lambda_{2} \mathcal{L}_2,
$$

with
$$
\begin{aligned}
\mathcal{L}_1 &= \sum_{i,j} \max(0, -\hat{H}_{ij}) \\
\mathcal{L}_2 &= \sum_{i=1}^{n} \left(\lVert \hat{H}[i,:]\rVert_2 - 1\right)^2.
\end{aligned}
$$
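The two penalty terms translate directly into NumPy; a minimal sketch of the loss as defined above, with illustrative default weights $\lambda_1 = \lambda_2 = 1$:

```python
import numpy as np

def factorization_loss(H_hat, lam1=1.0, lam2=1.0):
    """L = lam1 * L1 + lam2 * L2 from the definitions above."""
    L1 = np.sum(np.maximum(0.0, -H_hat))                      # penalize negative entries
    L2 = np.sum((np.linalg.norm(H_hat, axis=1) - 1.0) ** 2)   # penalize off-sphere rows
    return lam1 * L1 + lam2 * L2
```

The loss vanishes exactly when every entry of $\hat{H}$ is nonnegative and every row has unit norm, i.e. when the post-ReLU normalization constraints are satisfied.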

---

## Evaluation and Experiments

### Singular value spectrum analysis
- Autoencoder architectures with bottlenecks between decoder layers provide promising settings for spectral analyses across the stack.
- Over 200 experimental variants explore pretraining methods, weight-initialization schemes, activation functions, floating-point precision settings, SVD drivers, and more.
- Robust recovery of the last hidden dimension and latent size is consistently achievable under the stated assumptions.
- In several configurations, all decoder layer sizes become visible in the singular spectrum, as shown in Figure 1.

![Best spectra obtained](figures/best_result.png)

**Figure 1.** Example spectrum revealing each decoder dimension. <br>

**Key insights**
- Higher precision (`fp64`) consistently improves separability of singular values.
- Orthogonal initialization exposes spectral structure more clearly than pretrained weights.
- Iterative recovery beyond the last layer fails because intervening nonsingular transformations obscure upstream structure.

### Representation recovery metrics
| **Metric** | **Formula** |
|:------------|:-------------:|
| **Relative Frobenius Error** | $f_{\mathrm{frob}}(\bar{H}, \tilde{H}) = \frac{\lVert \bar{H} - \tilde{H} \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| **Mean Cosine Similarity** | $f_{\mathrm{cos}}(\bar{H}, \tilde{H}) = \frac{1}{k} \sum_{j=1}^{k} \frac{\bar{H}[:,j]^\top \tilde{H}[:,j]}{\lVert \bar{H}[:,j] \rVert_2 \lVert \tilde{H}[:,j] \rVert_2}$ |
| **Zero Overlap** | $f_{\mathrm{zero}}(\bar{H}, \tilde{H}) = \frac{\lvert \{(i,j) : \bar{H}[i,j] = 0 \land \tilde{H}[i,j] = 0\} \rvert}{\lvert \{(i,j) : \bar{H}[i,j] = 0\} \rvert}$ |
| **Random Baseline** | $f_{\mathrm{rand}}(\bar{H}) = \frac{1}{n} \sum_{k=1}^{n} \frac{\lVert \bar{H} - R_k \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| **Relative Reconstruction Error** | $f_{\mathrm{rel}} = \frac{\lVert \bar{H} O^{\top} - \hat{H} \hat{O}^{\top} \rVert_F}{\lVert \bar{H} O^{\top} \rVert_F}$ |



The evaluation pipeline reports cosine similarity, relative Frobenius error, zero-pattern agreement, and random baselines. Hungarian matching resolves permutation ambiguity before scoring. Cosine similarity reaches approximately 0.86 on the smallest matrices, while the best-case relative Frobenius error is around 0.6 after alignment.
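The first three metrics are straightforward to render in NumPy; this is a minimal sketch following the formulas above, not the repository's evaluation code (the zero tests use a tolerance, since recovered entries are only approximately zero):

```python
import numpy as np

def rel_frobenius(H_bar, H_tilde):
    """Relative Frobenius error between ground truth and recovery."""
    return np.linalg.norm(H_bar - H_tilde) / np.linalg.norm(H_bar)

def mean_cosine(H_bar, H_tilde):
    """Mean column-wise cosine similarity."""
    num = np.sum(H_bar * H_tilde, axis=0)
    den = np.linalg.norm(H_bar, axis=0) * np.linalg.norm(H_tilde, axis=0)
    return np.mean(num / den)

def zero_overlap(H_bar, H_tilde, tol=1e-9):
    """Fraction of ground-truth zeros also (near-)zero in the recovery."""
    zeros = np.abs(H_bar) < tol
    both = zeros & (np.abs(H_tilde) < tol)
    return both.sum() / zeros.sum()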

<p align="center">
<table style="width:100%; border-collapse:collapse;">
<tr>
<td align="center" width="50%">
<img src="figures/frobenius_error_heatmap.png" width="100%"><br>
<sub><b>Figure 2a.</b> Relative Frobenius Error Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/cosine_similarity_heatmap.png" width="100%"><br>
<sub><b>Figure 2b.</b> Mean Cosine Similarity Heatmap</sub>
</td>
</tr>
<tr>
<td align="center" width="50%">
<img src="figures/zero_overlap_heatmap.png" width="100%"><br>
<sub><b>Figure 2c.</b> Zero Overlap Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/rand_baseline_heatmap.png" width="100%"><br>
<sub><b>Figure 2d.</b> Random Baseline Heatmap</sub>
</td>
</tr>
<tr>
<td align="center" width="50%">
<img src="figures/reconstruction_heatmap.png" width="100%"><br>
<sub><b>Figure 2e.</b> Relative Reconstruction Error Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/computation_time_heatmap.png" width="100%"><br>
<sub><b>Figure 2f.</b> Computation Times Heatmap</sub>
</td>
</tr>
</table>
<p align="center"><b>Figure 2.</b> Summary of experimental results.</p>
</p>

---

## Notation

| Symbol | Description | Shape |
|--------|-------------|-------|
| $n$ | Number of queries | $-$ |
| $V$ | Vocabulary size (logits dimension) | $-$ |
| $h$ | Ground truth hidden dimension | $-$ |
| $Q$ | Matrix of queried logits | $V \times n$ |
| $A^\dagger$ | Moore-Penrose pseudoinverse; for full column rank, $A^\dagger = \left(A^{\top} A\right)^{-1} A^{\top}$ | shape of $A^\top$ |
| $U, \Sigma, V$ | SVD factors of $Q$ | $V \times h$, $h \times h$, $n \times h$ |
| $W$ | Projection matrix | $V \times h$ |
| $\tilde{W}$ | Recovered projection matrix | $V \times h$ |
| $H$ | Hidden representations corresponding to $Q$ | $n \times h$ |
| $G$ | Unknown nonsingular matrix | $h \times h$ |
| $W_\gamma$ | Projection matrix folded with normalization weights | $V \times h$ |
| $O$ | Unknown orthogonal alignment matrix | $h \times h$ |
| $\bar{H}$ | Ground-truth *normalized* hidden representations | $n \times h$ |
| $X$ | Known matrix: $X = Q^\top W_{\gamma}^{\dagger \top} = \bar{H} O^\top$ | $n \times h$ |
| $\theta$ | Vector of Givens rotation angles | $\frac{h \cdot (h - 1)}{2} \times 1$ |
| $\hat{H}$ | Recovered hidden representations | $n \times h$ |
| $\hat{O}$ | Estimated orthogonal factor | $h \times h$ |

---

## References

- <a id="ref-carlini"></a>Carlini, N., Paleka, D., Dj Dvijotham, K., Steinke, T., Hayase, J., Cooper, A. F., Lee, K., Jagielski, M., Nasr, M., Conmy, A., Yona, I., Wallace, E., Rolnick, D., & Tramèr, F. (2024). *Stealing Part of a Production Language Model*. arXiv:2403.06634. https://arxiv.org/abs/2403.06634
- <a id="ref-finlayson"></a>Finlayson, M., Ren, X., & Swayamdipta, S. (2024). *Logits of API-Protected LLMs Leak Proprietary Information*. arXiv:2403.09539. https://arxiv.org/abs/2403.09539

- <a id="ref-wikipedia-procrustes"></a>Wikipedia contributors. (2023). Orthogonal Procrustes problem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem
- <a id="ref-shalit"></a>Shalit, U., & Chechik, G. (2013). Efficient coordinate-descent for orthogonal matrices through Givens rotations. arXiv:1312.0624. https://arxiv.org/abs/1312.0624
---
Binary file added figures/best_result.png
Binary file added figures/computation_time_heatmap.png
Binary file added figures/cosine_similarity_heatmap.png
Binary file added figures/frobenius_error_heatmap.png
Binary file added figures/rand_baseline_heatmap.png
Binary file added figures/reconstruction_heatmap.png
Binary file added figures/zero_overlap_heatmap.png