1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
*.npy
*.png
!figures/*.png
*.pt
*.ods
*.json
229 changes: 228 additions & 1 deletion README.md
@@ -1 +1,228 @@
# ml-model-extraction
# Stealing the Properties of Black-Box Deep Learning Models

[![View Paper](https://img.shields.io/badge/View-Paper-blue?style=for-the-badge)](https://drive.google.com/file/d/1Ya0daEysYh4WrCZDu6l9oqQLs_eGaD8R/view?usp=sharing)

*A research toolkit for inferring hidden structure from restricted (black-box) model access.*

This repository investigates how much internal information can be recovered from machine-learning services that expose only query responses. We study both theoretical and empirical avenues for reconstructing projection matrices, hidden dimensions, and intermediate representations. The findings inform mitigations for API-protected models and complement recent analyses of production model theft [Carlini et al., 2024](#ref-carlini) and logits leakage [Finlayson et al., 2024](#ref-finlayson).


> 🏆 The headline result is the recovery of hidden dimensions of black-box autoencoder architectures (see [Singular value spectrum analysis](#singular-value-spectrum-analysis)).

---

## Overview
- Works in strict black-box regimes with only input/output access.
- Targets projection matrices, hidden dimensions, and post-normalization activations.
- Provides reproducible experiments, evaluation utilities, and visualization assets.
- Complements the [accompanying paper](https://drive.google.com/file/d/1Ya0daEysYh4WrCZDu6l9oqQLs_eGaD8R/view?usp=sharing).

## Core Contributions
- Implements two **SVD-driven recovery pipelines**:
- **Algebraic extraction** recovers projection matrices up to a nonsingular transformation.
- **Geometric extraction** recovers normalized projection matrices up to an orthogonal factor.
- Supplies tooling for spectrum analysis, hidden dimension estimation, data generation, and evaluation.
- Introduces an algorithm for reconstructing hidden representations via **Givens rotations** [<a href="#ref-shalit">Shalit & Chechik, 2013</a>], the **orthogonal <a href="#ref-wikipedia-procrustes">Procrustes</a> objective**, and **Adam** optimization.

---

## Methodology
### Algebraic extraction

<table>
<tr>
<td width="55%" valign="top">

Recovers an approximation of the projection matrix $\tilde{W}$.

Assumes access to logits and an exploitable end-of-network bottleneck where the final dimension exceeds the penultimate one.

Constructs a logits matrix $Q$ from API responses, with each column containing one logit vector.

Inspects the singular spectrum to estimate the hidden dimension $h$; a pronounced drop reveals a low-rank structure. A truncated decomposition yields a projection matrix up to a nonsingular transformation $G$.

</td>
<td width="45%" valign="top" align="center">
$\,$

$Q = U \Sigma V^{\top}$

$W H^{\top} = U \Sigma V^{\top}$

$W H^{\top} V = U \Sigma$

$G = H^{\top} V$

$\tilde{W} = W G$

$G \in \mathbb{R}^{h \times h}$

</td>
</tr>
</table>






### Geometric extraction
Normalizing layers (RMSNorm / LayerNorm) project activations onto a sphere. The resulting ellipsoidal constraints define a linear system whose solution recovers the folded projection matrix up to orthogonality:

$$
\tilde{W}_\gamma = W_\gamma O^{\top},
$$

where $W_\gamma$ folds the projection matrix with normalization weights $\gamma$ and $O$ is an orthogonal matrix.

### Representation recovery
Targets post-ReLU, normalized hidden states. Combines Givens-parameterized orthogonal matrices, an orthogonal Procrustes alignment step, and Adam to align recovered states with the ground-truth structure under sparsity and normalization constraints. The optimization variable $\theta$ encodes the Givens rotation angles. $X$ is a known matrix used during optimization, derived in Section 3.3 of the paper:

$$
X = Q^\top \cdot W_{\gamma}^{\dagger \top} = \bar{H} \cdot O^\top
$$

#### Pseudocode: Factorization algorithm
Input: $X \in \mathbb{R}^{n \times h}$, epochs $E$, learning rate $lr$, alternating steps $T$

Output: $\hat{H}$, $\hat{O}$

```python
θ = init_theta(h)

# Stage 1: gradient-based optimization of the Givens angles
for epoch in range(E):
    Ô = build_O_from_givens(θ)
    loss = objective(X, Ô)
    θ = adam_step(θ, grad(loss, θ))
    if diverged(loss):
        break

Ô = build_O_from_givens(θ)
Ĥ = X @ Ô

# Stage 2: alternating clamp/normalize and Procrustes refinement
for _ in range(T):
    Ĥ = clamp_and_normalize(Ĥ)
    Ô = procrustes(X, Ĥ)
    Ĥ = X @ Ô

Ĥ = clamp_and_normalize(Ĥ)
return Ĥ, Ô
```
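The two core helpers in the pseudocode can be sketched concretely. These are hypothetical implementations matching the pseudocode's names, not the repository's actual signatures: `build_O_from_givens` composes $h(h-1)/2$ planar rotations [Shalit &amp; Chechik, 2013] into an orthogonal matrix, and `procrustes` solves the orthogonal Procrustes problem via the SVD of $X^\top \hat{H}$.

```python
import numpy as np

def build_O_from_givens(theta, h):
    """Orthogonal h x h matrix as a product of h(h-1)/2 Givens rotations."""
    O = np.eye(h)
    k = 0
    for i in range(h):
        for j in range(i + 1, h):
            c, s = np.cos(theta[k]), np.sin(theta[k])
            G = np.eye(h)
            G[i, i], G[j, j] = c, c
            G[i, j], G[j, i] = -s, s
            O = O @ G                       # apply rotation in the (i, j) plane
            k += 1
    return O

def procrustes(X, H_hat):
    """argmin over orthogonal O of ||X O - H_hat||_F, solved as O = U V^T."""
    U, _, Vt = np.linalg.svd(X.T @ H_hat)
    return U @ Vt

# Small demo: a Givens-parameterized O is orthogonal, and Procrustes
# recovers the rotation that maps X onto X @ O.
h = 4
theta = np.linspace(0.1, 1.0, h * (h - 1) // 2)
O = build_O_from_givens(theta, h)
X = np.random.default_rng(2).standard_normal((10, h))
O_rec = procrustes(X, X @ O)
```

The Procrustes step is what makes the alternating loop cheap: each iteration needs only one $h \times h$ SVD.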

Loss function:
$$
\mathcal{L} = \lambda_{1} \mathcal{L}_1 + \lambda_{2} \mathcal{L}_2,
$$

with
$$
\begin{aligned}
\mathcal{L}_1 &= \sum_{i,j} \max(0, -\hat{H}_{ij}) \\
\mathcal{L}_2 &= \sum_{i=1}^{n} \left(\lVert \hat{H}[i,:]\rVert_2 - 1\right)^2.
\end{aligned}
$$
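The two penalty terms translate directly into NumPy; a minimal sketch of the loss as defined above, with illustrative default weights $\lambda_1 = \lambda_2 = 1$:

```python
import numpy as np

def factorization_loss(H_hat, lam1=1.0, lam2=1.0):
    """L = lam1 * L1 + lam2 * L2 from the definitions above."""
    L1 = np.sum(np.maximum(0.0, -H_hat))                      # penalize negative entries
    L2 = np.sum((np.linalg.norm(H_hat, axis=1) - 1.0) ** 2)   # penalize off-sphere rows
    return lam1 * L1 + lam2 * L2
```

The loss vanishes exactly when every entry of $\hat{H}$ is nonnegative and every row has unit norm, i.e. when the post-ReLU normalization constraints are satisfied.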

---

## Evaluation and Experiments

### Singular value spectrum analysis
- Autoencoder architectures with bottlenecks between decoder layers provide promising settings for spectral analyses across the stack.
- Over 200 experimental variants explore pretraining methods, weight-initialization schemes, activation functions, floating-point precision settings, SVD drivers, and more.
- Robust recovery of the last hidden dimension and latent size is consistently achievable under the stated assumptions.
- In several configurations, all decoder layer sizes become visible in the singular spectrum, as shown in Figure 1.

![Best spectra obtained](figures/best_result.png)

**Figure 1.** Example spectrum revealing each decoder dimension. <br>

**Key insights**
- Higher precision (`fp64`) consistently improves separability of singular values.
- Orthogonal initialization exposes spectral structure more clearly than pretrained weights.
- Iterative recovery beyond the last layer fails because intervening nonsingular transformations obscure upstream structure.

### Representation recovery metrics
| **Metric** | **Formula** |
|:------------|:-------------:|
| **Relative Frobenius Error** | $f_{\mathrm{frob}}(\bar{H}, \tilde{H}) = \frac{\lVert \bar{H} - \tilde{H} \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| **Mean Cosine Similarity** | $f_{\mathrm{cos}}(\bar{H}, \tilde{H}) = \frac{1}{k} \sum_{j=1}^{k} \frac{\bar{H}[:,j]^\top \tilde{H}[:,j]}{\lVert \bar{H}[:,j] \rVert_2 \lVert \tilde{H}[:,j] \rVert_2}$ |
| **Zero Overlap** | $f_{\mathrm{zero}}(\bar{H}, \tilde{H}) = \frac{\lvert \{(i,j) : \bar{H}[i,j] = 0 \land \tilde{H}[i,j] = 0\} \rvert}{\lvert \{(i,j) : \bar{H}[i,j] = 0\} \rvert}$ |
| **Random Baseline** | $f_{\mathrm{rand}}(\bar{H}) = \frac{1}{n} \sum_{k=1}^{n} \frac{\lVert \bar{H} - R_k \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| **Relative Reconstruction Error** | $f_{\mathrm{rel}} = \frac{\lVert \bar{H} O^{\top} - \hat{H} \hat{O}^{\top} \rVert_F}{\lVert \bar{H} O^{\top} \rVert_F}$ |



The evaluation pipeline reports cosine similarity, relative Frobenius error, zero-pattern agreement, and random baselines. Hungarian matching resolves permutation ambiguity before scoring. Cosine similarity reaches approximately 0.86 on the smallest matrices, while the best-case relative Frobenius error is around 0.6 after alignment.
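The first three metrics are straightforward to render in NumPy; this is a minimal sketch following the formulas above, not the repository's evaluation code (the zero tests use a tolerance, since recovered entries are only approximately zero):

```python
import numpy as np

def rel_frobenius(H_bar, H_tilde):
    """Relative Frobenius error between ground truth and recovery."""
    return np.linalg.norm(H_bar - H_tilde) / np.linalg.norm(H_bar)

def mean_cosine(H_bar, H_tilde):
    """Mean column-wise cosine similarity."""
    num = np.sum(H_bar * H_tilde, axis=0)
    den = np.linalg.norm(H_bar, axis=0) * np.linalg.norm(H_tilde, axis=0)
    return np.mean(num / den)

def zero_overlap(H_bar, H_tilde, tol=1e-9):
    """Fraction of ground-truth zeros also (near-)zero in the recovery."""
    zeros = np.abs(H_bar) < tol
    both = zeros & (np.abs(H_tilde) < tol)
    return both.sum() / zeros.sum()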

<p align="center">
<table style="width:100%; border-collapse:collapse;">
<tr>
<td align="center" width="50%">
<img src="figures/frobenius_error_heatmap.png" width="100%"><br>
<sub><b>Figure 2a.</b> Relative Frobenius Error Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/cosine_similarity_heatmap.png" width="100%"><br>
<sub><b>Figure 2b.</b> Mean Cosine Similarity Heatmap</sub>
</td>
</tr>
<tr>
<td align="center" width="50%">
<img src="figures/zero_overlap_heatmap.png" width="100%"><br>
<sub><b>Figure 2c.</b> Zero Overlap Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/rand_baseline_heatmap.png" width="100%"><br>
<sub><b>Figure 2d.</b> Random Baseline Heatmap</sub>
</td>
</tr>
<tr>
<td align="center" width="50%">
<img src="figures/reconstruction_heatmap.png" width="100%"><br>
<sub><b>Figure 2e.</b> Relative Reconstruction Error Heatmap</sub>
</td>
<td align="center" width="50%">
<img src="figures/computation_time_heatmap.png" width="100%"><br>
<sub><b>Figure 2f.</b> Computation Times Heatmap</sub>
</td>
</tr>
</table>
<p align="center"><b>Figure 2.</b> Summary of experimental results.</p>
</p>

---

## Notation

| Symbol | Description | Shape |
|--------|-------------|-------|
| $n$ | Number of queries | $-$ |
| $V$ | Vocabulary size (logits dimension) | $-$ |
| $h$ | Ground truth hidden dimension | $-$ |
| $Q$ | Matrix of queried logits | $V \times n$ |
| $A^\dagger$ | Moore-Penrose pseudoinverse; for full column rank, $A^\dagger = \left(A^{\top} A\right)^{-1} A^{\top}$ | shape of $A^\top$ |
| $U, \Sigma, V$ | SVD factors of $Q$ | $V \times h$, $h \times h$, $n \times h$ |
| $W$ | Projection matrix | $V \times h$ |
| $\tilde{W}$ | Recovered projection matrix | $V \times h$ |
| $H$ | Hidden representations corresponding to $Q$ | $n \times h$ |
| $G$ | Unknown nonsingular matrix | $h \times h$ |
| $W_\gamma$ | Projection matrix folded with normalization weights | $V \times h$ |
| $O$ | Unknown orthogonal alignment matrix | $h \times h$ |
| $\bar{H}$ | Ground-truth *normalized* hidden representations | $n \times h$ |
| $X$ | Known matrix: $X = Q^\top W_{\gamma}^{\dagger \top} = \bar{H} O^\top$ | $n \times h$ |
| $\theta$ | Vector of Givens rotation angles | $\frac{h \cdot (h - 1)}{2} \times 1$ |
| $\hat{H}$ | Recovered hidden representations | $n \times h$ |
| $\hat{O}$ | Estimated orthogonal factor | $h \times h$ |

---

## References

- <a id="ref-carlini"></a>Carlini, N., Paleka, D., Dj Dvijotham, K., Steinke, T., Hayase, J., Cooper, A. F., Lee, K., Jagielski, M., Nasr, M., Conmy, A., Yona, I., Wallace, E., Rolnick, D., & Tramèr, F. (2024). *Stealing Part of a Production Language Model*. arXiv:2403.06634. https://arxiv.org/abs/2403.06634
- <a id="ref-finlayson"></a>Finlayson, M., Ren, X., & Swayamdipta, S. (2024). *Logits of API-Protected LLMs Leak Proprietary Information*. arXiv:2403.09539. https://arxiv.org/abs/2403.09539

- <a id="ref-wikipedia-procrustes"></a>Wikipedia contributors. (2023). Orthogonal Procrustes problem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem
- <a id="ref-shalit"></a>Shalit, U., & Chechik, G. (2013). Efficient coordinate-descent for orthogonal matrices through Givens rotations. arXiv:1312.0624. https://arxiv.org/abs/1312.0624
---
Binary file added figures/best_result.png
Binary file added figures/computation_time_heatmap.png
Binary file added figures/cosine_similarity_heatmap.png
Binary file added figures/frobenius_error_heatmap.png
Binary file added figures/rand_baseline_heatmap.png
Binary file added figures/reconstruction_heatmap.png
Binary file added figures/zero_overlap_heatmap.png