Stealing the Properties of Black-Box Deep Learning Models

View Paper

A research toolkit for inferring hidden structure from restricted (black-box) model access.

This repository investigates how much internal information can be recovered from machine-learning services that expose only query responses. We study both theoretical and empirical avenues for reconstructing projection matrices, hidden dimensions, and intermediate representations. The findings inform mitigations for API-protected models and complement recent analyses of production model theft (Carlini et al., 2024) and logit leakage (Finlayson et al., 2024).

🏆 The biggest achievement is the ability to recover hidden dimensions of black-box autoencoder architectures (see Singular value spectrum analysis).


Overview

  • Works in strict black-box regimes with only input/output access.
  • Targets projection matrices, hidden dimensions, and post-normalization activations.
  • Provides reproducible experiments, evaluation utilities, and visualization assets.
  • Complements the accompanying paper (link here).

Core Contributions

  • Implements two SVD-driven recovery pipelines:
    • Algebraic extraction recovers projection matrices up to a nonsingular transformation.
    • Geometric extraction recovers normalized projection matrices up to an orthogonal factor.
  • Supplies tooling for spectrum analysis, hidden dimension estimation, data generation, and evaluation.
  • Introduces an algorithm for reconstructing hidden representations via Givens rotations [Shalit & Chechik, 2013], the orthogonal Procrustes objective, and Adam optimization.

Methodology

Algebraic extraction

Recovers an approximation of the projection matrix $\tilde{W}$.

Assumes access to logits and an exploitable end-of-network bottleneck where the final dimension exceeds the penultimate one.

Constructs a logits matrix $Q$ from API responses, with each column containing one logit vector.

Inspects the singular spectrum to estimate the hidden dimension $h$; a pronounced drop reveals a low-rank structure. A truncated decomposition yields a projection matrix up to a nonsingular transformation $G$.

$$ \begin{aligned} Q &= U \Sigma V^{\top} \\ W H^{\top} &= U \Sigma V^{\top} \\ W H^{\top} V &= U \Sigma \\ G &= H^{\top} V, \qquad G \in \mathbb{R}^{h \times h} \\ \tilde{W} &= W G = U \Sigma \end{aligned} $$
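As a concrete illustration, the sketch below runs this pipeline on synthetic data: it forms a low-rank logits matrix, estimates $h$ from the largest drop in the singular spectrum, and builds $\tilde{W} = U \Sigma$ from the truncated factors. The variable names and the drop heuristic are illustrative assumptions, not the repository's exact API.

```python
# Minimal sketch of algebraic extraction on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 64, 8, 256                 # vocab size, hidden dim, number of queries

W = rng.standard_normal((V, h))      # unknown projection matrix
H = rng.standard_normal((n, h))      # unknown hidden representations
Q = W @ H.T                          # observed logits matrix, one column per query

# Inspect the singular spectrum: a pronounced drop marks the hidden dimension.
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
drops = S[:-1] / np.maximum(S[1:], 1e-12)
h_hat = int(np.argmax(drops)) + 1    # index after which the singular values collapse
print("estimated hidden dimension:", h_hat)

# Truncated decomposition yields W up to an unknown nonsingular matrix G.
W_tilde = U[:, :h_hat] * S[:h_hat]   # equals W @ G with G = H^T V_h
```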

Geometric extraction

Normalizing layers (RMSNorm / LayerNorm) project activations onto a sphere. The resulting ellipsoidal constraints define a linear system whose solution recovers the folded projection matrix up to orthogonality:

$$ \tilde{W}_\gamma = W_\gamma O^{\top}, $$

where $W_\gamma$ folds the projection matrix with normalization weights $\gamma$ and $O$ is an orthogonal matrix.
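A minimal sketch of this idea on synthetic data follows, assuming the normalized hidden states have exactly unit norm and the logits carry no bias. Working in the rank-$h$ column space of $Q$, every query contributes one linear constraint on a symmetric matrix $M$; factoring the least-squares solution yields $\tilde{W}_\gamma = W_\gamma O^{\top}$. The helper names, the matrix $M$, and the synthetic dimensions are illustrative, not the repository's implementation.

```python
# Hedged sketch of geometric extraction on synthetic data: recover W_gamma up to
# an orthogonal factor, assuming q = W_gamma @ h_bar with ||h_bar||_2 = 1 and no bias.
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 64, 8, 2000                       # vocab size, hidden dim, number of queries

W_gamma = rng.standard_normal((V, h))       # ground-truth folded projection (unknown to attacker)
H_bar = rng.standard_normal((n, h))
H_bar /= np.linalg.norm(H_bar, axis=1, keepdims=True)   # points on the unit sphere
Q = W_gamma @ H_bar.T                       # observed logits, shape (V, n)

# 1) Work inside the rank-h column space of Q.
U_h = np.linalg.svd(Q, full_matrices=False)[0][:, :h]
Z = U_h.T @ Q                               # (h, n) low-dimensional coordinates

# 2) Each query gives a linear constraint z^T M z = 1 on a symmetric matrix M.
idx_i, idx_j = np.triu_indices(h)
scale = np.where(idx_i == idx_j, 1.0, 2.0)  # off-diagonal unknowns appear twice in z^T M z
A = np.empty((n, idx_i.size))
for k in range(n):
    outer = np.outer(Z[:, k], Z[:, k])
    A[k] = scale * outer[idx_i, idx_j]
m, *_ = np.linalg.lstsq(A, np.ones(n), rcond=None)

M = np.zeros((h, h))
M[idx_i, idx_j] = m
M = M + M.T - np.diag(np.diag(M))

# 3) Factor M = B^T B; any such B differs from the true transform by an orthogonal O.
evals, evecs = np.linalg.eigh(M)
B = np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

# 4) Recovered projection, equal to W_gamma @ O^T up to numerical error.
W_tilde = U_h @ np.linalg.inv(B)

# Sanity check on an orthogonal-invariant quantity: W W^T is preserved.
rel = np.linalg.norm(W_tilde @ W_tilde.T - W_gamma @ W_gamma.T) / np.linalg.norm(W_gamma @ W_gamma.T)
print("relative error of W_gamma W_gamma^T:", rel)
```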

Representation recovery

Targets post-ReLU, normalized hidden states. Combines Givens-parameterized orthogonal matrices, an orthogonal Procrustes alignment step, and Adam to align recovered states with the ground-truth structure under sparsity and normalization constraints. The optimization variable $\theta$ encodes the Givens rotation angles, and $X$ is a known matrix used during optimization, derived in Section 3.3 of the paper.

$$ X = Q^\top \cdot W_{\gamma}^{\dagger \top} = \bar{H} \cdot O^\top $$

Pseudocode: Factorization algorithm

Input: $X \in \mathbb{R}^{n \times h}$, epochs $E$, learning rate $lr$, alternating steps $T$

Output: $\hat{H}$, $\hat{O}$

θ = init_theta(h)

# Phase 1: gradient-based search over the Givens rotation angles with Adam.
for epoch in range(E):
    Ô = build_O_from_givens(θ)
    loss = objective(X, Ô)
    θ = adam_step(θ, grad(loss, θ))
    if diverged(loss):
        break

Ô = build_O_from_givens(θ)
Ĥ = X @ Ô

# Phase 2: alternating refinement with orthogonal Procrustes.
for _ in range(T):
    Ĥ = clamp_and_normalize(Ĥ)   # enforce nonnegativity and unit row norms
    Ô = procrustes(X, Ĥ)
    Ĥ = X @ Ô

Ĥ = clamp_and_normalize(Ĥ)
return Ĥ, Ô

Loss function:

$$ \mathcal{L} = \lambda_{1} \mathcal{L}_1 + \lambda_{2} \mathcal{L}_2, $$

with

$$ \begin{aligned} \mathcal{L}_1 &= \sum_{i,j} \max(0, -\hat{H}_{ij}) \\ \mathcal{L}_2 &= \sum_{i=1}^{n} \left(\lVert \hat{H}[i,:]\rVert_2 - 1\right)^2. \end{aligned} $$
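For concreteness, the following is a hedged PyTorch sketch of the two-phase procedure above: a gradient phase over Givens angles optimized with Adam, followed by alternating clamp-and-normalize / Procrustes refinement. Function names, hyperparameter defaults, and the synthetic demo are assumptions for illustration and may differ from the repository's implementation.

```python
# Hedged PyTorch sketch of the factorization pseudocode above (illustrative only).
import torch

def build_O_from_givens(theta, h):
    """Compose an h x h orthogonal matrix from h*(h-1)/2 Givens rotation angles."""
    O = torch.eye(h, dtype=theta.dtype)
    k = 0
    for i in range(h):
        for j in range(i + 1, h):
            c, s = torch.cos(theta[k]), torch.sin(theta[k])
            G = torch.eye(h, dtype=theta.dtype)
            G[i, i], G[j, j] = c, c
            G[i, j], G[j, i] = -s, s
            O = O @ G
            k += 1
    return O

def objective(X, O, lam1=1.0, lam2=1.0):
    """Loss from above: penalize negative entries and non-unit row norms of H = X O."""
    H = X @ O
    l1 = torch.clamp(-H, min=0).sum()
    l2 = ((H.norm(dim=1) - 1.0) ** 2).sum()
    return lam1 * l1 + lam2 * l2

def clamp_and_normalize(H):
    H = torch.clamp(H, min=0)
    return H / H.norm(dim=1, keepdim=True).clamp_min(1e-12)

def procrustes(X, H):
    """Orthogonal Procrustes: the O minimizing ||X O - H||_F is U V^T from the SVD of X^T H."""
    U, _, Vh = torch.linalg.svd(X.T @ H)
    return U @ Vh

def factorize(X, h, epochs=200, lr=1e-2, T=20):
    theta = torch.zeros(h * (h - 1) // 2, dtype=X.dtype, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(epochs):                       # Phase 1: gradient search over angles
        opt.zero_grad()
        loss = objective(X, build_O_from_givens(theta, h))
        loss.backward()
        opt.step()
    with torch.no_grad():
        O = build_O_from_givens(theta, h)
        H = X @ O
        for _ in range(T):                        # Phase 2: alternating Procrustes refinement
            H = clamp_and_normalize(H)
            O = procrustes(X, H)
            H = X @ O
        return clamp_and_normalize(H), O

# Synthetic usage example (hypothetical dimensions): X = H_bar @ O_true^T.
n, h = 512, 6
H_bar = torch.relu(torch.randn(n, h, dtype=torch.float64))
H_bar[H_bar.sum(dim=1) == 0, 0] = 1.0             # avoid all-zero rows before normalizing
H_bar = H_bar / H_bar.norm(dim=1, keepdim=True)
O_true = torch.linalg.qr(torch.randn(h, h, dtype=torch.float64)).Q
X = H_bar @ O_true.T
H_hat, O_hat = factorize(X, h)
# H_hat may still differ from H_bar by a column permutation; the evaluation section
# below resolves this with Hungarian matching before scoring.
print("constraint violation of X @ O_hat:", objective(X, O_hat).item())
```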


Evaluation and Experiments

Singular value spectrum analysis

  • Autoencoder architectures with bottlenecks between decoder layers provide promising settings for spectral analyses across the stack.
  • Over 200 experimental variants explore pretraining methods, weight initialization algorithms, activation functions, floating-point precision settings, SVD drivers, and more.
  • Robust recovery of the last hidden dimension and latent size is consistently achievable under the stated assumptions.
  • In several configurations, all decoder layer sizes become visible in the singular spectrum, as shown in Figure 1.

Figure 1. Best spectra obtained: an example spectrum revealing each decoder dimension.

Key insights

  • Higher precision (fp64) consistently improves separability of singular values (see the sketch after this list).
  • Orthogonal initialization exposes spectral structure more clearly than pretrained weights.
  • Iterative recovery beyond the last layer fails because intervening nonsingular transformations obscure upstream structure.
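A quick illustration of the precision effect (arbitrary synthetic dimensions, not the repository's experimental configuration): compute the same low-rank logits matrix in fp32 and fp64 and compare the singular-value gap at the true hidden dimension.

```python
# Illustrative only: the gap between the h-th and (h+1)-th singular values is
# limited by floating-point round-off, so fp64 separates the spectrum far better.
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 512, 32, 1024
W = rng.standard_normal((V, h))
H = rng.standard_normal((n, h))

for dtype in (np.float32, np.float64):
    Q = W.astype(dtype) @ H.T.astype(dtype)
    S = np.linalg.svd(Q, compute_uv=False)
    print(np.dtype(dtype).name, "gap S[h-1]/S[h]:", S[h - 1] / S[h])
```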

Representation recovery metrics

| Metric | Formula |
| --- | --- |
| Relative Frobenius Error | $f_{\mathrm{frob}}(\bar{H}, \tilde{H}) = \frac{\lVert \bar{H} - \tilde{H} \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| Mean Cosine Similarity | $f_{\mathrm{cos}}(\bar{H}, \tilde{H}) = \frac{1}{k} \sum_{j=1}^{k} \frac{\bar{H}[:,j]^\top \tilde{H}[:,j]}{\lVert \bar{H}[:,j] \rVert_2 \lVert \tilde{H}[:,j] \rVert_2}$ |
| Zero Overlap | $f_{\mathrm{zero}}(\bar{H}, \tilde{H}) = \frac{\lvert \{(i,j) : \bar{H}[i,j] = 0 \land \tilde{H}[i,j] = 0\} \rvert}{\lvert \{(i,j) : \bar{H}[i,j] = 0\} \rvert}$ |
| Random Baseline | $f_{\mathrm{rand}}(\bar{H}) = \frac{1}{n} \sum_{k=1}^{n} \frac{\lVert \bar{H} - R_k \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| Relative Reconstruction Error | $f_{\mathrm{rel}} = \frac{\lVert \bar{H} O^{\top} - \hat{H} \hat{O}^{\top} \rVert_F}{\lVert \bar{H} O^{\top} \rVert_F}$ |

The evaluation pipeline reports cosine similarity, relative Frobenius error, zero-pattern agreement, and random baselines. Hungarian matching resolves permutation ambiguity before scoring. Cosine similarity reaches approximately 0.86 on the smallest matrices, while the best-case relative Frobenius error is around 0.6 after alignment.
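A minimal sketch of the scoring step follows, assuming a cosine-similarity cost for the Hungarian column matching (the repository's exact matching cost is not specified here), with the first three metrics from the table above.

```python
# Hedged sketch of the scoring step; function names and the matching cost are
# illustrative assumptions, not the repository's exact code.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_columns(H_true, H_rec):
    """Permute columns of H_rec to best align with H_true (resolves permutation ambiguity)."""
    Ht = H_true / (np.linalg.norm(H_true, axis=0, keepdims=True) + 1e-12)
    Hr = H_rec / (np.linalg.norm(H_rec, axis=0, keepdims=True) + 1e-12)
    cost = -(Ht.T @ Hr)                      # negative cosine similarity between columns
    _, cols = linear_sum_assignment(cost)
    return H_rec[:, cols]

def relative_frobenius_error(H_true, H_rec):
    return np.linalg.norm(H_true - H_rec) / np.linalg.norm(H_true)

def mean_cosine_similarity(H_true, H_rec):
    num = np.sum(H_true * H_rec, axis=0)
    den = np.linalg.norm(H_true, axis=0) * np.linalg.norm(H_rec, axis=0) + 1e-12
    return float(np.mean(num / den))

def zero_overlap(H_true, H_rec, tol=1e-8):
    true_zeros = np.abs(H_true) < tol
    both_zero = true_zeros & (np.abs(H_rec) < tol)
    return both_zero.sum() / max(true_zeros.sum(), 1)
```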


Figure 2. Summary of experimental results: (a) relative Frobenius error, (b) mean cosine similarity, (c) zero overlap, (d) random baseline, (e) relative reconstruction error, and (f) computation time heatmaps.


Notation

| Symbol | Description | Shape |
| --- | --- | --- |
| $n$ | Number of queries | $-$ |
| $V$ | Vocabulary size (logits dimension) | $-$ |
| $h$ | Ground-truth hidden dimension | $-$ |
| $Q$ | Matrix of queried logits | $V \times n$ |
| $A^\dagger$ | Moore-Penrose inverse: $A^\dagger = \left(A^{\top} A\right)^{-1} A^{\top}$ | shape of $A^\top$ |
| $U, \Sigma, V$ | SVD factors of $Q$ | $V \times h$, $h \times h$, $n \times h$ |
| $W$ | Projection matrix | $V \times h$ |
| $\tilde{W}$ | Recovered projection matrix | $V \times h$ |
| $H$ | Hidden representations corresponding to $Q$ | $n \times h$ |
| $G$ | Unknown nonsingular matrix | $h \times h$ |
| $W_\gamma$ | Projection matrix folded with normalization weights | $V \times h$ |
| $O$ | Unknown orthogonal alignment matrix | $h \times h$ |
| $\bar{H}$ | Ground-truth normalized hidden representations | $n \times h$ |
| $X$ | Known matrix, product of $\left(Q^\top, W_{\gamma}^{\dagger \top}\right)$ or $\left(\bar{H}, O^\top\right)$ | $n \times h$ |
| $\theta$ | Vector of Givens rotation angles | $\frac{h (h - 1)}{2} \times 1$ |
| $\hat{H}$ | Recovered hidden representations | $n \times h$ |
| $\hat{O}$ | Estimated orthogonal factor | $h \times h$ |

References

  • Carlini et al., 2024. Stealing Part of a Production Language Model.
  • Finlayson et al., 2024. Logits of API-Protected LLMs Leak Proprietary Information.
  • Shalit & Chechik, 2013. Coordinate-Descent for Learning Orthogonal Matrices through Givens Rotations.
