Stealing the Properties of Black-Box Deep Learning Models

View Paper

A research toolkit for inferring hidden structure from restricted (black-box) model access.

This repository investigates how much internal information can be recovered from machine-learning services that expose only query responses. We study both theoretical and empirical avenues for reconstructing projection matrices, hidden dimensions, and intermediate representations. The findings inform mitigations for API-protected models and complement recent analyses of production model theft (Carlini et al., 2024) and logit leakage (Finlayson et al., 2024).

🏆 The biggest achievement is the ability to recover hidden dimensions of black-box autoencoder architectures (see Singular value spectrum analysis).


Overview

  • Works in strict black-box regimes with only input/output access.
  • Targets projection matrices, hidden dimensions, and post-normalization activations.
  • Provides reproducible experiments, evaluation utilities, and visualization assets.
  • Complements the accompanying paper (link here).

Core Contributions

  • Implements two SVD-driven recovery pipelines:
    • Algebraic extraction recovers projection matrices up to a nonsingular transformation.
    • Geometric extraction recovers normalized projection matrices up to an orthogonal factor.
  • Supplies tooling for spectrum analysis, hidden dimension estimation, data generation, and evaluation.
  • Introduces an algorithm for reconstructing hidden representations via Givens rotations [Shalit & Chechik, 2013], the orthogonal Procrustes objective, and Adam optimization.

Methodology

Algebraic extraction

Recovers an approximation of the projection matrix $\tilde{W}$.

Assumes access to logits and an exploitable end-of-network bottleneck where the final dimension exceeds the penultimate one.

Constructs a logits matrix $Q$ from API responses, with each column containing one logit vector.

Inspects the singular spectrum to estimate the hidden dimension $h$; a pronounced drop reveals a low-rank structure. A truncated decomposition yields a projection matrix up to a nonsingular transformation $G$.

$$ \begin{aligned} Q &= U \Sigma V^{\top} \\ W H^{\top} &= U \Sigma V^{\top} \\ W H^{\top} V &= U \Sigma \\ G &= H^{\top} V, \qquad G \in \mathbb{R}^{h \times h} \\ \tilde{W} &= W G = U \Sigma \end{aligned} $$
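As a concrete illustration, the sketch below runs this pipeline on synthetic data: it forms a low-rank logits matrix, estimates $h$ from the largest drop in the singular spectrum, and builds $\tilde{W} = U \Sigma$ from the truncated factors. The variable names and the drop heuristic are illustrative assumptions, not the repository's exact API.

```python
# Minimal sketch of algebraic extraction on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 64, 8, 256                 # vocab size, hidden dim, number of queries

W = rng.standard_normal((V, h))      # unknown projection matrix
H = rng.standard_normal((n, h))      # unknown hidden representations
Q = W @ H.T                          # observed logits matrix, one column per query

# Inspect the singular spectrum: a pronounced drop marks the hidden dimension.
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
drops = S[:-1] / np.maximum(S[1:], 1e-12)
h_hat = int(np.argmax(drops)) + 1    # index after which the singular values collapse
print("estimated hidden dimension:", h_hat)

# Truncated decomposition yields W up to an unknown nonsingular matrix G.
W_tilde = U[:, :h_hat] * S[:h_hat]   # equals W @ G with G = H^T V_h
```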

Geometric extraction

Normalizing layers (RMSNorm / LayerNorm) project activations onto a sphere. The resulting ellipsoidal constraints define a linear system whose solution recovers the folded projection matrix up to orthogonality:

$$ \tilde{W}_\gamma = W_\gamma O^{\top}, $$

where $W_\gamma$ folds the projection matrix with normalization weights $\gamma$ and $O$ is an orthogonal matrix.
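A minimal sketch of this idea on synthetic data follows, assuming the normalized hidden states have exactly unit norm and the logits carry no bias. Working in the rank-$h$ column space of $Q$, every query contributes one linear constraint on a symmetric matrix $M$; factoring the least-squares solution yields $\tilde{W}_\gamma = W_\gamma O^{\top}$. The helper names, the matrix $M$, and the synthetic dimensions are illustrative, not the repository's implementation.

```python
# Hedged sketch of geometric extraction on synthetic data: recover W_gamma up to
# an orthogonal factor, assuming q = W_gamma @ h_bar with ||h_bar||_2 = 1 and no bias.
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 64, 8, 2000                       # vocab size, hidden dim, number of queries

W_gamma = rng.standard_normal((V, h))       # ground-truth folded projection (unknown to attacker)
H_bar = rng.standard_normal((n, h))
H_bar /= np.linalg.norm(H_bar, axis=1, keepdims=True)   # points on the unit sphere
Q = W_gamma @ H_bar.T                       # observed logits, shape (V, n)

# 1) Work inside the rank-h column space of Q.
U_h = np.linalg.svd(Q, full_matrices=False)[0][:, :h]
Z = U_h.T @ Q                               # (h, n) low-dimensional coordinates

# 2) Each query gives a linear constraint z^T M z = 1 on a symmetric matrix M.
idx_i, idx_j = np.triu_indices(h)
scale = np.where(idx_i == idx_j, 1.0, 2.0)  # off-diagonal unknowns appear twice in z^T M z
A = np.empty((n, idx_i.size))
for k in range(n):
    outer = np.outer(Z[:, k], Z[:, k])
    A[k] = scale * outer[idx_i, idx_j]
m, *_ = np.linalg.lstsq(A, np.ones(n), rcond=None)

M = np.zeros((h, h))
M[idx_i, idx_j] = m
M = M + M.T - np.diag(np.diag(M))

# 3) Factor M = B^T B; any such B differs from the true transform by an orthogonal O.
evals, evecs = np.linalg.eigh(M)
B = np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

# 4) Recovered projection, equal to W_gamma @ O^T up to numerical error.
W_tilde = U_h @ np.linalg.inv(B)

# Sanity check on an orthogonal-invariant quantity: W W^T is preserved.
rel = np.linalg.norm(W_tilde @ W_tilde.T - W_gamma @ W_gamma.T) / np.linalg.norm(W_gamma @ W_gamma.T)
print("relative error of W_gamma W_gamma^T:", rel)
```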

Representation recovery

Targets post-ReLU, normalized hidden states. Combines Givens-parameterized orthogonal matrices, an orthogonal Procrustes alignment step, and Adam to align recovered states with the ground-truth structure under sparsity and normalization constraints. The optimization variable $\theta$ encodes the Givens rotation angles, and $X$ is a known matrix used during optimization, derived in Section 3.3 of the paper.

$$ X = Q^\top \cdot W_{\gamma}^{\dagger \top} = \bar{H} \cdot O^\top $$

Pseudocode: Factorization algorithm

Input: $X \in \mathbb{R}^{n \times h}$, epochs $E$, learning rate $lr$, alternating steps $T$

Output: $\hat{H}$, $\hat{O}$

θ = init_theta(h)

# Phase 1: gradient-based search over the Givens rotation angles with Adam.
for epoch in range(E):
    Ô = build_O_from_givens(θ)
    loss = objective(X, Ô)
    θ = adam_step(θ, grad(loss, θ))
    if diverged(loss):
        break

Ô = build_O_from_givens(θ)
Ĥ = X @ Ô

# Phase 2: alternating refinement with orthogonal Procrustes.
for _ in range(T):
    Ĥ = clamp_and_normalize(Ĥ)   # enforce nonnegativity and unit row norms
    Ô = procrustes(X, Ĥ)
    Ĥ = X @ Ô

Ĥ = clamp_and_normalize(Ĥ)
return Ĥ, Ô

Loss function:

$$ \mathcal{L} = \lambda_{1} \mathcal{L}_1 + \lambda_{2} \mathcal{L}_2, $$

with

$$ \begin{aligned} \mathcal{L}_1 &= \sum_{i,j} \max(0, -\hat{H}_{ij}) \\ \mathcal{L}_2 &= \sum_{i=1}^{n} \left(\lVert \hat{H}[i,:]\rVert_2 - 1\right)^2. \end{aligned} $$
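For concreteness, the following is a hedged PyTorch sketch of the two-phase procedure above: a gradient phase over Givens angles optimized with Adam, followed by alternating clamp-and-normalize / Procrustes refinement. Function names, hyperparameter defaults, and the synthetic demo are assumptions for illustration and may differ from the repository's implementation.

```python
# Hedged PyTorch sketch of the factorization pseudocode above (illustrative only).
import torch

def build_O_from_givens(theta, h):
    """Compose an h x h orthogonal matrix from h*(h-1)/2 Givens rotation angles."""
    O = torch.eye(h, dtype=theta.dtype)
    k = 0
    for i in range(h):
        for j in range(i + 1, h):
            c, s = torch.cos(theta[k]), torch.sin(theta[k])
            G = torch.eye(h, dtype=theta.dtype)
            G[i, i], G[j, j] = c, c
            G[i, j], G[j, i] = -s, s
            O = O @ G
            k += 1
    return O

def objective(X, O, lam1=1.0, lam2=1.0):
    """Loss from above: penalize negative entries and non-unit row norms of H = X O."""
    H = X @ O
    l1 = torch.clamp(-H, min=0).sum()
    l2 = ((H.norm(dim=1) - 1.0) ** 2).sum()
    return lam1 * l1 + lam2 * l2

def clamp_and_normalize(H):
    H = torch.clamp(H, min=0)
    return H / H.norm(dim=1, keepdim=True).clamp_min(1e-12)

def procrustes(X, H):
    """Orthogonal Procrustes: the O minimizing ||X O - H||_F is U V^T from the SVD of X^T H."""
    U, _, Vh = torch.linalg.svd(X.T @ H)
    return U @ Vh

def factorize(X, h, epochs=200, lr=1e-2, T=20):
    theta = torch.zeros(h * (h - 1) // 2, dtype=X.dtype, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(epochs):                       # Phase 1: gradient search over angles
        opt.zero_grad()
        loss = objective(X, build_O_from_givens(theta, h))
        loss.backward()
        opt.step()
    with torch.no_grad():
        O = build_O_from_givens(theta, h)
        H = X @ O
        for _ in range(T):                        # Phase 2: alternating Procrustes refinement
            H = clamp_and_normalize(H)
            O = procrustes(X, H)
            H = X @ O
        return clamp_and_normalize(H), O

# Synthetic usage example (hypothetical dimensions): X = H_bar @ O_true^T.
n, h = 512, 6
H_bar = torch.relu(torch.randn(n, h, dtype=torch.float64))
H_bar[H_bar.sum(dim=1) == 0, 0] = 1.0             # avoid all-zero rows before normalizing
H_bar = H_bar / H_bar.norm(dim=1, keepdim=True)
O_true = torch.linalg.qr(torch.randn(h, h, dtype=torch.float64)).Q
X = H_bar @ O_true.T
H_hat, O_hat = factorize(X, h)
# H_hat may still differ from H_bar by a column permutation; the evaluation section
# below resolves this with Hungarian matching before scoring.
print("constraint violation of X @ O_hat:", objective(X, O_hat).item())
```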


Evaluation and Experiments

Singular value spectrum analysis

  • Autoencoder architectures with bottlenecks between decoder layers provide promising settings for spectral analyses across the stack.
  • Over 200 experimental variants explore pretraining methods, weight initialization algorithms, activation functions, floating-point precision settings, SVD drivers, and more.
  • Robust recovery of the last hidden dimension and latent size is consistently achievable under the stated assumptions.
  • In several configurations, all decoder layer sizes become visible in the singular spectrum, as shown in Figure 1.

Figure 1. Best spectra obtained: an example spectrum revealing each decoder dimension.

Key insights

  • Higher precision (fp64) consistently improves separability of singular values (see the sketch after this list).
  • Orthogonal initialization exposes spectral structure more clearly than pretrained weights.
  • Iterative recovery beyond the last layer fails because intervening nonsingular transformations obscure upstream structure.
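A quick illustration of the precision effect (arbitrary synthetic dimensions, not the repository's experimental configuration): compute the same low-rank logits matrix in fp32 and fp64 and compare the singular-value gap at the true hidden dimension.

```python
# Illustrative only: the gap between the h-th and (h+1)-th singular values is
# limited by floating-point round-off, so fp64 separates the spectrum far better.
import numpy as np

rng = np.random.default_rng(0)
V, h, n = 512, 32, 1024
W = rng.standard_normal((V, h))
H = rng.standard_normal((n, h))

for dtype in (np.float32, np.float64):
    Q = W.astype(dtype) @ H.T.astype(dtype)
    S = np.linalg.svd(Q, compute_uv=False)
    print(np.dtype(dtype).name, "gap S[h-1]/S[h]:", S[h - 1] / S[h])
```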

Representation recovery metrics

| Metric | Formula |
| --- | --- |
| Relative Frobenius Error | $f_{\mathrm{frob}}(\bar{H}, \tilde{H}) = \frac{\lVert \bar{H} - \tilde{H} \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| Mean Cosine Similarity | $f_{\mathrm{cos}}(\bar{H}, \tilde{H}) = \frac{1}{k} \sum_{j=1}^{k} \frac{\bar{H}[:,j]^\top \tilde{H}[:,j]}{\lVert \bar{H}[:,j] \rVert_2 \lVert \tilde{H}[:,j] \rVert_2}$ |
| Zero Overlap | $f_{\mathrm{zero}}(\bar{H}, \tilde{H}) = \frac{\lvert \{(i,j) : \bar{H}[i,j] = 0 \land \tilde{H}[i,j] = 0\} \rvert}{\lvert \{(i,j) : \bar{H}[i,j] = 0\} \rvert}$ |
| Random Baseline | $f_{\mathrm{rand}}(\bar{H}) = \frac{1}{n} \sum_{k=1}^{n} \frac{\lVert \bar{H} - R_k \rVert_F}{\lVert \bar{H} \rVert_F}$ |
| Relative Reconstruction Error | $f_{\mathrm{rel}} = \frac{\lVert \bar{H} O^{\top} - \hat{H} \hat{O}^{\top} \rVert_F}{\lVert \bar{H} O^{\top} \rVert_F}$ |

The evaluation pipeline reports cosine similarity, relative Frobenius error, zero-pattern agreement, and random baselines. Hungarian matching resolves permutation ambiguity before scoring. Cosine similarity reaches approximately 0.86 on the smallest matrices, while the best-case relative Frobenius error is around 0.6 after alignment.
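A minimal sketch of the scoring step follows, assuming a cosine-similarity cost for the Hungarian column matching (the repository's exact matching cost is not specified here), with the first three metrics from the table above.

```python
# Hedged sketch of the scoring step; function names and the matching cost are
# illustrative assumptions, not the repository's exact code.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_columns(H_true, H_rec):
    """Permute columns of H_rec to best align with H_true (resolves permutation ambiguity)."""
    Ht = H_true / (np.linalg.norm(H_true, axis=0, keepdims=True) + 1e-12)
    Hr = H_rec / (np.linalg.norm(H_rec, axis=0, keepdims=True) + 1e-12)
    cost = -(Ht.T @ Hr)                      # negative cosine similarity between columns
    _, cols = linear_sum_assignment(cost)
    return H_rec[:, cols]

def relative_frobenius_error(H_true, H_rec):
    return np.linalg.norm(H_true - H_rec) / np.linalg.norm(H_true)

def mean_cosine_similarity(H_true, H_rec):
    num = np.sum(H_true * H_rec, axis=0)
    den = np.linalg.norm(H_true, axis=0) * np.linalg.norm(H_rec, axis=0) + 1e-12
    return float(np.mean(num / den))

def zero_overlap(H_true, H_rec, tol=1e-8):
    true_zeros = np.abs(H_true) < tol
    both_zero = true_zeros & (np.abs(H_rec) < tol)
    return both_zero.sum() / max(true_zeros.sum(), 1)
```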


Figure 2. Summary of experimental results: (a) relative Frobenius error, (b) mean cosine similarity, (c) zero overlap, (d) random baseline, (e) relative reconstruction error, and (f) computation time heatmaps.


Notation

| Symbol | Description | Shape |
| --- | --- | --- |
| $n$ | Number of queries | $-$ |
| $V$ | Vocabulary size (logits dimension) | $-$ |
| $h$ | Ground-truth hidden dimension | $-$ |
| $Q$ | Matrix of queried logits | $V \times n$ |
| $A^\dagger$ | Moore-Penrose inverse: $A^\dagger = \left(A^{\top} A\right)^{-1} A^{\top}$ | shape of $A^\top$ |
| $U, \Sigma, V$ | SVD factors of $Q$ | $V \times h$, $h \times h$, $n \times h$ |
| $W$ | Projection matrix | $V \times h$ |
| $\tilde{W}$ | Recovered projection matrix | $V \times h$ |
| $H$ | Hidden representations corresponding to $Q$ | $n \times h$ |
| $G$ | Unknown nonsingular matrix | $h \times h$ |
| $W_\gamma$ | Projection matrix folded with normalization weights | $V \times h$ |
| $O$ | Unknown orthogonal alignment matrix | $h \times h$ |
| $\bar{H}$ | Ground-truth normalized hidden representations | $n \times h$ |
| $X$ | Known matrix, product of $\left(Q^\top, W_{\gamma}^{\dagger \top}\right)$ or $\left(\bar{H}, O^\top\right)$ | $n \times h$ |
| $\theta$ | Vector of Givens rotation angles | $\frac{h (h - 1)}{2} \times 1$ |
| $\hat{H}$ | Recovered hidden representations | $n \times h$ |
| $\hat{O}$ | Estimated orthogonal factor | $h \times h$ |

References

  • Carlini et al., 2024. Stealing Part of a Production Language Model.
  • Finlayson et al., 2024. Logits of API-Protected LLMs Leak Proprietary Information.
  • Shalit & Chechik, 2013. Coordinate-Descent for Learning Orthogonal Matrices through Givens Rotations.
