A research toolkit for inferring hidden structure from restricted (black-box) model access.
This repository investigates how much internal information can be recovered from machine-learning services that expose only query responses. We study both theoretical and empirical avenues for reconstructing projection matrices, hidden dimensions, and intermediate representations. The findings inform mitigations for API-protected models and complement recent analyses of production model theft [Carlini et al., 2024] and logit leakage [Finlayson et al., 2024].
🏆 The biggest achievement is the ability to recover hidden dimensions of black-box autoencoder architectures (see Singular value spectrum analysis).
- Works in strict black-box regimes with only input/output access.
- Targets projection matrices, hidden dimensions, and post-normalization activations.
- Provides reproducible experiments, evaluation utilities, and visualization assets.
- Complements the accompanying paper (link here).
- Implements two SVD-driven recovery pipelines:
- Algebraic extraction recovers projection matrices up to a nonsingular transformation.
- Geometric extraction recovers normalized projection matrices up to an orthogonal factor.
- Supplies tooling for spectrum analysis, hidden dimension estimation, data generation, and evaluation.
- Introduces an algorithm for reconstructing hidden representations via Givens rotations [Shalit & Chechik, 2013], the orthogonal Procrustes objective, and Adam optimization.
**Algebraic extraction.** Recovers an approximation of the projection matrix up to a nonsingular transformation. Assumes access to logits and an exploitable end-of-network bottleneck where the final dimension exceeds the penultimate one. Constructs a logits matrix from queried outputs and inspects its singular spectrum to estimate the hidden dimension.
**Geometric extraction.** Normalizing layers (RMSNorm / LayerNorm) project activations onto a sphere. The resulting ellipsoidal constraints define a linear system whose solution recovers the folded projection matrix up to orthogonality: the solution has the form $Q\,\tilde{W}$, where $\tilde{W}$ is the projection matrix folded with the normalization weights and $Q$ is an unknown orthogonal alignment matrix.
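The geometric premise is easy to verify numerically: after RMS normalization every hidden state has the same norm, i.e. it lies on a sphere. The snippet below is a self-contained sketch (dimensions and names are ours) of that fact for a unit-gain RMSNorm.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 64
H = rng.standard_normal((1000, h))   # raw pre-normalization activations

# Unit-gain RMSNorm: x / sqrt(mean(x^2)) rescales every row to norm sqrt(h).
H_norm = H / np.sqrt(np.mean(H**2, axis=1, keepdims=True))

norms = np.linalg.norm(H_norm, axis=1)
print(norms.min(), norms.max())      # both ≈ sqrt(h): all points on one sphere
```

Because every normalized state sits on this sphere, its image under the folded projection $\tilde{W}$ lies on a fixed ellipsoid, and observing enough logit vectors turns that ellipsoid constraint into the linear system described above.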
**Representation reconstruction.** Targets post-ReLU, normalized hidden states. Combines Givens-parameterized orthogonal matrices, an orthogonal Procrustes alignment step, and Adam to align recovered states with the ground-truth structure under sparsity and normalization constraints. The optimization variable is the vector $\theta$ of Givens rotation angles that parameterizes the orthogonal factor $\hat{O}$.
Input: `X` (known matrix), `h` (hidden dimension), `E` (epochs), `T` (refinement steps)
Output: `Ĥ` (recovered hidden representations), `Ô` (estimated orthogonal factor)

```
θ = init_theta(h)
for epoch in range(E):
    Ô = build_O_from_givens(θ)
    loss = objective(X, Ô)
    θ = adam_step(θ, grad(loss, θ))
    if diverged(loss):
        break
Ô = build_O_from_givens(θ)
Ĥ = X @ Ô
for _ in range(T):
    Ĥ = clamp_and_normalize(Ĥ)
    Ô = procrustes(X, Ĥ)
    Ĥ = X @ Ô
Ĥ = clamp_and_normalize(Ĥ)
return Ĥ, Ô
```

Loss function: the objective scores the candidate states $\hat{H} = X\hat{O}$ against the sparsity (post-ReLU zeros) and unit-norm constraints described above.
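The two primitives the pseudocode leans on, a Givens-parameterized orthogonal matrix and the closed-form orthogonal Procrustes solution, can be prototyped compactly. This is our sketch, not the repository's implementation; the function names mirror the pseudocode but the bodies are illustrative.

```python
import numpy as np

def build_O_from_givens(theta, h):
    """Compose an h×h orthogonal matrix from h(h-1)/2 Givens rotations."""
    O = np.eye(h)
    k = 0
    for i in range(h):
        for j in range(i + 1, h):
            c, s = np.cos(theta[k]), np.sin(theta[k])
            G = np.eye(h)
            G[i, i], G[j, j] = c, c
            G[i, j], G[j, i] = -s, s
            O = O @ G
            k += 1
    return O

def procrustes(X, H_hat):
    """Orthogonal Procrustes: argmin over orthogonal O of ||X @ O - H_hat||_F,
    solved in closed form via the SVD of X^T H_hat."""
    U, _, Vt = np.linalg.svd(X.T @ H_hat)
    return U @ Vt

# Toy check: a rotation built from random Givens angles is recovered exactly
# when the target states are a noiseless rotation of X.
rng = np.random.default_rng(2)
h = 8
theta = rng.uniform(-np.pi, np.pi, h * (h - 1) // 2)
O_true = build_O_from_givens(theta, h)
X = rng.standard_normal((200, h))
H_bar = X @ O_true                     # "ground-truth" states
O_hat = procrustes(X, H_bar)
print(np.allclose(X @ O_hat, H_bar))   # True
```

In the full pipeline the Procrustes step runs inside the refinement loop, re-aligning $X$ to the clamped-and-normalized candidate states on each iteration.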
- Autoencoder architectures with bottlenecks between decoder layers provide promising settings for spectral analyses across the stack.
- Over 200 experimental variants explore pretraining methods, weight initialization algorithms, activation functions, floating point precision settings, SVD drivers and more.
- Robust recovery of the last hidden dimension and latent size is consistently achievable under the stated assumptions.
- In several configurations, all decoder layer sizes become visible in the singular spectrum, as shown in Figure 1.
Figure 1. Example spectrum revealing each decoder dimension.
Key insights
- Higher precision (`fp64`) consistently improves separability of singular values.
- Orthogonal initialization exposes spectral structure more clearly than pretrained weights.
- Iterative recovery beyond the last layer fails because intervening nonsingular transformations obscure upstream structure.
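The precision effect is easy to reproduce in isolation: the singular values of a rank-deficient matrix do not vanish exactly but bottom out near the floating-point noise floor, which sits far lower in `fp64` than in `fp32`. A small sketch (our own simulation, not a repository experiment):

```python
import numpy as np

rng = np.random.default_rng(3)
n, v, h = 256, 512, 32
L = rng.standard_normal((n, h)) @ rng.standard_normal((h, v))  # exact rank h

s32 = np.linalg.svd(L.astype(np.float32), compute_uv=False)
s64 = np.linalg.svd(L.astype(np.float64), compute_uv=False)

# Beyond index h-1 the spectrum is numerical noise scaled by machine epsilon,
# so the post-rank tail is many orders of magnitude lower in fp64 and the
# cliff separating "real" from "noise" singular values is much sharper.
print(s32[h] / s32[0])   # fp32 noise floor relative to the top singular value
print(s64[h] / s64[0])   # fp64 noise floor: far smaller
```

This is why the spectrum-based dimension estimates in the experiments separate more cleanly at `fp64`.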
| Metric | Formula |
|---|---|
| Relative Frobenius Error | $\lVert \hat{H} - \bar{H} \rVert_F \,/\, \lVert \bar{H} \rVert_F$ |
| Mean Cosine Similarity | mean over matched columns of $\cos\bigl(\hat{H}_{:,j}, \bar{H}_{:,j}\bigr)$ |
| Zero Overlap | fraction of positions where the zero patterns of $\hat{H}$ and $\bar{H}$ agree |
| Random Baseline | the same metrics computed against a random matrix of matching shape |
| Relative Reconstruction Error | $\lVert X\hat{O} - \bar{H} \rVert_F \,/\, \lVert \bar{H} \rVert_F$ |
The evaluation pipeline reports cosine similarity, relative Frobenius error, zero-pattern agreement, and random baselines. Hungarian matching resolves permutation ambiguity before scoring. Cosine similarity reaches approximately 0.86 on the smallest matrices, while the best-case relative Frobenius error is around 0.6 after alignment.
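The scoring step can be sketched as follows. The repository's pipeline uses Hungarian matching for the permutation; in this self-contained toy (function names and data are ours) a brute-force search over column permutations stands in, which is equivalent for small matrices.

```python
import numpy as np
from itertools import permutations

def relative_frobenius_error(A, B):
    return np.linalg.norm(A - B) / np.linalg.norm(B)

def mean_cosine_similarity(A, B):
    num = np.sum(A * B, axis=0)
    den = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return float(np.mean(num / den))

def best_column_permutation(A, B):
    """Brute-force stand-in for Hungarian matching on small matrices:
    permute A's columns to minimize the Frobenius error against B."""
    best = min(permutations(range(A.shape[1])),
               key=lambda p: relative_frobenius_error(A[:, list(p)], B))
    return A[:, list(best)]

# Recovered states = a column permutation of the ground truth plus small noise.
rng = np.random.default_rng(4)
H_bar = rng.standard_normal((50, 4))
H_hat = H_bar[:, [2, 0, 3, 1]] + 0.01 * rng.standard_normal((50, 4))

aligned = best_column_permutation(H_hat, H_bar)
print(relative_frobenius_error(aligned, H_bar))  # small once permutation is resolved
print(mean_cosine_similarity(aligned, H_bar))    # close to 1 after alignment
```

Without the matching step the same pair of matrices would score near zero cosine similarity, which is why permutation ambiguity must be resolved before any metric is reported.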
Figure 2. Summary of experimental results.
| Symbol | Description | Shape |
|---|---|---|
| $n$ | Number of queries | scalar |
| $v$ | Vocabulary size (logits dimension) | scalar |
| $h$ | Ground-truth hidden dimension | scalar |
| $L$ | Matrix of queried logits | $n \times v$ |
| $L^{+}$ | Moore-Penrose inverse of $L$ | $v \times n$ |
| $U, \Sigma, V^{\top}$ | SVD factors of $L$ | $n \times n$, $n \times v$, $v \times v$ |
| $W$ | Projection matrix | $h \times v$ |
| $\hat{W}$ | Recovered projection matrix | $h \times v$ |
| $H$ | Hidden representations corresponding to $L$ | $n \times h$ |
| $A$ | Unknown nonsingular matrix | $h \times h$ |
| $\tilde{W}$ | Projection matrix folded with normalization weights | $h \times v$ |
| $Q$ | Unknown orthogonal alignment matrix | $h \times h$ |
| $\bar{H}$ | Ground-truth normalized hidden representations | $n \times h$ |
| $X$ | Known matrix, product of … | $n \times h$ |
| $\theta$ | Vector of Givens rotation angles | $h(h-1)/2$ |
| $\hat{H}$ | Recovered hidden representations | $n \times h$ |
| $\hat{O}$ | Estimated orthogonal factor | $h \times h$ |
- Carlini, N., Paleka, D., Dvijotham, K. D., Steinke, T., Hayase, J., Cooper, A. F., Lee, K., Jagielski, M., Nasr, M., Conmy, A., Yona, I., Wallace, E., Rolnick, D., & Tramèr, F. (2024). Stealing Part of a Production Language Model. arXiv:2403.06634. https://arxiv.org/abs/2403.06634
- Finlayson, M., Ren, X., & Swayamdipta, S. (2024). Logits of API-Protected LLMs Leak Proprietary Information. arXiv:2403.09539. https://arxiv.org/abs/2403.09539
- Wikipedia contributors. (2023). Orthogonal Procrustes problem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem
- Shalit, U., & Chechik, G. (2013). Efficient coordinate-descent for orthogonal matrices through Givens rotations. arXiv:1312.0624. https://arxiv.org/abs/1312.0624