# Security Review: Result Authentication and In-Process Secret Exposure

## Summary

`pygpubench` is a substantial improvement over a pure-Python evaluator, but it still keeps trusted benchmark state inside the same process as untrusted submission code.

That remaining trust-boundary issue appears to allow a malicious submission to forge benchmark output without ever running the intended kernel: the submission can recover the child-process result-authentication secret from memory and write attacker-controlled results into the inherited result channel before the real benchmark loop completes.

This document is intentionally disclosure-oriented:

- it does **not** include exploit payloads
- it does **not** include step-by-step reproduction code
- it focuses on the issue, impact, and remediation options
## Main Finding

The current isolated benchmarking path still relies on a secret known to the worker process itself.

At a high level:

1. the parent process creates a benchmark subprocess
2. a result-authentication secret is delivered to the child
3. the child stores that secret in process memory
4. untrusted Python submission code runs in that same process
5. the child still has access to the result channel used to report benchmark results

That means the worker can potentially:

- recover the secret from its own address space
- emit forged benchmark results carrying the correct authentication material
- cause the parent to accept attacker-controlled timings

In other words, the integrity of the result channel still depends on trusting code inside the process being benchmarked.
## Why This Matters

If the child process can authenticate arbitrary forged results, then:

- reported timings no longer prove that the benchmarked kernel actually ran
- reported error counts no longer prove that correctness checks actually passed
- benchmark output can be made arbitrarily small or otherwise attacker-controlled

This is a benchmark-integrity failure, not just a local implementation bug.

## Additional Weaknesses Observed

The signature/authentication issue is the primary concern. Several secondary findings make exploitation easier or increase future risk:
### 1. GC-visible benchmark metadata

At import time, Python-visible objects can still reveal useful benchmark structure such as:

- number of repeats
- output tensor metadata
- tolerance information

Even if tensor payloads are protected better than before, metadata leakage reduces attacker uncertainty.

### 2. Warmup predictability

If warmup always uses a deterministic case or stable pointer pattern, a malicious kernel may distinguish warmup from measured iterations and adapt its behavior accordingly.

### 3. NaN wildcard handling

Any checker behavior that treats NaN in expected data as "accept anything" is dangerous. Even if not immediately exploitable through the current path, it creates a latent bypass if expected-output addresses or copies become observable later.

### 4. Overly broad in-process capability

Untrusted Python code still runs with:

- arbitrary `ctypes` access
- process-memory visibility
- inherited file descriptors / pipes
- normal Python runtime introspection

That combination is enough to make "secret inside the same process" a weak design.
## What `pygpubench` Already Improves

This report should not obscure the fact that `pygpubench` already fixes important problems that affect naive Python evaluators.

Compared to an in-process pure-Python benchmark harness, `pygpubench` materially improves resistance against:

- Python monkeypatching of timer objects
- direct patching of Python reference/evaluator functions
- trivial caching of user-visible tensors
- some stream-ordering and L2-cache based reward hacking

So the right framing is:

- the architecture is **better**
- but the remaining secret/result-channel design is still not strong enough for adversarial benchmarking
## Root Cause

The benchmark subprocess is simultaneously:

- the environment running untrusted code
- the holder of trusted result-authentication state

As long as the child both:

1. possesses the authentication material, and
2. can write to the channel accepted by the parent,

the scheme is vulnerable in principle.

The security model still assumes the worker can be trusted with some benchmark-control state. In an adversarial benchmark, it cannot.
## Recommended Fixes

### 1. Do not keep the authentication secret alive inside untrusted execution

Any key, signature, token, or HMAC material used to authenticate results should not remain recoverable after untrusted Python code starts executing.

At minimum:

- generate the key in trusted code
- consume it before importing the submission
- explicitly overwrite it in the child

This reduces the straightforward memory-recovery path.
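A minimal sketch of that key lifecycle, using only the standard library. The names (`make_signer`, `sign_once`) are illustrative, not `pygpubench` APIs, and the approach assumes a single trusted signing operation happens before the submission is imported:

```python
import hashlib
import hmac
import secrets


def make_signer():
    """Create a one-shot signer whose key is wiped after first use."""
    key = bytearray(secrets.token_bytes(32))  # mutable so it can be zeroed

    def sign_once(payload: bytes) -> bytes:
        if not any(key):
            raise RuntimeError("signing key already destroyed")
        tag = hmac.new(bytes(key), payload, hashlib.sha256).digest()
        # Overwrite the key in place before untrusted import. Caveat: the
        # temporary bytes() copy above is immutable and may linger until
        # garbage collection; this narrows the window, it does not close it.
        for i in range(len(key)):
            key[i] = 0
        return tag

    return sign_once


signer = make_signer()
tag = signer(b"benchmark-config-v1")  # trusted setup signs exactly once
# After this point the submission can be imported: even if it locates the
# closure, the key bytes have been overwritten and re-signing fails.
```

The residual-copy caveat in the comment is why this is only the minimal mitigation; the structural fix in the next section is stronger.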
### 2. Move result authentication to a trusted boundary

A stronger fix is to ensure that the worker process never has the ability to authenticate arbitrary forged results.

Two good directions:

- trusted validator/orchestrator process owns the authentication
- worker emits only raw events/results, never authenticated final records

The parent should authenticate data that the worker cannot forge by construction.
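A sketch of the parent-side pattern, with assumed names and an assumed JSON result shape (`timings_ms`, `errors`) rather than the real `pygpubench` wire format. The key exists only in the parent; the worker's output is treated as untrusted bytes, validated, and only then authenticated:

```python
import hashlib
import hmac
import json
import secrets

# Lives only in the parent process; never delivered to the worker.
PARENT_KEY = secrets.token_bytes(32)


def accept_worker_output(raw: bytes) -> dict:
    """Parse raw worker bytes, failing closed on any anomaly."""
    try:
        record = json.loads(raw)
    except ValueError:
        raise RuntimeError("malformed worker output")
    if not isinstance(record, dict) or set(record) != {"timings_ms", "errors"}:
        raise RuntimeError("unexpected result structure")
    return record


def authenticated_record(raw: bytes) -> tuple:
    """Validate, then sign in the parent; the worker cannot forge the tag."""
    record = accept_worker_output(raw)
    tag = hmac.new(PARENT_KEY, raw, hashlib.sha256).digest()
    return record, tag


rec, tag = authenticated_record(b'{"timings_ms": [1.5], "errors": 0}')
```

Because the HMAC is computed after the trust boundary, a compromised worker can at worst emit implausible raw numbers, which validation and cross-checking can catch; it can never mint a parent-acceptable authenticated record.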
### 3. Reduce in-process attack surface

Possible hardening measures:

- restrict inherited file descriptors
- tighten seccomp / syscall policy where practical
- minimize procfs visibility where practical
- reduce or remove unnecessary writable/redirectable channels in the worker

These are secondary mitigations, not substitutes for a correct trust boundary.
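The descriptor-restriction point can be sketched with the standard `subprocess` module on POSIX: `close_fds=True` plus an explicit `pass_fds` whitelist means the child inherits only the one pipe it is supposed to write to. The `make_cmd` callback and the inline worker command are illustrative, not the real worker launch path:

```python
import os
import subprocess


def spawn_worker(make_cmd):
    """Spawn a worker that inherits only a single result pipe.

    make_cmd(fd) must return the argv for a worker that writes its raw
    results to file descriptor fd.
    """
    rfd, wfd = os.pipe()
    proc = subprocess.Popen(
        make_cmd(wfd),
        close_fds=True,          # drop every inherited descriptor...
        pass_fds=(wfd,),         # ...except the single result pipe
        stdin=subprocess.DEVNULL,
    )
    os.close(wfd)                # parent keeps only the read end
    return proc, rfd


# Illustrative worker: writes raw bytes to the whitelisted descriptor.
proc, rfd = spawn_worker(
    lambda fd: ["python3", "-c", f"import os; os.write({fd}, b'raw-result')"]
)
proc.wait()
data = os.read(rfd, 64)
os.close(rfd)
```

This only addresses the descriptor item on the list above; seccomp and procfs hardening need OS-level tooling outside a short Python sketch.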
### 4. Remove or minimize metadata leakage

Before importing the submission:

- drop Python references that are no longer needed
- clean up transient objects that reveal benchmark layout
- avoid keeping expected-output metadata visible through ordinary Python object traversal
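A sketch of that pre-import scrub. The namespace dictionary and key names here are hypothetical stand-ins for whatever trusted state the harness actually holds; the point is to delete references and force a collection so the submission cannot walk gc-tracked objects back to benchmark metadata:

```python
import gc


def scrub_before_import(namespace: dict, trusted_keys: tuple) -> None:
    """Delete trusted benchmark state from a namespace before user import."""
    for key in trusted_keys:
        namespace.pop(key, None)
    # Collect now so cycles that still reference the dropped objects are
    # freed before untrusted code can call gc.get_objects().
    gc.collect()


# Illustrative trusted state (not real pygpubench fields):
state = {"repeats": 100, "tolerances": {"atol": 1e-5}, "kernel_name": "gemm"}
scrub_before_import(state, ("repeats", "tolerances"))
```

Note this only removes Python-visible references; data already copied into native buffers needs separate handling.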
### 5. Randomize warmup and pre-measurement behavior

Avoid deterministic warmup patterns that allow the kernel to distinguish:

- warmup passes
- timing-estimation passes
- real benchmark passes
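One way to sketch this: draw the warmup count and case order from a CSPRNG so the schedule carries no marker a kernel could key on. `plan_iterations` and its parameters are illustrative, not a `pygpubench` interface:

```python
import secrets


def plan_iterations(cases, min_warmup=3, extra_warmup=5, repeats=10):
    """Return (warmup_count, schedule) with no detectable phase boundary.

    Warmup passes draw random cases and the warmup length itself is
    randomized, so a kernel cannot count iterations to find the boundary.
    """
    rng = secrets.SystemRandom()  # CSPRNG-backed, not seedable by the worker
    warmup = min_warmup + rng.randrange(extra_warmup + 1)
    schedule = [rng.choice(cases) for _ in range(warmup)]
    for _ in range(repeats):
        # Each measured repeat visits every case in a fresh random order.
        schedule.extend(rng.sample(cases, len(cases)))
    return warmup, schedule


warmup, schedule = plan_iterations(["case_a", "case_b"], repeats=4)
```

Randomizing input pointers and buffer placement between passes would complement this, since the text above notes that stable pointer patterns are themselves a distinguisher.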
### 6. Fail closed on suspicious checker states

Examples:

- NaN in expected outputs should be treated as a benchmark-generation failure, not a wildcard
- malformed or incomplete child output should fail hard
- any mismatch in result structure should fail hard
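The NaN policy can be sketched in a few lines. A real checker would operate on device tensors rather than a Python list, but the fail-closed rule is the same: NaN in the reference data aborts the benchmark instead of becoming an accept-anything wildcard:

```python
import math


def check_expected(expected) -> bool:
    """Reject reference outputs containing NaN instead of wildcarding them."""
    if any(math.isnan(float(x)) for x in expected):
        raise RuntimeError(
            "benchmark generation failure: NaN in expected output"
        )
    return True
```

With an array library this would be a single reduction (e.g. an any-NaN check over the tensor) run at generation time, before any submission output is compared.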
## Suggested Architectural Direction

### Option A: Trusted validator split

Use a three-role model:

- orchestrator: trusted, owns benchmark policy and result acceptance
- worker: untrusted, only runs the kernel
- validator: trusted, checks correctness and/or authenticates the final result

The worker should not be able to independently produce a parent-acceptable final benchmark record.

### Option B: Transitional hardening

If a full redesign is not immediately feasible:

1. remove recoverable result-authentication state before user import
2. aggressively reduce inherited descriptors/capabilities
3. clear Python-visible benchmark metadata before import
4. add tamper detection for result-channel anomalies

This would still be weaker than a proper split-process trust model, but materially better than the current design.
## Recommended Next Steps

1. treat this as a security bug affecting adversarial benchmark integrity
2. review the result-authentication path end-to-end
3. patch the child secret lifetime / ownership problem first
4. then follow up with capability reduction and metadata cleanup
5. publish a brief security note once the fix lands

## Scope Of This Document

This file is intended to support remediation planning inside the repository.

It does **not**:

- provide exploit code
- attribute methods to any third party
- claim that every theoretical path has been weaponized

It only records that the current result-authentication design remains vulnerable because trusted benchmark state is still exposed inside the untrusted worker process.