# Security Review: Result Authentication and In-Process Secret Exposure

## Summary

`pygpubench` is a substantial improvement over a pure-Python evaluator, but it still keeps trusted benchmark state inside the same process as untrusted submission code.

That remaining trust-boundary issue appears to allow a malicious submission to forge benchmark output without ever running the intended kernel: the submission can recover the child-process result-authentication secret from memory and write attacker-controlled results into the inherited result channel before the real benchmark loop completes.

This document is intentionally disclosure-oriented:

- it does **not** include exploit payloads
- it does **not** include step-by-step reproduction code
- it focuses on the issue, impact, and remediation options
## Main Finding

The current isolated benchmarking path still relies on a secret known to the worker process itself.

At a high level:

1. the parent process creates a benchmark subprocess
2. a result-authentication secret is delivered to the child
3. the child stores that secret in process memory
4. untrusted Python submission code runs in that same process
5. the child still has access to the result channel used to report benchmark results

That means the worker can potentially:

- recover the secret from its own address space
- emit forged benchmark results carrying the correct authentication material
- cause the parent to accept attacker-controlled timings

In other words, the integrity of the result channel still depends on trusting code inside the process being benchmarked.
## Why This Matters

If the child process can authenticate arbitrary forged results, then:

- reported timings no longer prove that the benchmarked kernel actually ran
- reported error counts no longer prove that correctness checks actually passed
- benchmark output can be made arbitrarily small or otherwise attacker-controlled

This is a benchmark-integrity failure, not just a local implementation bug.

## Additional Weaknesses Observed

The signature/authentication issue is the primary concern. Several secondary findings make exploitation easier or increase future risk:
### 1. GC-visible benchmark metadata

At import time, Python-visible objects can still reveal useful benchmark structure such as:

- number of repeats
- output tensor metadata
- tolerance information

Even if tensor payloads are protected better than before, metadata leakage reduces attacker uncertainty.

### 2. Warmup predictability

If warmup always uses a deterministic case or stable pointer pattern, a malicious kernel may distinguish warmup from measured iterations and adapt its behavior accordingly.

### 3. NaN wildcard handling

Any checker behavior that treats NaN in expected data as "accept anything" is dangerous. Even if not immediately exploitable through the current path, it creates a latent bypass if expected-output addresses or copies become observable later.

### 4. Overly broad in-process capability

Untrusted Python code still runs with:

- arbitrary `ctypes` access
- process-memory visibility
- inherited file descriptors / pipes
- normal Python runtime introspection

That combination is enough to make "secret inside the same process" a weak design.
## What `pygpubench` Already Improves

This report should not obscure the fact that `pygpubench` already fixes important problems that affect naive Python evaluators.

Compared to an in-process pure-Python benchmark harness, `pygpubench` materially improves resistance against:

- Python monkeypatching of timer objects
- direct patching of Python reference/evaluator functions
- trivial caching of user-visible tensors
- some stream-ordering and L2-cache based reward hacking

So the right framing is:

- the architecture is **better**
- but the remaining secret/result-channel design is still not strong enough for adversarial benchmarking
## Root Cause

The benchmark subprocess is simultaneously:

- the environment running untrusted code
- the holder of trusted result-authentication state

As long as the child both:

1. possesses the authentication material, and
2. can write to the channel accepted by the parent,

the scheme is vulnerable in principle.

The security model still assumes the worker can be trusted with some benchmark-control state. In an adversarial benchmark, it cannot.
## Recommended Fixes

### 1. Do not keep the authentication secret alive inside untrusted execution

Any key, signature, token, or HMAC material used to authenticate results should not remain recoverable after untrusted Python code starts executing.

At minimum:

- generate the key in trusted code
- consume it before importing the submission
- explicitly overwrite it in the child

This reduces the straightforward memory-recovery path.
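A minimal sketch of that key lifecycle, using only the standard library. The names (`make_signer`, `sign_once`) are illustrative, not `pygpubench` APIs, and the approach assumes a single trusted signing operation happens before the submission is imported:

```python
import hashlib
import hmac
import secrets


def make_signer():
    """Create a one-shot signer whose key is wiped after first use."""
    key = bytearray(secrets.token_bytes(32))  # mutable so it can be zeroed

    def sign_once(payload: bytes) -> bytes:
        if not any(key):
            raise RuntimeError("signing key already destroyed")
        tag = hmac.new(bytes(key), payload, hashlib.sha256).digest()
        # Overwrite the key in place before untrusted import. Caveat: the
        # temporary bytes() copy above is immutable and may linger until
        # garbage collection; this narrows the window, it does not close it.
        for i in range(len(key)):
            key[i] = 0
        return tag

    return sign_once


signer = make_signer()
tag = signer(b"benchmark-config-v1")  # trusted setup signs exactly once
# After this point the submission can be imported: even if it locates the
# closure, the key bytes have been overwritten and re-signing fails.
```

The residual-copy caveat in the comment is why this is only the minimal mitigation; the structural fix in the next section is stronger.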
### 2. Move result authentication to a trusted boundary

A stronger fix is to ensure that the worker process never has the ability to authenticate arbitrary forged results.

Two good directions:

- trusted validator/orchestrator process owns the authentication
- worker emits only raw events/results, never authenticated final records

The parent should authenticate data that the worker cannot forge by construction.
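A sketch of the parent-side pattern, with assumed names and an assumed JSON result shape (`timings_ms`, `errors`) rather than the real `pygpubench` wire format. The key exists only in the parent; the worker's output is treated as untrusted bytes, validated, and only then authenticated:

```python
import hashlib
import hmac
import json
import secrets

# Lives only in the parent process; never delivered to the worker.
PARENT_KEY = secrets.token_bytes(32)


def accept_worker_output(raw: bytes) -> dict:
    """Parse raw worker bytes, failing closed on any anomaly."""
    try:
        record = json.loads(raw)
    except ValueError:
        raise RuntimeError("malformed worker output")
    if not isinstance(record, dict) or set(record) != {"timings_ms", "errors"}:
        raise RuntimeError("unexpected result structure")
    return record


def authenticated_record(raw: bytes) -> tuple:
    """Validate, then sign in the parent; the worker cannot forge the tag."""
    record = accept_worker_output(raw)
    tag = hmac.new(PARENT_KEY, raw, hashlib.sha256).digest()
    return record, tag


rec, tag = authenticated_record(b'{"timings_ms": [1.5], "errors": 0}')
```

Because the HMAC is computed after the trust boundary, a compromised worker can at worst emit implausible raw numbers, which validation and cross-checking can catch; it can never mint a parent-acceptable authenticated record.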
### 3. Reduce in-process attack surface

Possible hardening measures:

- restrict inherited file descriptors
- tighten seccomp / syscall policy where practical
- minimize procfs visibility where practical
- reduce or remove unnecessary writable/redirectable channels in the worker

These are secondary mitigations, not substitutes for a correct trust boundary.
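The descriptor-restriction point can be sketched with the standard `subprocess` module on POSIX: `close_fds=True` plus an explicit `pass_fds` whitelist means the child inherits only the one pipe it is supposed to write to. The `make_cmd` callback and the inline worker command are illustrative, not the real worker launch path:

```python
import os
import subprocess


def spawn_worker(make_cmd):
    """Spawn a worker that inherits only a single result pipe.

    make_cmd(fd) must return the argv for a worker that writes its raw
    results to file descriptor fd.
    """
    rfd, wfd = os.pipe()
    proc = subprocess.Popen(
        make_cmd(wfd),
        close_fds=True,          # drop every inherited descriptor...
        pass_fds=(wfd,),         # ...except the single result pipe
        stdin=subprocess.DEVNULL,
    )
    os.close(wfd)                # parent keeps only the read end
    return proc, rfd


# Illustrative worker: writes raw bytes to the whitelisted descriptor.
proc, rfd = spawn_worker(
    lambda fd: ["python3", "-c", f"import os; os.write({fd}, b'raw-result')"]
)
proc.wait()
data = os.read(rfd, 64)
os.close(rfd)
```

This only addresses the descriptor item on the list above; seccomp and procfs hardening need OS-level tooling outside a short Python sketch.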
### 4. Remove or minimize metadata leakage

Before importing the submission:

- drop Python references that are no longer needed
- clean up transient objects that reveal benchmark layout
- avoid keeping expected-output metadata visible through ordinary Python object traversal
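A sketch of that pre-import scrub. The namespace dictionary and key names here are hypothetical stand-ins for whatever trusted state the harness actually holds; the point is to delete references and force a collection so the submission cannot walk gc-tracked objects back to benchmark metadata:

```python
import gc


def scrub_before_import(namespace: dict, trusted_keys: tuple) -> None:
    """Delete trusted benchmark state from a namespace before user import."""
    for key in trusted_keys:
        namespace.pop(key, None)
    # Collect now so cycles that still reference the dropped objects are
    # freed before untrusted code can call gc.get_objects().
    gc.collect()


# Illustrative trusted state (not real pygpubench fields):
state = {"repeats": 100, "tolerances": {"atol": 1e-5}, "kernel_name": "gemm"}
scrub_before_import(state, ("repeats", "tolerances"))
```

Note this only removes Python-visible references; data already copied into native buffers needs separate handling.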
### 5. Randomize warmup and pre-measurement behavior

Avoid deterministic warmup patterns that allow the kernel to distinguish:

- warmup passes
- timing-estimation passes
- real benchmark passes
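One way to sketch this: draw the warmup count and case order from a CSPRNG so the schedule carries no marker a kernel could key on. `plan_iterations` and its parameters are illustrative, not a `pygpubench` interface:

```python
import secrets


def plan_iterations(cases, min_warmup=3, extra_warmup=5, repeats=10):
    """Return (warmup_count, schedule) with no detectable phase boundary.

    Warmup passes draw random cases and the warmup length itself is
    randomized, so a kernel cannot count iterations to find the boundary.
    """
    rng = secrets.SystemRandom()  # CSPRNG-backed, not seedable by the worker
    warmup = min_warmup + rng.randrange(extra_warmup + 1)
    schedule = [rng.choice(cases) for _ in range(warmup)]
    for _ in range(repeats):
        # Each measured repeat visits every case in a fresh random order.
        schedule.extend(rng.sample(cases, len(cases)))
    return warmup, schedule


warmup, schedule = plan_iterations(["case_a", "case_b"], repeats=4)
```

Randomizing input pointers and buffer placement between passes would complement this, since the text above notes that stable pointer patterns are themselves a distinguisher.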
### 6. Fail closed on suspicious checker states

Examples:

- NaN in expected outputs should be treated as a benchmark-generation failure, not a wildcard
- malformed or incomplete child output should fail hard
- any mismatch in result structure should fail hard
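The NaN policy can be sketched in a few lines. A real checker would operate on device tensors rather than a Python list, but the fail-closed rule is the same: NaN in the reference data aborts the benchmark instead of becoming an accept-anything wildcard:

```python
import math


def check_expected(expected) -> bool:
    """Reject reference outputs containing NaN instead of wildcarding them."""
    if any(math.isnan(float(x)) for x in expected):
        raise RuntimeError(
            "benchmark generation failure: NaN in expected output"
        )
    return True
```

With an array library this would be a single reduction (e.g. an any-NaN check over the tensor) run at generation time, before any submission output is compared.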
## Suggested Architectural Direction

### Option A: Trusted validator split

Use a three-role model:

- orchestrator: trusted, owns benchmark policy and result acceptance
- worker: untrusted, only runs the kernel
- validator: trusted, checks correctness and/or authenticates the final result

The worker should not be able to independently produce a parent-acceptable final benchmark record.

### Option B: Transitional hardening

If a full redesign is not immediately feasible:

1. remove recoverable result-authentication state before user import
2. aggressively reduce inherited descriptors/capabilities
3. clear Python-visible benchmark metadata before import
4. add tamper detection for result-channel anomalies

This would still be weaker than a proper split-process trust model, but materially better than the current design.
## Recommended Next Steps

1. treat this as a security bug affecting adversarial benchmark integrity
2. review the result-authentication path end-to-end
3. patch the child secret lifetime / ownership problem first
4. then follow up with capability reduction and metadata cleanup
5. publish a brief security note once the fix lands

## Scope Of This Document

This file is intended to support remediation planning inside the repository.

It does **not**:

- provide exploit code
- attribute methods to any third party
- claim that every theoretical path has been weaponized

It only records that the current result-authentication design remains vulnerable because trusted benchmark state is still exposed inside the untrusted worker process.