
ArgPack use-after-free: LaunchContextBuilder stores raw pointer, GC can free during kernel launch #8788

@chkxw

Summary

LaunchContextBuilder::argpack_ptrs stores a raw const ArgPack * pointer without any ownership semantics. On the CUDA/LLVM backend, Program::delete_argpack() always deletes immediately because LlvmProgramImpl does not override used_in_kernel() (base class returns false). If Python's garbage collector frees an ArgPack wrapper between set_arg_argpack() and the actual kernel launch, the CUDA kernel launcher dereferences a dangling pointer — causing wild writes and memory corruption.

This bug can cause random SEGVs or heap corruption in any workload that uses ArgPack with the CUDA backend, especially under GC pressure (many kernel launches, multiple threads).

Root Cause

1. Raw pointer storage without ownership

// launch_context_builder.h:137-140
std::unordered_map<std::vector<int>,
                   const ArgPack *,  // RAW POINTER — no ref counting
                   hashing::Hasher<std::vector<int>>>
    argpack_ptrs;

// launch_context_builder.cpp:252-254
void LaunchContextBuilder::set_arg_argpack(const std::vector<int> &arg_id,
                                           const ArgPack &argpack) {
  argpack_ptrs[arg_id] = &argpack;  // stores address, no ownership

2. Dangling pointer dereference during kernel launch

// cuda/kernel_launcher.cpp:127-138
auto *argpack = ctx.argpack_ptrs[key];           // dangling if GC'd
auto argpack_ptr = argpack->get_device_allocation();  // wild read
// ...
auto *argpack_parent = ctx.argpack_ptrs[key_parent];
argpack_parent->set_arg_nested_argpack_ptr(       // wild WRITE
    key.back(), (uint64)device_ptrs[data_ptr_idx]);

The same pattern appears in cpu/kernel_launcher.cpp:48-63, amdgpu/kernel_launcher.cpp:82-94, and gfx/runtime.cpp:486-490.

3. used_in_kernel() guard is unimplemented on LLVM backends

// program_impl.h:103-105 (base class)
virtual bool used_in_kernel(DeviceAllocationId) {
    return false;  // ALWAYS FALSE on CUDA/CPU/AMDGPU
}

Only GfxProgramImpl overrides this (gfx_program.h:47-48). LlvmProgramImpl inherits the base class, so delete_argpack() (program.cpp:428-443) always deletes immediately on CUDA/CPU/AMDGPU backends, regardless of pending kernel launches.

4. Python GC triggers immediate C++ destruction

# argpack.py:76-78
def __del__(self):
    if impl is not None and impl.get_runtime() is not None and impl.get_runtime().prog is not None:
        impl.get_runtime().prog.delete_argpack(self.__argpack)

The race

  1. Python calls set_arg_argpack() → raw &argpack stored in argpack_ptrs
  2. Python GC runs (e.g., triggered by allocation pressure in another thread)
  3. GC collects ArgPack Python wrapper → __del__ → delete_argpack() → used_in_kernel() returns false → C++ ArgPack freed
  4. Kernel launcher dereferences dangling argpack_ptrs[key] → wild write / SEGV
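
The sequence above can be modelled in pure Python to make the ownership gap concrete. This is only an illustrative sketch: FakeRuntime, FakeArgPack and FakeLaunchContext are hypothetical stand-ins, not Taichi classes, and an integer handle plays the role of the raw const ArgPack * pointer. Instead of dereferencing freed memory, the model reports which handles are stale at launch time.

```python
import gc


class FakeRuntime:
    """Hypothetical stand-in for the native side: owns objects keyed by handle."""

    def __init__(self):
        self.live = {}
        self._next_handle = 1

    def create(self, payload):
        handle = self._next_handle
        self._next_handle += 1
        self.live[handle] = payload
        return handle

    def delete(self, handle):
        # Mirrors the reported behaviour: no check for pending launches.
        self.live.pop(handle, None)


class FakeArgPack:
    """Hypothetical wrapper whose finalizer frees the native object at once."""

    def __init__(self, runtime, payload):
        self.runtime = runtime
        self.handle = runtime.create(payload)

    def __del__(self):
        self.runtime.delete(self.handle)


class FakeLaunchContext:
    """Stores only the handle (like a raw pointer), never the wrapper itself."""

    def __init__(self, runtime):
        self.runtime = runtime
        self.argpack_handles = {}

    def set_arg_argpack(self, arg_id, pack):
        self.argpack_handles[tuple(arg_id)] = pack.handle

    def launch(self):
        # A real launcher would read or write freed memory here; the model
        # returns the handles that no longer refer to a live object.
        return [h for h in self.argpack_handles.values()
                if h not in self.runtime.live]


def demo_race():
    runtime = FakeRuntime()
    ctx = FakeLaunchContext(runtime)
    pack = FakeArgPack(runtime, payload=b"args")
    ctx.set_arg_argpack([0], pack)
    del pack      # last reference dropped between set_arg_argpack() and launch
    gc.collect()  # only matters on interpreters without reference counting
    return ctx.launch()


stale = demo_race()
print("stale handles at launch:", stale)
```

In the model the launch context never extends the wrapper's lifetime, so whatever drops the last Python reference decides when the native object dies.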

Design inconsistency

set_arg_ndarray() follows a safe pattern — it copies the data pointer as an integer:

// launch_context_builder.cpp:246
intptr_t ptr = arr.get_device_allocation_ptr_as_int();  // copies VALUE, safe

set_arg_argpack() deviates from this by storing an object pointer (&argpack), which is unsafe.

Evidence

Observed in a multi-threaded MARL training workload using Genesis physics simulation with Taichi CUDA backend (2048 parallel envs). 12 independent crashes over weeks of debugging:

  • Random SEGV in Python GC (tp_traverse NULL, object type confusion: range→code, dict→bool)
  • ASan (LD_PRELOAD) reported zero heap UAF in 19 hours — because the free happens inside Taichi's device allocator, not via system free()
  • ASan + PYTHONMALLOC=malloc: ASan's own allocator metadata was corrupted by a wild write (CHECK failed: rz_size=0x0)
  • Crash #12: faulthandler showed "Garbage-collecting" during taichi.kernel_impl.launch_kernel, which is direct evidence of GC during kernel launch
  • Valgrind (serializes all threads): no crash in 10+ hours — consistent with a threading race
  • More envs = faster crash (2048 envs: ~5 min, 1024 envs: ~30-90 min)
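
For anyone trying to confirm the same overlap in their own workload, one possible cross-check is a gc.callbacks hook that records collections starting while a launch is in progress. The sketch below uses only the standard library; LaunchGcMonitor is a hypothetical helper, and wrapping the real launch path in the with block is left to the reader. It only sees cyclic collections, not finalizers triggered by plain reference counting.

```python
import gc
import threading


class LaunchGcMonitor:
    """Hypothetical helper: records GC passes that start during a launch."""

    def __init__(self):
        self._depth = 0                # launches currently in progress
        self._lock = threading.Lock()
        self.events = []               # generation of each overlapping GC pass

    def __enter__(self):
        with self._lock:
            self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._lock:
            self._depth -= 1
        return False

    def _on_gc(self, phase, info):
        # Deliberately lock-free: this may run while the lock is held.
        if phase == "start" and self._depth > 0:
            self.events.append(info["generation"])

    def install(self):
        gc.callbacks.append(self._on_gc)

    def uninstall(self):
        gc.callbacks.remove(self._on_gc)


monitor = LaunchGcMonitor()
monitor.install()
with monitor:
    gc.collect()   # stands in for a collection triggered mid-launch
gc.collect()       # outside a launch: not recorded
monitor.uninstall()
print("GC passes overlapping a launch:", len(monitor.events))
```

The counter is shared across threads on purpose, so a collection started by one thread while another thread is inside a launch is still recorded.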

Genesis issue Genesis-Embodied-AI/Genesis#492 appears to be the same bug (segfault during scene.step() after 1000+ iterations, closed without root cause).

Workaround

Add the ArgPack Python object to the tmps GC-prevention list in kernel_impl.py, following the existing pattern used for numpy arrays (line 731: tmps.append(tmp) # Purpose: DO NOT GC |tmp|!):

# kernel_impl.py, inside recursive_set_args(), after set_arg_argpack():
launch_ctx.set_arg_argpack(indices, v._ArgPack__argpack)
tmps.append(v)  # prevent GC of ArgPack while C++ holds raw pointer
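
A small self-contained model of why the tmps list is sufficient, assuming the list itself stays alive until the launch returns. Wrapper and launch_with_tmps are hypothetical names; a weakref stands in for "is the ArgPack wrapper still alive?".

```python
import gc
import weakref


class Wrapper:
    """Hypothetical stand-in for the Python ArgPack wrapper."""


def launch_with_tmps(make_arg):
    tmps = []                    # keep-alive list, as in kernel_impl.py
    arg = make_arg()
    probe = weakref.ref(arg)     # observes liveness without owning the object
    tmps.append(arg)             # the workaround: hold a strong reference
    del arg                      # the caller's own reference disappears
    gc.collect()
    alive_during_launch = probe() is not None   # the "launch" happens here
    tmps.clear()                 # launch finished, release the keep-alives
    gc.collect()
    alive_after_launch = probe() is not None
    return alive_during_launch, alive_after_launch


during, after = launch_with_tmps(Wrapper)
print("alive during launch:", during, "| alive after launch:", after)
```

On CPython the wrapper is expected to survive until tmps is cleared and to be finalized afterwards, which is the window the workaround needs to cover.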

Suggested Fix

Option A (minimal): Apply the Python-side workaround above.

Option B (proper): Change argpack_ptrs to store DeviceAllocation by value instead of a raw object pointer, matching the safe pattern used by set_arg_ndarray(). This requires updating all kernel launchers. The nested argpack write-back (set_arg_nested_argpack_ptr) would need refactoring.

Option C (defense-in-depth): Implement used_in_kernel() in LlvmProgramImpl to actually track in-flight allocations, matching the existing GFX backend implementation.
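
To illustrate the idea behind Option C, here is a minimal sketch of in-flight tracking with deferred deletion. It is written in Python for brevity and is not a description of how the GFX backend implements used_in_kernel(); InFlightTracker, mark_pending and mark_done are hypothetical names. For this particular race the pending window would presumably have to open when the argument is bound in set_arg_argpack(), not when the launch starts.

```python
import threading


class InFlightTracker:
    """Hypothetical model: defer deletion of allocations that are still in use."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}       # alloc_id -> number of unfinished uses
        self._deferred = set()   # deletions requested while still pending
        self.freed = []          # order in which allocations were released

    def mark_pending(self, alloc_ids):
        with self._lock:
            for alloc_id in alloc_ids:
                self._pending[alloc_id] = self._pending.get(alloc_id, 0) + 1

    def used_in_kernel(self, alloc_id):
        with self._lock:
            return self._pending.get(alloc_id, 0) > 0

    def delete(self, alloc_id):
        with self._lock:
            if self._pending.get(alloc_id, 0) > 0:
                self._deferred.add(alloc_id)    # free later, in mark_done()
            else:
                self.freed.append(alloc_id)     # nothing pending: free now

    def mark_done(self, alloc_ids):
        with self._lock:
            for alloc_id in alloc_ids:
                self._pending[alloc_id] -= 1
                if self._pending[alloc_id] == 0:
                    del self._pending[alloc_id]
                    if alloc_id in self._deferred:
                        self._deferred.discard(alloc_id)
                        self.freed.append(alloc_id)


tracker = InFlightTracker()
tracker.mark_pending([7])               # argument bound, launch not finished
tracker.delete(7)                       # finalizer fires in the middle
freed_during = list(tracker.freed)
tracker.mark_done([7])                  # launch completed
freed_after = list(tracker.freed)
print("freed during launch:", freed_during, "| freed after:", freed_after)
```

The point of the counter (rather than a boolean) is that the same allocation may be bound to several launches at once, and it should only be released when the last one completes.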

Additional note: array_ptrs has the same pattern

set_arg_ndarray() stores (void *)&ndarray_alloc_ (address of a member inside the Ndarray object) in array_ptrs. If the Ndarray is GC'd, the member address becomes invalid. The CUDA kernel launcher dereferences this at cuda/kernel_launcher.cpp:110-112:

DeviceAllocation *ptr = static_cast<DeviceAllocation *>(data_ptr);
device_ptrs[data_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);

This is the same vulnerability, although Ndarrays tend to be long-lived, so it is less likely to trigger in practice.

Environment

  • Taichi 1.7.4 (commit b4b956f)
  • Python 3.12, CUDA 12.8, Linux 6.17
  • NVIDIA RTX 5090

Affected code (introduced in July 2023)
