
HSTU KV Cache Manager V2 #251

Merged
shijieliu merged 14 commits into NVIDIA:main from geoffreyQiu:kvcache
Feb 13, 2026

Conversation

@geoffreyQiu
Collaborator

Implement HSTU KVCacheManager V2:

  • Asynchronous kvcache manager operations
  • Optimized onloading and offloading

@greptile-apps

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

This PR implements HSTU KVCacheManager V2 with asynchronous operations, moving Python-based KV cache management to optimized C++ implementations with compression support.

Key Changes

  • New C++ Implementation: Added kvcache_manager_impl.cpp/h with GPUKVCacheMangerImpl, HostKVStorageImpl, KVCompressor, and synchronization handles
  • Async Operations: Implemented parallel onload and metadata preparation using Python ThreadPoolExecutor with per-layer synchronization
  • Compression Support: Integrated nvcomp library for optional KV cache compression during offload operations
  • Performance Optimizations: Moved num_sms calculation out of kernel calls, added multi-threaded memcpy, and implemented all-layers gather kernel
  • API Changes: Replaced the Python GPUKVCacheManager and HostKVStorageManager with C++ implementations exposed via pybind11 with GIL release (see the binding sketch after this list)
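
As a rough illustration of the binding approach mentioned in the API Changes item above, the sketch below shows how a long-running C++ method can be exposed to Python with the GIL released via pybind11's call_guard. The class and method names here are placeholders, not the PR's actual signatures.

```cpp
// Hypothetical sketch: exposing a blocking C++ KV-cache call to Python with the
// GIL released so other Python threads (e.g. the metadata executor) keep running.
// Names are illustrative placeholders, not the PR's actual API.
#include <pybind11/pybind11.h>

namespace py = pybind11;

class GPUKVCacheManagerSketch {
 public:
  // Pretend this blocks on CUDA copies / events for a while.
  void onload_kvcache(int layer_idx) { /* ... long-running work ... */ }
};

PYBIND11_MODULE(kvcache_sketch, m) {
  py::class_<GPUKVCacheManagerSketch>(m, "GPUKVCacheManagerSketch")
      .def(py::init<>())
      // call_guard releases the GIL for the duration of the C++ call and
      // re-acquires it before control returns to Python.
      .def("onload_kvcache", &GPUKVCacheManagerSketch::onload_kvcache,
           py::call_guard<py::gil_scoped_release>());
}
```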

Issues Found

  • Memory Leak: host_kv_ptr, allocated with aligned_alloc at line 945 in kvcache_manager_impl.cpp, is never freed, causing a memory leak on each offload operation; one possible fix is sketched below
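
The review suggests freeing host_kv_ptr at the end of the offload iteration. Below is a minimal RAII sketch of one way to do that, assuming the buffer is only needed within a single offload iteration; if the pointer is handed off elsewhere, ownership has to follow it. This is an illustration, not the PR's code.

```cpp
// Hypothetical sketch: wrap the aligned host buffer in a unique_ptr with a
// free() deleter so it is released at the end of each offload iteration,
// instead of calling aligned_alloc and never freeing the result.
#include <cstdlib>
#include <memory>

struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedBuffer = std::unique_ptr<void, FreeDeleter>;

AlignedBuffer make_aligned_buffer(std::size_t alignment, std::size_t size) {
  // aligned_alloc requires size to be a multiple of alignment.
  return AlignedBuffer(std::aligned_alloc(alignment, size));
}

// Inside the offload loop (sketch):
//   AlignedBuffer host_kv_ptr = make_aligned_buffer(/*alignment=*/4096, padded_size);
//   ... gather / compress into host_kv_ptr.get() ...
//   ... append the bytes to host storage (copy or transfer ownership) ...
//   // buffer is freed automatically when host_kv_ptr goes out of scope
```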

Architecture

The new design uses a producer-consumer pattern with three main threads: metadata preparation, onload worker, and offload worker. Each layer synchronizes independently using CUDA events and condition variables, allowing inference to proceed as soon as each layer's cache is ready.
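
A minimal sketch of what such a per-layer handle could look like, assuming one CUDA event plus one condition variable per layer; the method names complete_host and wait_host come from the sequence diagram below, but the internals here are illustrative, not the PR's actual data structures.

```cpp
// Hypothetical sketch of a per-layer synchronization handle: the onload worker
// records a CUDA event and flips a flag once a layer's KV data is on the GPU;
// the inference thread blocks in wait_host() until that happens.
#include <condition_variable>
#include <mutex>
#include <cuda_runtime.h>

struct LayerSyncHandle {
  std::mutex mtx;
  std::condition_variable cv;
  bool host_ready = false;   // set by the onload worker
  cudaEvent_t copy_done{};   // recorded on the onload copy stream

  LayerSyncHandle() { cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming); }
  ~LayerSyncHandle() { cudaEventDestroy(copy_done); }

  // Called by the onload worker after issuing this layer's H2D copy.
  void complete_host(cudaStream_t copy_stream) {
    cudaEventRecord(copy_done, copy_stream);
    {
      std::lock_guard<std::mutex> lk(mtx);
      host_ready = true;
    }
    cv.notify_all();
  }

  // Called by the inference thread before running this layer's attention.
  void wait_host(cudaStream_t compute_stream) {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [this] { return host_ready; });
    // Make the compute stream wait on the copy without blocking the CPU.
    cudaStreamWaitEvent(compute_stream, copy_done, 0);
  }
};
```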

Confidence Score: 3/5

  • This PR is moderately safe to merge but requires fixing the memory leak before deployment
  • The score of 3 reflects the memory leak described above: host_kv_ptr is allocated with aligned_alloc in the offload loop but never freed, so host memory grows with every offload operation. All issues previously reported in review threads have been addressed, and the async architecture is sound and well synchronized, but the leak is a blocking issue.
  • Pay close attention to examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.cpp - the memory leak at line 945 must be fixed by adding free(host_kv_ptr) before line 1098

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.cpp | New C++ implementation of the async KV cache manager with compression, threading, and memory management. Memory leak found with aligned_alloc. |
| examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.h | Header defining data structures for the async KV cache manager, including the compressor, synchronization handles, and storage implementations. |
| examples/commons/ops/cuda_ops/csrc/paged_kvcache_ops_cuda.cpp | Updated Python bindings with GIL release for async operations and exposed the new C++ KV cache management classes. |
| examples/commons/ops/cuda_ops/csrc/paged_kvcache_ops_kernel.cu | Optimized CUDA kernels by passing num_sms as a parameter and added a new all-layers gather kernel for batch operations. |
| examples/hstu/modules/async_kvcache_manager.py | New Python async KV cache manager using ThreadPoolExecutors for parallel onload and metadata preparation. |
| examples/hstu/modules/paged_hstu_infer_layer.py | Updated to use the new async KV cache manager with wait_host synchronization for per-layer onload. |
| examples/hstu/modules/inference_dense_module.py | Refactored to use the new async KV cache manager API with separate prepare and wait phases for better parallelism. |

Sequence Diagram

sequenceDiagram
    participant Client as Python Client
    participant AsyncMgr as AsyncHSTUKVCacheManager
    participant MetadataThread as Metadata Thread
    participant OnloadThread as Onload Thread
    participant GPUMgr as GPUKVCacheMangerImpl (C++)
    participant OffloadThread as Offload Thread
    participant HostStorage as HostKVStorageImpl (C++)
    
    Client->>AsyncMgr: prepare_kvcache_async()
    AsyncMgr->>MetadataThread: submit(prepare_kvcache)
    AsyncMgr->>OnloadThread: submit(onload_kvcache)
    AsyncMgr->>Client: return futures
    
    par Parallel Execution
        MetadataThread->>GPUMgr: prepare_kvcache()
        GPUMgr->>GPUMgr: alloc pages
        GPUMgr->>GPUMgr: create metadata
        MetadataThread->>AsyncMgr: complete
    and
        OnloadThread->>HostStorage: get_kvdata()
        OnloadThread->>GPUMgr: decompress + H2D transfer
        OnloadThread->>GPUMgr: complete_host(layer_idx)
        OnloadThread->>AsyncMgr: complete
    end
    
    Client->>AsyncMgr: prepare_kvcache_wait()
    AsyncMgr->>MetadataThread: metadata_fut.result()
    AsyncMgr->>Client: return KVCacheMetadata
    
    Client->>Client: forward through layers
    loop For each layer
        Client->>GPUMgr: wait_host(layer_idx)
        Note over GPUMgr: Wait for onload completion
        Client->>Client: process layer with KV cache
        Client->>GPUMgr: mark_ready(layer_idx)
    end
    
    Client->>AsyncMgr: offload_kvcache()
    AsyncMgr->>GPUMgr: offload_kvcache()
    GPUMgr->>OffloadThread: queue offload task
    
    par Async Offload
        loop For each layer
            OffloadThread->>OffloadThread: wait for mark_ready()
            OffloadThread->>GPUMgr: gather pages from GPU
            OffloadThread->>GPUMgr: compress (if enabled)
            OffloadThread->>HostStorage: append_kvdata()
        end
    end
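
For the offload branch of the diagram, a simplified consumer loop might look like the sketch below. Every name here is an illustrative placeholder, and the nvcomp-based compression is abstracted behind a callback because its exact usage in the PR is not shown in this summary.

```cpp
// Hypothetical sketch of the offload worker: for each layer, wait until the
// inference thread has called mark_ready(), gather that layer's pages to host
// memory, optionally compress, and append the result to host storage.
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<uint8_t>;

struct OffloadDeps {                                      // illustrative placeholders
  std::function<void(int)> wait_layer_ready;              // blocks until mark_ready(layer)
  std::function<Bytes(int)> gather_layer_to_host;         // D2H gather of one layer's pages
  std::function<Bytes(const Bytes&)> compress;            // empty if compression disabled
  std::function<void(int, const Bytes&)> append_kvdata;   // host-storage append
};

void offload_worker(int num_layers, const OffloadDeps& deps) {
  for (int layer = 0; layer < num_layers; ++layer) {
    deps.wait_layer_ready(layer);                         // producer: inference thread
    Bytes raw = deps.gather_layer_to_host(layer);
    const Bytes& out = deps.compress ? deps.compress(raw) : raw;
    deps.append_kvdata(layer, out);
  }
}
```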

greptile-apps bot left a comment

14 files reviewed, 6 comments

geoffreyQiu force-pushed the kvcache branch 2 times, most recently from 17b117f to ab48756 on February 5, 2026 16:49
@geoffreyQiu
Collaborator Author

@greptileai

greptile-apps bot left a comment

24 files reviewed, 7 comments

geoffreyQiu marked this pull request as ready for review on February 12, 2026 05:11
geoffreyQiu changed the title from "[WIP] HSTU KV Cache Manager V2" to "HSTU KV Cache Manager V2" on Feb 12, 2026
greptile-apps bot left a comment

29 files reviewed, 1 comment


@geoffreyQiu
Collaborator Author

CI passed.

shijieliu merged commit fd2c33e into NVIDIA:main on Feb 13, 2026