
HSTU KV Cache Manager V2 #251

Merged
shijieliu merged 14 commits into NVIDIA:main from geoffreyQiu:kvcache
Feb 13, 2026

Conversation

@geoffreyQiu
Collaborator

Implement HSTU KVCacheManager V2:

  • Asynchronous kvcache manager operations
  • Optimized onloading and offloading

@greptile-apps

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

This PR implements HSTU KVCacheManager V2 with asynchronous operations, moving Python-based KV cache management to optimized C++ implementations with compression support.

Key Changes

  • New C++ Implementation: Added kvcache_manager_impl.cpp/h with GPUKVCacheMangerImpl, HostKVStorageImpl, KVCompressor, and synchronization handles
  • Async Operations: Implemented parallel onload and metadata preparation using Python ThreadPoolExecutor with per-layer synchronization
  • Compression Support: Integrated nvcomp library for optional KV cache compression during offload operations
  • Performance Optimizations: Moved num_sms calculation out of kernel calls, added multi-threaded memcpy, and implemented all-layers gather kernel
  • API Changes: Replaced the Python GPUKVCacheManager and HostKVStorageManager with C++ implementations exposed via pybind11 with GIL release (see the binding sketch after this list)
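
As a rough illustration of the binding approach mentioned in the API Changes item above, the sketch below shows how a long-running C++ method can be exposed to Python with the GIL released via pybind11's call_guard. The class and method names here are placeholders, not the PR's actual signatures.

```cpp
// Hypothetical sketch: exposing a blocking C++ KV-cache call to Python with the
// GIL released so other Python threads (e.g. the metadata executor) keep running.
// Names are illustrative placeholders, not the PR's actual API.
#include <pybind11/pybind11.h>

namespace py = pybind11;

class GPUKVCacheManagerSketch {
 public:
  // Pretend this blocks on CUDA copies / events for a while.
  void onload_kvcache(int layer_idx) { /* ... long-running work ... */ }
};

PYBIND11_MODULE(kvcache_sketch, m) {
  py::class_<GPUKVCacheManagerSketch>(m, "GPUKVCacheManagerSketch")
      .def(py::init<>())
      // call_guard releases the GIL for the duration of the C++ call and
      // re-acquires it before control returns to Python.
      .def("onload_kvcache", &GPUKVCacheManagerSketch::onload_kvcache,
           py::call_guard<py::gil_scoped_release>());
}
```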

Issues Found

  • Memory Leak: host_kv_ptr, allocated with aligned_alloc at line 945 in kvcache_manager_impl.cpp, is never freed, causing a memory leak on each offload operation; one possible fix is sketched below
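
The review suggests freeing host_kv_ptr at the end of the offload iteration. Below is a minimal RAII sketch of one way to do that, assuming the buffer is only needed within a single offload iteration; if the pointer is handed off elsewhere, ownership has to follow it. This is an illustration, not the PR's code.

```cpp
// Hypothetical sketch: wrap the aligned host buffer in a unique_ptr with a
// free() deleter so it is released at the end of each offload iteration,
// instead of calling aligned_alloc and never freeing the result.
#include <cstdlib>
#include <memory>

struct FreeDeleter {
  void operator()(void* p) const noexcept { std::free(p); }
};

using AlignedBuffer = std::unique_ptr<void, FreeDeleter>;

AlignedBuffer make_aligned_buffer(std::size_t alignment, std::size_t size) {
  // aligned_alloc requires size to be a multiple of alignment.
  return AlignedBuffer(std::aligned_alloc(alignment, size));
}

// Inside the offload loop (sketch):
//   AlignedBuffer host_kv_ptr = make_aligned_buffer(/*alignment=*/4096, padded_size);
//   ... gather / compress into host_kv_ptr.get() ...
//   ... append the bytes to host storage (copy or transfer ownership) ...
//   // buffer is freed automatically when host_kv_ptr goes out of scope
```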

Architecture

The new design uses a producer-consumer pattern with three main threads: metadata preparation, onload worker, and offload worker. Each layer synchronizes independently using CUDA events and condition variables, allowing inference to proceed as soon as each layer's cache is ready.
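
A minimal sketch of what such a per-layer handle could look like, assuming one CUDA event plus one condition variable per layer; the method names complete_host and wait_host come from the sequence diagram below, but the internals here are illustrative, not the PR's actual data structures.

```cpp
// Hypothetical sketch of a per-layer synchronization handle: the onload worker
// records a CUDA event and flips a flag once a layer's KV data is on the GPU;
// the inference thread blocks in wait_host() until that happens.
#include <condition_variable>
#include <mutex>
#include <cuda_runtime.h>

struct LayerSyncHandle {
  std::mutex mtx;
  std::condition_variable cv;
  bool host_ready = false;   // set by the onload worker
  cudaEvent_t copy_done{};   // recorded on the onload copy stream

  LayerSyncHandle() { cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming); }
  ~LayerSyncHandle() { cudaEventDestroy(copy_done); }

  // Called by the onload worker after issuing this layer's H2D copy.
  void complete_host(cudaStream_t copy_stream) {
    cudaEventRecord(copy_done, copy_stream);
    {
      std::lock_guard<std::mutex> lk(mtx);
      host_ready = true;
    }
    cv.notify_all();
  }

  // Called by the inference thread before running this layer's attention.
  void wait_host(cudaStream_t compute_stream) {
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [this] { return host_ready; });
    // Make the compute stream wait on the copy without blocking the CPU.
    cudaStreamWaitEvent(compute_stream, copy_done, 0);
  }
};
```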

Confidence Score: 3/5

  • This PR is moderately safe to merge but requires fixing the memory leak before deployment
  • The score of 3 reflects the memory leak described above: host_kv_ptr is allocated with aligned_alloc in the offload loop but never freed, so host memory grows with every offload operation. All issues previously reported in review threads have been addressed, and the async architecture is sound and well synchronized, but the leak is a blocking issue.
  • Pay close attention to examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.cpp - the memory leak at line 945 must be fixed by adding free(host_kv_ptr) before line 1098

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.cpp | New C++ implementation of the async KV cache manager with compression, threading, and memory management. Memory leak found with aligned_alloc. |
| examples/commons/ops/cuda_ops/csrc/kvcache_manager_impl.h | Header defining data structures for the async KV cache manager, including the compressor, synchronization handles, and storage implementations. |
| examples/commons/ops/cuda_ops/csrc/paged_kvcache_ops_cuda.cpp | Updated Python bindings with GIL release for async operations and exposed the new C++ KV cache management classes. |
| examples/commons/ops/cuda_ops/csrc/paged_kvcache_ops_kernel.cu | Optimized CUDA kernels by passing num_sms as a parameter and added a new all-layers gather kernel for batch operations. |
| examples/hstu/modules/async_kvcache_manager.py | New Python async KV cache manager using ThreadPoolExecutors for parallel onload and metadata preparation. |
| examples/hstu/modules/paged_hstu_infer_layer.py | Updated to use the new async KV cache manager with wait_host synchronization for per-layer onload. |
| examples/hstu/modules/inference_dense_module.py | Refactored to use the new async KV cache manager API with separate prepare and wait phases for better parallelism. |

Sequence Diagram

sequenceDiagram
    participant Client as Python Client
    participant AsyncMgr as AsyncHSTUKVCacheManager
    participant MetadataThread as Metadata Thread
    participant OnloadThread as Onload Thread
    participant GPUMgr as GPUKVCacheMangerImpl (C++)
    participant OffloadThread as Offload Thread
    participant HostStorage as HostKVStorageImpl (C++)
    
    Client->>AsyncMgr: prepare_kvcache_async()
    AsyncMgr->>MetadataThread: submit(prepare_kvcache)
    AsyncMgr->>OnloadThread: submit(onload_kvcache)
    AsyncMgr->>Client: return futures
    
    par Parallel Execution
        MetadataThread->>GPUMgr: prepare_kvcache()
        GPUMgr->>GPUMgr: alloc pages
        GPUMgr->>GPUMgr: create metadata
        MetadataThread->>AsyncMgr: complete
    and
        OnloadThread->>HostStorage: get_kvdata()
        OnloadThread->>GPUMgr: decompress + H2D transfer
        OnloadThread->>GPUMgr: complete_host(layer_idx)
        OnloadThread->>AsyncMgr: complete
    end
    
    Client->>AsyncMgr: prepare_kvcache_wait()
    AsyncMgr->>MetadataThread: metadata_fut.result()
    AsyncMgr->>Client: return KVCacheMetadata
    
    Client->>Client: forward through layers
    loop For each layer
        Client->>GPUMgr: wait_host(layer_idx)
        Note over GPUMgr: Wait for onload completion
        Client->>Client: process layer with KV cache
        Client->>GPUMgr: mark_ready(layer_idx)
    end
    
    Client->>AsyncMgr: offload_kvcache()
    AsyncMgr->>GPUMgr: offload_kvcache()
    GPUMgr->>OffloadThread: queue offload task
    
    par Async Offload
        loop For each layer
            OffloadThread->>OffloadThread: wait for mark_ready()
            OffloadThread->>GPUMgr: gather pages from GPU
            OffloadThread->>GPUMgr: compress (if enabled)
            OffloadThread->>HostStorage: append_kvdata()
        end
    end
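
For the offload branch of the diagram, a simplified consumer loop might look like the sketch below. Every name here is an illustrative placeholder, and the nvcomp-based compression is abstracted behind a callback because its exact usage in the PR is not shown in this summary.

```cpp
// Hypothetical sketch of the offload worker: for each layer, wait until the
// inference thread has called mark_ready(), gather that layer's pages to host
// memory, optionally compress, and append the result to host storage.
#include <cstdint>
#include <functional>
#include <vector>

using Bytes = std::vector<uint8_t>;

struct OffloadDeps {                                      // illustrative placeholders
  std::function<void(int)> wait_layer_ready;              // blocks until mark_ready(layer)
  std::function<Bytes(int)> gather_layer_to_host;         // D2H gather of one layer's pages
  std::function<Bytes(const Bytes&)> compress;            // empty if compression disabled
  std::function<void(int, const Bytes&)> append_kvdata;   // host-storage append
};

void offload_worker(int num_layers, const OffloadDeps& deps) {
  for (int layer = 0; layer < num_layers; ++layer) {
    deps.wait_layer_ready(layer);                         // producer: inference thread
    Bytes raw = deps.gather_layer_to_host(layer);
    const Bytes& out = deps.compress ? deps.compress(raw) : raw;
    deps.append_kvdata(layer, out);
  }
}
```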

greptile-apps bot left a comment

14 files reviewed, 6 comments

geoffreyQiu force-pushed the kvcache branch 2 times, most recently from 17b117f to ab48756 on February 5, 2026 16:49
@geoffreyQiu
Collaborator Author

@greptileai

greptile-apps bot left a comment

24 files reviewed, 7 comments

geoffreyQiu marked this pull request as ready for review on February 12, 2026 05:11
geoffreyQiu changed the title from "[WIP] HSTU KV Cache Manager V2" to "HSTU KV Cache Manager V2" on Feb 12, 2026
greptile-apps bot left a comment

29 files reviewed, 1 comment


@geoffreyQiu
Collaborator Author

CI passed.

shijieliu merged commit fd2c33e into NVIDIA:main on Feb 13, 2026