Commit 63d0cba

Update documents and benchmark results.

1 parent 922c159

6 files changed: +19 −719 lines


examples/hstu/inference/README.md (9 additions, 25 deletions)
```diff
@@ -2,9 +2,9 @@
 
 ## Key Features
 
-1. Cache for KV data
+1. Asynchronous Cache Manager for KV data
 
-We use GPU memory and host storage for KV data cache, as in `GpuKVCacheManager` and `HostKVStorageManager`. This can help to reduce the recomputation of KV data.
+We use GPU memory and host storage for the KV data cache, as in `AsyncKVCacheManager`. This helps reduce recomputation of KV data. All KV-cache-related operations are implemented asynchronously so that their overhead is hidden behind the inference computation.
 
 The GPU KV cache is organized as a paged KV-data table, and supports KV data adding/appending, lookup and eviction. When appending new data to the GPU cache, we will evict data from the oldest users according to the LRU policy if there is no empty page. The HSTU attention kernel also accepts KV data from a paged table.
 
```
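The paged layout and LRU eviction policy described above can be illustrated with a minimal sketch. This is not the repository's `AsyncKVCacheManager`; the class name, page size, and bookkeeping structures below are assumptions made purely for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 32  # tokens per KV page (assumed value)


class PagedKVCacheSketch:
    """Toy model of a paged GPU KV cache with per-user LRU eviction."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.user_pages = OrderedDict()  # user_id -> page ids, least recently used user first
        self.cached_len = {}             # user_id -> number of cached tokens

    def append(self, user_id: int, num_new_tokens: int) -> list:
        """Reserve pages for new tokens, evicting the oldest users if the pool is empty."""
        pages = self.user_pages.setdefault(user_id, [])
        total = self.cached_len.get(user_id, 0) + num_new_tokens
        pages_needed = -(-total // PAGE_SIZE) - len(pages)  # ceil division minus pages already held
        for _ in range(pages_needed):
            if not self.free_pages:
                self._evict_lru(exclude=user_id)
            pages.append(self.free_pages.pop())
        self.cached_len[user_id] = total
        self.user_pages.move_to_end(user_id)  # mark this user as most recently used
        return pages  # the attention kernel would read K/V from these pages

    def _evict_lru(self, exclude: int) -> None:
        """Return all pages of the least recently used user (other than `exclude`) to the pool."""
        victim = next((uid for uid in self.user_pages if uid != exclude), None)
        if victim is None:
            raise RuntimeError("cache too small for a single user's history")
        self.free_pages.extend(self.user_pages.pop(victim))
        self.cached_len.pop(victim)
```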

```diff
@@ -33,33 +33,16 @@ The dense module is served as one instance per GPU, and the KV cache is not supp
 ### KVCache Usage
 
 1. KVCache Manager supports the following operations:
-* `get_user_kvdata_info`: to get the current cached length and the index of the first cached token in the history sequence
-* `prepare_kv_cache`: to allocate the required cache pages. The input history sequence need to be
-* `paged_kvcache_ops.append_kvcache`: the CUDA kernel to copy the `K, V` values into the allocated cache pages
-* `offload_kv_cache`: to offload the KV data from GPU KVCache to Host KV storage.
+* `prepare_kvcache_async`: to trigger, in the background, the allocation of the required KV cache pages, the kvcache_metadata computation, and the onloading of KV data from Host KV storage to the GPU KVCache.
+* `prepare_kvcache_wait`: to wait for the new KV cache page allocation and kvcache_metadata computation to finish.
+* `paged_kvcache_ops.append_kvcache`: the CUDA kernel that copies the `K, V` values into the allocated cache pages.
+* `offload_kvcache`: to trigger offloading of the KV data from the GPU KVCache to Host KV storage in the background.
 * `evict_kv_cache`: to evict all the KV data in the KVCache Manager.
 
-2. Currently, the KVCache manager need to be access from a single thread.
+2. Currently, the KVCache manager must be accessed from a single inference stream; there is no multi-stream support.
 
-3. For different requests, the calls to `get_user_kvdata_info` and `prepare_kv_cache` need to be in order and cannot be interleaved, since the allocation in `prepare_kv_cache` may evict the cached data of other users, which changes the user kvdata_info.
+3. The KVCache manager accepts the full user history sequence as input. The removal of already-cached tokens from the sequence is handled within the inference forward pass.
 
-4. The KVCache manager does not support a discontinuous user history sequence as input from the same user. The overlapping tokens need to be removed before sending the sequence to the inference model. Doing the overlap removal in the upstream stage should be more performant than in the inference model.
-
-```
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 10, length: 10}
-# Acceptable input
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 20, length: 10}
-^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has missing tokens from 10 to 19 (both inclusive).
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 5, length: 20}
-^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has overlapping tokens from 5 to 9 (both inclusive).
-```
 
 ## How to Setup
 
```
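Taken together, the new operations suggest the following call order inside a forward pass. This is a hypothetical sketch rather than code from this repository: the argument lists, the return values, and the idea that `prepare_kvcache_wait` hands back the computed kvcache_metadata are all assumptions.

```python
def forward_with_kvcache(kvcache_manager, model, batch, user_ids, history_lengths):
    """Hypothetical ordering of the KVCache Manager calls around one inference step."""
    # 1. Kick off page allocation, kvcache_metadata computation, and host->GPU onloading
    #    in the background, so they overlap with work that does not need the KV cache.
    kvcache_manager.prepare_kvcache_async(user_ids, history_lengths)

    # ... embedding lookup / input preprocessing can run here ...

    # 2. Block until the pages and kvcache_metadata are ready, just before attention needs them.
    kvcache_metadata = kvcache_manager.prepare_kvcache_wait()

    # 3. Inside the HSTU layers, newly computed K/V values are appended into the allocated
    #    pages (paged_kvcache_ops.append_kvcache), and attention reads from the paged table.
    logits = model(batch, kvcache_metadata)

    # 4. Push the updated KV data back to Host KV storage in the background; a later request
    #    for the same users can onload it again via step 1.
    kvcache_manager.offload_kvcache(user_ids)
    return logits
```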

```diff
@@ -71,6 +54,7 @@ Turn on option `INFERENCEBUILD=1` to skip Megatron installation, which is not re
 ~$ cd ${WORKING_DIR}
 ~$ git clone --recursive -b ${TEST_BRANCH} ${TEST_REPO} recsys-examples && cd recsys-examples
 ~$ docker build \
+    --platform linux/amd64 \
     --build-arg INFERENCEBUILD=1 \
     -t recsys-examples:inference \
     -f docker/Dockerfile .
```

examples/hstu/inference/benchmark/README.md (3 additions, 2 deletions)
```diff
@@ -36,7 +36,7 @@ Here we benchmarked with a synthetic input dataset:
 * Each input request has 256 item candidates for ranking.
 * Generate data for 1, 2, 4 and 8 users to benchmark with different batch sizes.
 
-We can achieve **1.4x ~ 2.7x** performance speedup for inference (with batch size ranging from 1 to 8), after utilizing the KV cache and CUDA graph optimization.
+We can achieve a **1.3x ~ 2.6x** performance speedup for inference (with batch sizes ranging from 1 to 8) after utilizing the KV cache and CUDA graph optimization.
 
 Performance results:
 
```
```diff
@@ -46,7 +46,8 @@ Note:
 
 1. The baseline performance is based on our implementation without KVCache support and CUDA Graph optimization.
 2. The end-to-end performance includes the embedding part, which utilizes both the native `EmbeddingCollection` from TorchRec and `DynamicEmbedding`.
-3. The number of input sequences from the synthetic dataset increases according to the batch size.
+3. The number of input sequences from the synthetic dataset increases according to the batch size. All test cases have 16 batches in total.
+4. In the test cases with KVCache enabled, the KV cache preparation and onloading/offloading fall within the time measurement, but they are hidden behind computation as asynchronous operations.
 
 ### 2. HSTU block performance
 
```
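Note 4 can be made concrete with a small, self-contained PyTorch sketch that is not taken from the benchmark code: the host-to-device onload runs on a side stream between the two timing events, so it lies inside the measured window but is overlapped with compute. Tensor names and shapes here are arbitrary.

```python
import torch


def timed_step(x_gpu, host_kv, kv_gpu):
    """Overlap a host->GPU KV onload with compute; both stay inside the timed region."""
    copy_stream = torch.cuda.Stream()
    start, end = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

    start.record()
    # Onload KV data from pinned host memory on a side stream, in the background.
    with torch.cuda.stream(copy_stream):
        kv_gpu.copy_(host_kv, non_blocking=True)
    copy_done = torch.cuda.Event()
    copy_done.record(copy_stream)

    # Work that does not need the KV data runs concurrently on the default stream.
    y = x_gpu @ x_gpu.T

    # Only the part that consumes the KV data waits for the copy to finish.
    torch.cuda.current_stream().wait_event(copy_done)
    y = y + kv_gpu.sum()
    end.record()

    torch.cuda.synchronize()  # events must have completed before reading elapsed_time
    return y, start.elapsed_time(end)  # milliseconds; the copy is hidden if compute takes longer


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    host_kv = torch.randn(1024, 1024, pin_memory=True)
    kv = torch.empty(1024, 1024, device="cuda")
    _, ms = timed_step(x, host_kv, kv)
    print(f"step time: {ms:.2f} ms")
```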

examples/hstu/inference/benchmark/inference_benchmark.py (7 additions, 7 deletions)
```diff
@@ -111,7 +111,7 @@ def run_ranking_gr_inference(disable_kvcache: bool):
         hstu_config=hstu_config,
         kvcache_config=kv_cache_config,
         task_config=task_config,
-        use_cudagraph=False,  # True,
+        use_cudagraph=True,
         cudagraph_configs=hstu_cudagraph_configs,
     )
     model_predict.bfloat16()
```
```diff
@@ -122,8 +122,8 @@ def run_ranking_gr_inference(disable_kvcache: bool):
         item_feature_name=item_fea_name,
         contextual_feature_names=[],
         action_feature_name=action_fea_name,
-        max_num_users=8,
-        max_batch_size=8,  # test batch size
+        max_num_users=1,
+        max_batch_size=1,  # test batch size
         max_history_length=max_num_history,
         max_num_candidates=max_num_candidates,
         max_incremental_seqlen=max_incremental_seqlen,
```
```diff
@@ -133,18 +133,18 @@ def run_ranking_gr_inference(disable_kvcache: bool):
 
     dataloader = get_data_loader(dataset)
 
-    num_warmup = 16
-    for idx in range(num_warmup):
-        pass
+    # Warm up
+    for batch, user_ids, total_history_lengths in dataloader:
+        model_predict.forward_nokvcache(batch)
 
+    dataloader = get_data_loader(dataset)
     ts_start, ts_end = [torch.cuda.Event(enable_timing=True) for _ in range(2)]
     ts_start.record()
     for batch, user_ids, total_history_lengths in dataloader:
         if not disable_kvcache:
             model_predict.forward(batch, user_ids, total_history_lengths)
         else:
             model_predict.forward_nokvcache(batch)
-
     ts_end.record()
     predict_time = ts_start.elapsed_time(ts_end)
     print("Total time(ms):", predict_time)
```