Commit 63d0cba

Update documents and benchmark results.

1 parent 922c159

6 files changed: +19 −719 lines


examples/hstu/inference/README.md (9 additions, 25 deletions)
```diff
@@ -2,9 +2,9 @@
 
 ## Key Features
 
-1. Cache for KV data
+1. Asynchronous Cache Manager for KV data
 
-We use GPU memory and host storage for KV data cache, as in `GpuKVCacheManager` and `HostKVStorageManager`. This can help to reduce the recomputation of KV data.
+We use GPU memory and host storage for the KV data cache, as in `AsyncKVCacheManager`. This helps reduce recomputation of KV data. All KV-cache-related operations are implemented asynchronously so that their overhead is hidden behind the inference computation.
 
 The GPU KV cache is organized as a paged KV-data table, and supports KV data adding/appending, lookup and eviction. When appending new data to the GPU cache, we will evict data from the oldest users according to the LRU policy if there is no empty page. The HSTU attention kernel also accepts KV data from a paged table.
 
```
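The paged layout and LRU eviction policy described above can be illustrated with a minimal sketch. This is not the repository's `AsyncKVCacheManager`; the class name, page size, and bookkeeping structures below are assumptions made purely for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 32  # tokens per KV page (assumed value)


class PagedKVCacheSketch:
    """Toy model of a paged GPU KV cache with per-user LRU eviction."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.user_pages = OrderedDict()  # user_id -> page ids, least recently used user first
        self.cached_len = {}             # user_id -> number of cached tokens

    def append(self, user_id: int, num_new_tokens: int) -> list:
        """Reserve pages for new tokens, evicting the oldest users if the pool is empty."""
        pages = self.user_pages.setdefault(user_id, [])
        total = self.cached_len.get(user_id, 0) + num_new_tokens
        pages_needed = -(-total // PAGE_SIZE) - len(pages)  # ceil division minus pages already held
        for _ in range(pages_needed):
            if not self.free_pages:
                self._evict_lru(exclude=user_id)
            pages.append(self.free_pages.pop())
        self.cached_len[user_id] = total
        self.user_pages.move_to_end(user_id)  # mark this user as most recently used
        return pages  # the attention kernel would read K/V from these pages

    def _evict_lru(self, exclude: int) -> None:
        """Return all pages of the least recently used user (other than `exclude`) to the pool."""
        victim = next((uid for uid in self.user_pages if uid != exclude), None)
        if victim is None:
            raise RuntimeError("cache too small for a single user's history")
        self.free_pages.extend(self.user_pages.pop(victim))
        self.cached_len.pop(victim)
```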

```diff
@@ -33,33 +33,16 @@ The dense module is served as one instance per GPU, and the KV cache is not supp
 ### KVCache Usage
 
 1. KVCache Manager supports the following operations:
-* `get_user_kvdata_info`: to get the current cached length and the index of the first cached token in the history sequence
-* `prepare_kv_cache`: to allocate the required cache pages. The input history sequence need to be
-* `paged_kvcache_ops.append_kvcache`: the CUDA kernel to copy the `K, V` values into the allocated cache pages
-* `offload_kv_cache`: to offload the KV data from GPU KVCache to Host KV storage.
+* `prepare_kvcache_async`: to trigger, in the background, the allocation of the required KV cache pages, the kvcache_metadata computation, and the onloading of KV data from Host KV storage to the GPU KVCache.
+* `prepare_kvcache_wait`: to wait for the new KV cache page allocation and kvcache_metadata computation to finish.
+* `paged_kvcache_ops.append_kvcache`: the CUDA kernel that copies the `K, V` values into the allocated cache pages.
+* `offload_kvcache`: to trigger offloading of the KV data from the GPU KVCache to Host KV storage in the background.
 * `evict_kv_cache`: to evict all the KV data in the KVCache Manager.
 
-2. Currently, the KVCache manager need to be access from a single thread.
+2. Currently, the KVCache manager must be accessed from a single inference stream; there is no multi-stream support.
 
-3. For different requests, the calls to `get_user_kvdata_info` and `prepare_kv_cache` need to be in order and cannot be interleaved, since the allocation in `prepare_kv_cache` may evict the cached data of other users, which changes the user kvdata_info.
+3. The KVCache manager accepts the full user history sequence as input. The removal of already-cached tokens from the sequence is handled within the inference forward pass.
 
-4. The KVCache manager does not support a discontinuous user history sequence as input from the same user. The overlapping tokens need to be removed before sending the sequence to the inference model. Doing the overlap removal in the upstream stage should be more performant than in the inference model.
-
-```
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 10, length: 10}
-# Acceptable input
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 20, length: 10}
-^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has missing tokens from 10 to 19 (both inclusive).
-
-[current KV data in cache] userId: 0, starting position: 0, cached length: 10
-[next input] {userId: 0, starting position: 5, length: 20}
-^^^^^^^^^^^^^^^^^^^^^
-ERROR: The input sequence has overlapping tokens from 5 to 9 (both inclusive).
-```
 
 ## How to Setup
 
```
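Taken together, the new operations suggest the following call order inside a forward pass. This is a hypothetical sketch rather than code from this repository: the argument lists, the return values, and the idea that `prepare_kvcache_wait` hands back the computed kvcache_metadata are all assumptions.

```python
def forward_with_kvcache(kvcache_manager, model, batch, user_ids, history_lengths):
    """Hypothetical ordering of the KVCache Manager calls around one inference step."""
    # 1. Kick off page allocation, kvcache_metadata computation, and host->GPU onloading
    #    in the background, so they overlap with work that does not need the KV cache.
    kvcache_manager.prepare_kvcache_async(user_ids, history_lengths)

    # ... embedding lookup / input preprocessing can run here ...

    # 2. Block until the pages and kvcache_metadata are ready, just before attention needs them.
    kvcache_metadata = kvcache_manager.prepare_kvcache_wait()

    # 3. Inside the HSTU layers, newly computed K/V values are appended into the allocated
    #    pages (paged_kvcache_ops.append_kvcache), and attention reads from the paged table.
    logits = model(batch, kvcache_metadata)

    # 4. Push the updated KV data back to Host KV storage in the background; a later request
    #    for the same users can onload it again via step 1.
    kvcache_manager.offload_kvcache(user_ids)
    return logits
```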

```diff
@@ -71,6 +54,7 @@ Turn on option `INFERENCEBUILD=1` to skip Megatron installation, which is not re
 ~$ cd ${WORKING_DIR}
 ~$ git clone --recursive -b ${TEST_BRANCH} ${TEST_REPO} recsys-examples && cd recsys-examples
 ~$ docker build \
+    --platform linux/amd64 \
     --build-arg INFERENCEBUILD=1 \
     -t recsys-examples:inference \
     -f docker/Dockerfile .
```

examples/hstu/inference/benchmark/README.md (3 additions, 2 deletions)
```diff
@@ -36,7 +36,7 @@ Here we benchmarked with a synthetic input dataset:
 * Each input request has 256 item candidates for ranking.
 * Generate data for 1, 2, 4 and 8 users to benchmark with different batch sizes.
 
-We can achieve **1.4x ~ 2.7x** performance speedup for inference (with batch size ranging from 1 to 8), after utilizing the KV cache and CUDA graph optimization.
+We can achieve a **1.3x ~ 2.6x** performance speedup for inference (with batch sizes ranging from 1 to 8) after utilizing the KV cache and CUDA graph optimization.
 
 Performance results:
 
```
```diff
@@ -46,7 +46,8 @@ Note:
 
 1. The baseline performance is based on our implementation without KVCache support and CUDA Graph optimization.
 2. The end-to-end performance includes the embedding part, which utilizes both the native `EmbeddingCollection` from TorchRec and `DynamicEmbedding`.
-3. The number of input sequences from the synthetic dataset increases according to the batch size.
+3. The number of input sequences from the synthetic dataset increases according to the batch size. All test cases have 16 batches in total.
+4. In the test cases with KVCache enabled, the KV cache preparation and onloading/offloading fall within the time measurement, but they are hidden behind computation as asynchronous operations.
 
 ### 2. HSTU block performance
 
```
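Note 4 can be made concrete with a small, self-contained PyTorch sketch that is not taken from the benchmark code: the host-to-device onload runs on a side stream between the two timing events, so it lies inside the measured window but is overlapped with compute. Tensor names and shapes here are arbitrary.

```python
import torch


def timed_step(x_gpu, host_kv, kv_gpu):
    """Overlap a host->GPU KV onload with compute; both stay inside the timed region."""
    copy_stream = torch.cuda.Stream()
    start, end = [torch.cuda.Event(enable_timing=True) for _ in range(2)]

    start.record()
    # Onload KV data from pinned host memory on a side stream, in the background.
    with torch.cuda.stream(copy_stream):
        kv_gpu.copy_(host_kv, non_blocking=True)
    copy_done = torch.cuda.Event()
    copy_done.record(copy_stream)

    # Work that does not need the KV data runs concurrently on the default stream.
    y = x_gpu @ x_gpu.T

    # Only the part that consumes the KV data waits for the copy to finish.
    torch.cuda.current_stream().wait_event(copy_done)
    y = y + kv_gpu.sum()
    end.record()

    torch.cuda.synchronize()  # events must have completed before reading elapsed_time
    return y, start.elapsed_time(end)  # milliseconds; the copy is hidden if compute takes longer


if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    host_kv = torch.randn(1024, 1024, pin_memory=True)
    kv = torch.empty(1024, 1024, device="cuda")
    _, ms = timed_step(x, host_kv, kv)
    print(f"step time: {ms:.2f} ms")
```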

examples/hstu/inference/benchmark/inference_benchmark.py (7 additions, 7 deletions)
```diff
@@ -111,7 +111,7 @@ def run_ranking_gr_inference(disable_kvcache: bool):
         hstu_config=hstu_config,
         kvcache_config=kv_cache_config,
         task_config=task_config,
-        use_cudagraph=False,  # True,
+        use_cudagraph=True,
         cudagraph_configs=hstu_cudagraph_configs,
     )
     model_predict.bfloat16()
```
```diff
@@ -122,8 +122,8 @@ def run_ranking_gr_inference(disable_kvcache: bool):
         item_feature_name=item_fea_name,
         contextual_feature_names=[],
         action_feature_name=action_fea_name,
-        max_num_users=8,
-        max_batch_size=8,  # test batch size
+        max_num_users=1,
+        max_batch_size=1,  # test batch size
         max_history_length=max_num_history,
         max_num_candidates=max_num_candidates,
         max_incremental_seqlen=max_incremental_seqlen,
```
```diff
@@ -133,18 +133,18 @@ def run_ranking_gr_inference(disable_kvcache: bool):
 
     dataloader = get_data_loader(dataset)
 
-    num_warmup = 16
-    for idx in range(num_warmup):
-        pass
+    # Warm up
+    for batch, user_ids, total_history_lengths in dataloader:
+        model_predict.forward_nokvcache(batch)
 
+    dataloader = get_data_loader(dataset)
     ts_start, ts_end = [torch.cuda.Event(enable_timing=True) for _ in range(2)]
     ts_start.record()
     for batch, user_ids, total_history_lengths in dataloader:
         if not disable_kvcache:
             model_predict.forward(batch, user_ids, total_history_lengths)
         else:
             model_predict.forward_nokvcache(batch)
-
     ts_end.record()
     predict_time = ts_start.elapsed_time(ts_end)
     print("Total time(ms):", predict_time)
```