Commit 1f51ce1

Update GitHub pages to v1.2.0rc0.post1
1 parent 4ab3889 commit 1f51ce1

475 files changed, +346855 -0 lines changed

1.2.0rc0.post1/.buildinfo

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 477e315bbc607e47d292888b705f631a
tags: 645f666f9bcd5a90fca523b33c5a78b7

1.2.0rc0.post1/.nojekyll

Whitespace-only changes.

1.2.0rc0.post1/_cpp_gen/executor.html

Lines changed: 13772 additions & 0 deletions
Large diffs are not rendered by default.

1.2.0rc0.post1/_cpp_gen/runtime.html

Lines changed: 14773 additions & 0 deletions
Large diffs are not rendered by default.

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
### :title KV Cache Offloading
### :order 6
### :section Customization
'''
This script demonstrates the effectiveness of KV cache host offloading in TensorRT-LLM.

**Scenario:**
The script simulates a scenario where the GPU's KV cache is severely limited,
while multiple requests with recurring prompts (like system prompts) are processed.

1. **Constrained GPU Cache:** The GPU KV cache is configured to be very small,
   only large enough to hold the state for a single request (see the sizing sketch below).
2. **Alternating Prompts:** Four requests are sent sequentially (batch size of 1)
   with two distinct prompts in an A, B, A, B pattern.
3. **Cache Eviction:** Due to the small GPU cache, processing prompt B will
   force the eviction of the cache generated for prompt A.
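
Taken together, these constraints are deliberately tight. As a rough sanity check
(a minimal sketch that simply reuses the values configured further down in this
script), the GPU cache holds only enough blocks for roughly one full-length request:

```python
# Back-of-the-envelope sizing, using the same values set in main() below.
kv_cache_max_tokens = 256  # total GPU KV cache capacity, in tokens
kv_cache_page_size = 16    # tokens stored per cache block
max_seq_len = 256          # a single request may need the entire budget
print(kv_cache_max_tokens // kv_cache_page_size)  # -> 16 blocks available on the GPU
print(max_seq_len // kv_cache_page_size)          # -> 16 blocks for one full-length request
```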

**Demonstration:**

* **Without Offloading (Default):**
    - When the first prompt 'A' is processed, its KV cache is stored on the GPU.
    - When prompt 'B' arrives, the cache manager needs space and discards the cache for 'A'.
    - When prompt 'A' is sent again, its cache must be recomputed from scratch.
    - **Expected Outcome:** The log will show `reused blocks: 0` and `cache hit rate: 0`.

* **With Offloading (`--enable_offloading`):**
    - When prompt 'B' arrives, the cache for 'A' is not discarded but is instead
      *offloaded* from the fast GPU VRAM to the slower (but larger) host CPU RAM.
    - When prompt 'A' is sent again, its KV cache is loaded back from host RAM
      to the GPU, which is significantly faster than recomputing it.
    - **Expected Outcome:** The log will show positive values for `reused blocks`
      and a non-zero `cache hit rate`, confirming that the cache was successfully
      reused from the host.

**How to Run & Verify:**

1. **Without Offloading:**
   ```bash
   TLLM_LOG_LEVEL=DEBUG python llm_kv_cache_offloading.py 2>&1 | tee offloading_disabled.log
   ```
   (Check the log for zero reuse.)

2. **With Offloading:**
   ```bash
   TLLM_LOG_LEVEL=DEBUG python llm_kv_cache_offloading.py --enable_offloading 2>&1 | tee offloading_enabled.log
   ```
   (Check the log for non-zero reuse; a grep sketch follows below.)
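
A quick way to compare the two runs, assuming the DEBUG log lines contain the
`reused blocks` and `cache hit rate` phrases quoted above (the file names match
the `tee` targets in the commands), is to grep both logs:

```bash
# Verification sketch: filter both logs down to the KV cache reuse statistics.
grep -iE "reused blocks|cache hit rate" offloading_disabled.log offloading_enabled.log
```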
'''

import argparse

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main(args):
    # Define two distinct prompts to simulate different requests or system prompts.
    prompt_a = (
        "Returns the per-iterations statistics computed since last call to this method. "
        "Contains at most iter_stats_max_iterations iterations.")
    prompt_b = ("Use for skipping decoding step for non generation model, "
                "and return the batch_output (such as mm_embeddings)")

    # Use a batch size of 1 to process requests sequentially, making the cache
    # eviction and reuse cycle easy to observe.
    max_batch_size = 1
    max_seq_len = 256

    # --- KV Cache Configuration ---
    # Set a small GPU KV cache size (in number of tokens). This is crucial for the demo,
    # as it's only large enough to hold the KV cache for a single request.
    kv_cache_max_tokens = 256
    # Define the size of a single cache block.
    kv_cache_page_size = 16
    # Enable a 1 GB host cache if offloading is requested, otherwise disable it (size 0).
    # This is the key toggle for the experiment.
    kv_cache_host_size = 1024**3 if args.enable_offloading else 0

    sampling_params = SamplingParams(max_tokens=max_seq_len)

    llm = LLM(
        model="Qwen/Qwen3-8B",
        max_batch_size=max_batch_size,
        max_seq_len=max_seq_len,
        kv_cache_config=KvCacheConfig(
            enable_block_reuse=True,  # Enable reuse of cached blocks
            max_tokens=kv_cache_max_tokens,  # Max tokens in GPU cache
            tokens_per_block=kv_cache_page_size,
            host_cache_size=kv_cache_host_size  # Host cache size for offloading
        ))

    # Process four requests sequentially using two distinct prompts (A, B, A, B).
    # This pattern is designed to showcase the cache eviction and reuse behavior.
    print("--- First Round ---")
    # 1. Process prompt A. Its cache is stored on the GPU.
    output_a = llm.generate(prompt_a, sampling_params)
    print(
        f"Prompt: {output_a.prompt!r}, Generated text: {output_a.outputs[0].text!r}"
    )
    # 2. Process prompt B. Its cache replaces/offloads A's cache.
    output_b = llm.generate(prompt_b, sampling_params)
    print(
        f"Prompt: {output_b.prompt!r}, Generated text: {output_b.outputs[0].text!r}"
    )

    print("\n--- Second Round ---")
    # 3. Process prompt A again.
    #    - Without offloading: Must recompute from scratch.
    #    - With offloading: Recovers cache from host RAM.
    output_a = llm.generate(prompt_a, sampling_params)
    print(
        f"Prompt: {output_a.prompt!r}, Generated text: {output_a.outputs[0].text!r}"
    )
    # 4. Process prompt B again.
    #    - Without offloading: Must recompute from scratch.
    #    - With offloading: Recovers cache from host RAM.
    output_b = llm.generate(prompt_b, sampling_params)
    print(
        f"Prompt: {output_b.prompt!r}, Generated text: {output_b.outputs[0].text!r}"
    )

    llm.shutdown()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=
        "A script to demonstrate the effectiveness of KV cache host offloading."
    )
    parser.add_argument('--enable_offloading',
                        action='store_true',
                        help='Enable host RAM for KV cache offloading.')
    args = parser.parse_args()
    main(args)
