
Commit 492eac7

test

1 parent ec8842f commit 492eac7

File tree

2 files changed: +4 −3 lines


recipes/experimental/long-context/H2O/README.md

Lines changed: 3 additions & 2 deletions
@@ -2,9 +2,9 @@

### Overview:

Heavy-Hitter Oracle (H2O) is an efficient inference framework for LLMs. During the generative inference of transformers, the size of the KV cache grows linearly with the sequence length (prompt length + generation length), and the KV cache is usually significantly larger than the model parameters, constraining the inference throughput. H2O identifies the critical KV pairs and evicts the unnecessary ones, maintaining a small cache size and thus improving throughput.

Besides, LLMs usually generalize poorly to long sequences during inference. H2O handles this issue by keeping only the heavy-hitter tokens and the most recent tokens. Combined with a positional rolling strategy (assigning each KV pair the position it occupies in the KV cache instead of its position in the original sequence), H2O can process sequences much longer than the pretrained context window.

The current implementation supports Llama-1/2/3, from 7B to 70B. Since H2O only keeps the most important KV pairs, it might miss some information in the middle of the context for knowledge-intensive tasks.
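The eviction policy described in the overview can be sketched in a few lines. This is a minimal illustration under assumptions, not the recipe's actual implementation: the names (`evict_kv_cache`, `num_heavy`, `num_recent`, `acc_attn_scores`) are hypothetical, and it assumes an accumulated attention score is tracked for every cached token.

```python
import torch

def evict_kv_cache(keys, values, acc_attn_scores, num_heavy, num_recent):
    """Keep the num_heavy tokens with the largest accumulated attention
    scores (the "heavy hitters") plus the num_recent most recent tokens.

    keys, values: [seq_len, head_dim] cache tensors for a single head
    acc_attn_scores: [seq_len] attention mass each cached token has received
    """
    seq_len = keys.shape[0]
    if seq_len <= num_heavy + num_recent:
        return keys, values, acc_attn_scores  # under budget, nothing to evict

    # The most recent tokens are always kept.
    recent_idx = torch.arange(seq_len - num_recent, seq_len)

    # Among the older tokens, keep those with the highest accumulated scores.
    older = acc_attn_scores[: seq_len - num_recent]
    heavy_idx = torch.topk(older, num_heavy).indices.sort().values

    keep = torch.cat([heavy_idx, recent_idx])
    return keys[keep], values[keep], acc_attn_scores[keep]

# Positional rolling (as described above): position ids are taken from each
# entry's slot in the cache (0 .. cache_len-1) rather than from its index in
# the original sequence, so they never exceed the pretrained context window.
def rolled_position_ids(cache_len):
    return torch.arange(cache_len)
```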
@@ -43,6 +43,7 @@ The following example runs inference of Llama-3-8b-instruct on "Needle in a haystack"

```
# step 1: obtain prompts for evaluation
python utils/needle_test/prompt.py --model_name meta-llama/Meta-Llama-3-8B-Instruct
+ # modify utils/needle_test/config-prompt.yaml to adjust the min/max sequence length for the test

# step 2: generate predictions for each prompt

recipes/experimental/long-context/H2O/utils/needle_test/config-prompt.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ prompt:

context:
  min_len: 1000
- max_len: 8000
+ max_len: 16000
  interval: 10
  manually_select_list: null # null or a list of context lengths to manually select
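As a rough illustration of what this change does, here is one plausible reading of how `min_len`, `max_len`, and `interval` could translate into tested context lengths (assuming evenly spaced lengths; the recipe's prompt.py may compute them differently):

```python
import numpy as np

# values from the updated config (interval assumed to be the number of steps)
min_len, max_len, interval = 1000, 16000, 10
context_lengths = np.linspace(min_len, max_len, num=interval, dtype=int)
print(context_lengths)
# [ 1000  2666  4333  6000  7666  9333 11000 12666 14333 16000]
```

Raising `max_len` from 8000 to 16000 pushes the needle test well past Llama-3's roughly 8K pretrained window, which is exactly the regime the positional rolling strategy targets.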
