
Commit 4e1466c

Implement H2O for long context inference on summarization tasks (meta-llama#411)
2 parents a32e919 + 3511426 commit 4e1466c

File tree

11 files changed: +3542 -0 lines changed


.github/scripts/spellcheck_conf/wordlist.txt

Lines changed: 7 additions & 0 deletions

@@ -1351,6 +1351,13 @@ Weaviate
 MediaGen
 SDXL
 SVD
+KV
+KVs
+XSUM
+contrains
+knowlege
+kv
+prefilling
 DataFrame
 DuckDB
 Groq
Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@

## Run Llama with H2O for long context inference

### Overview:

Heavy-Hitter Oracle (H2O) is an efficient inference framework for LLMs. During the generative inference of transformers, the size of the KV cache grows linearly with the sequence length (prompt length + generation length). For long context generation, the KV cache is usually significantly larger than the model parameters and constrains the inference throughput. H2O identifies the critical KV pairs and evicts the unnecessary ones, maintaining a small cache size and thus improving throughput.
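
To make the eviction policy concrete, here is a minimal, illustrative sketch of the idea (a toy, not the repository's `HHCache` imported from `utils.cache`; the function name and shapes are hypothetical): keep the tokens that have received the largest accumulated attention ("heavy hitters") plus the most recent tokens, and evict the rest once the cache exceeds its budget.

```
import torch

def select_kv_to_keep(attn_scores: torch.Tensor, num_heavy: int, num_recent: int) -> torch.Tensor:
    """Toy H2O-style selection for a single attention head.

    attn_scores: shape (seq_len,), accumulated attention each cached token has received.
    Returns a boolean mask over cached positions: True = keep, False = evict.
    Assumes 1 <= num_recent < seq_len.
    """
    seq_len = attn_scores.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)
    # Always keep the most recent ("local") tokens.
    keep[-num_recent:] = True
    # Among the older tokens, keep those with the largest accumulated scores.
    candidates = attn_scores.clone()
    candidates[-num_recent:] = float("-inf")
    k = max(0, min(num_heavy, seq_len - num_recent))
    keep[torch.topk(candidates, k=k).indices] = True
    return keep

# Example: a 12-token cache with a budget of 3 heavy hitters + 3 recent tokens.
scores = torch.tensor([5., 0.1, 4., 0.2, 0.1, 3., 0.1, 0.2, 0.1, 0.3, 0.2, 0.1])
print(select_kv_to_keep(scores, num_heavy=3, num_recent=3))
```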

Besides, LLMs usually generalize poorly to long sequences during inference. H2O handles this issue by maintaining only the heavy-hitter tokens and the most recent tokens. Combined with the positional rolling strategy (reassigning the position of each KV according to its slot in the KV cache instead of its position in the original sequence), H2O can process sequences much longer than the pretrained context window. Unlike other approaches, such as [Positional Interpolation](https://arxiv.org/abs/2306.15595), H2O is a KV cache policy and does not involve any training for long context processing.
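
A minimal sketch of the positional rolling idea (illustrative only; the function name is hypothetical, and the real mechanism is enabled via `config.enable_position_rolling` and handled by `H2OLlamaForCausalLM` from `utils.llama`): position ids are derived from a token's slot in the bounded cache rather than from the ever-growing original sequence index, so they never exceed the cache size.

```
import torch

def rolled_position_ids(num_cached_kvs: int, num_new_tokens: int) -> torch.Tensor:
    """Position ids for incoming tokens when only `num_cached_kvs` KVs are kept.

    Without rolling, positions track the original sequence and can exceed the
    pretrained context window; with rolling, they track the cache layout.
    """
    return torch.arange(num_cached_kvs, num_cached_kvs + num_new_tokens)

# Example: 4096 tokens have been processed so far, but only 256 KVs are cached.
# The next token is assigned position 256 instead of 4096.
print(rolled_position_ids(256, 1))  # tensor([256])
```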

The current implementation supports Llama-1/2/3, from 7B to 70B. Since H2O only maintains the most important KV pairs, it might miss some important information in the middle of the context for knowledge-intensive tasks.

For more details, please refer to the paper: **https://arxiv.org/pdf/2306.14048**; blog: **https://allenz.work/?p=11**.

**Note: this implementation is tested with transformers == 4.39.0**

### Evaluation on Summarization Tasks

The following example runs inference of Llama-2-7b and Meta-Llama-3-8B on the XSUM summarization task. We use `--enable_h2o_generation` to enable the H2O algorithm, which keeps only the heavy-hitter and local KV pairs. Use `--num_window_length` to set the KV cache size. The number of local and the number of heavy-hitter KV pairs each default to half of `--num_window_length` (optionally, the number of heavy hitters can also be set with `--num_heavy_hitter_tokens`). Also, use `--enable_position_rolling` to enable positional rolling in the KV cache, which assigns positions based on the KV cache slots instead of the original sequence. Enabling positional rolling is important when the sequence length exceeds the pretrained context window, e.g., 8K in Llama-3.

```
python run_summarization.py \
    --input-path data/summarization/xsum.jsonl \
    --output-path summarization_output/xsum_h2o.jsonl \
    --model-name meta-llama/Meta-Llama-3-8B \
    --enable_h2o_generation
```
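
For a concrete cache budget: with the defaults in `run_summarization.py`, leaving `--num_heavy_hitter_tokens` unset assigns half of `--num_window_length` to heavy hitters and the remainder to the most recent tokens. A small sketch of that arithmetic (the helper name is illustrative):

```
def kv_budget(num_window_length: int, num_heavy_hitter_tokens: int = -1):
    """Mirror the default split used in run_summarization.py."""
    if num_heavy_hitter_tokens == -1:  # not given on the command line
        num_heavy_hitter_tokens = num_window_length // 2
    num_local_tokens = num_window_length - num_heavy_hitter_tokens
    return num_heavy_hitter_tokens, num_local_tokens

print(kv_budget(256))  # (128, 128): 128 heavy hitters + 128 recent KVs
```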

##### **Results**

Expected results on XSUM (Rouge-2 score, higher is better) from the above script on Llama-2/3 models. The input sequences are ~2k tokens long. Here we constrain the size of the KV cache, allowing only n KVs to be written/read after the prefilling stage, where n ranges from **64** to **full** (maintaining all KV pairs). With 128 KVs, performance matches the full baseline (~2k KVs), while performance degradation is observed with 64 KVs. Also, maintaining a smaller KV cache reduces the I/O cost of the KVs, so we can achieve better throughput.

| KV Cache Size | 64 | 128 | 256 | 512 | 1024 | Full |
| ------------- | ------ | ------ | ------ | ------ | ------ | ------ |
| Llama-2-7B | 0.0439 | 0.1127 | 0.1148 | 0.1182 | 0.1170 | 0.1164 |
| Llama-2-13B | 0.1180 | 0.1217 | 0.1243 | 0.1291 | 0.1302 | 0.1332 |
| Llama-3-8B | 0.1107 | 0.1189 | 0.1200 | 0.1347 | 0.1290 | 0.1311 |
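
For reference, the reported numbers are Rouge-2 F-measures averaged over the evaluated samples. A minimal sketch of that scoring step (an illustrative helper mirroring what `run_summarization.py` does with the `rouge` package):

```
import numpy as np
from rouge import Rouge

def average_rouge2(generations, references):
    """Average Rouge-2 F-measure over (generation, reference) pairs."""
    rouge = Rouge()
    scores = [rouge.get_scores(gen, ref)[0]["rouge-2"]["f"]
              for gen, ref in zip(generations, references)]
    return float(np.mean(scores))
```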

### One Demo on Streaming to "Infinite" Context Length

The following example demonstrates generation with "infinite" sequence length. We use MT-Bench data and generate the context sample by sample. The KV cache keeps the KV pairs from previous samples while maintaining a fixed size. Results can be found in the [Demo](https://allenz.work/?p=11) (Video 1).

```
# run with full cache
# expected results: 1) normal generation at the early stage; 2) performance collapse and slower generation at the middle stage, because the sequence length exceeds the context window and the I/O cost of the KV cache constrains the throughput; 3) OOM errors and the run stops.
bash src/streaming.sh full

# run with h2o
# expected results: normal generation at all stages.
# adjust the number of heavy-hitter tokens with --num_heavy_hitter_tokens and the size of the KV cache with --num_window_length in src/streaming.sh
bash src/streaming.sh h2o
```

recipes/experimental/long-context/H2O/data/summarization/cnn_dailymail.jsonl

Lines changed: 1000 additions & 0 deletions
Large diffs are not rendered by default.

recipes/experimental/long-context/H2O/data/summarization/xsum.jsonl

Lines changed: 1000 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 4 additions & 0 deletions

@@ -0,0 +1,4 @@

transformers
rouge
xopen
needlehaystack

Lines changed: 91 additions & 0 deletions

@@ -0,0 +1,91 @@

import torch
import argparse
import json
import os
import time
import re
import sys

from utils.streaming import load, download_url, load_jsonl, greedy_generate

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from utils.llama import H2OLlamaForCausalLM
from utils.cache import Cache, HHCache, StaticCache


@torch.no_grad()
def streaming_inference_h2o(model, tokenizer, config, prompts, max_gen_len=1000, enable_h2o_generation=False):
    # Stream the prompts one by one, reusing the KV cache across samples.
    past_key_values = None
    for idx, prompt in enumerate(prompts):
        prompt = "USER: " + prompt + "\n\nASSISTANT: "
        print("\n" + prompt, end="")
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        input_ids = input_ids.to(model.device)
        seq_len = input_ids.shape[1]

        past_key_values = greedy_generate(
            model, tokenizer, input_ids, past_key_values, max_gen_len=max_gen_len
        )
        if enable_h2o_generation:
            # Convert to an HHCache, evict down to the H2O budget, then convert
            # back to the legacy format for the next sample.
            space_needed = seq_len + max_gen_len
            past_key_values = HHCache.from_legacy_cache(config.num_window_length, config.num_heavy_hitter_tokens, past_key_values)
            past_key_values.evict_for_space(space_needed)
            past_key_values = past_key_values.to_legacy_cache()


def main():
    parser = argparse.ArgumentParser()

    parser.add_argument("--input-path", type=str, default="")
    parser.add_argument("--model-name", type=str, default="lmsys/vicuna-13b-v1.5")

    parser.add_argument("--enable_h2o_generation", action='store_true')
    parser.add_argument("--num_heavy_hitter_tokens", type=int, default=128)
    parser.add_argument("--num_window_length", type=int, default=256)

    parser.add_argument("--enable_position_rolling", action='store_true')

    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")

    args = parser.parse_args()

    model_name = args.model_name
    data_root = args.input_path

    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    if args.enable_h2o_generation:
        config.num_heavy_hitter_tokens = args.num_heavy_hitter_tokens
        config.num_window_length = args.num_window_length
        config.enable_position_rolling = args.enable_position_rolling
        model = H2OLlamaForCausalLM.from_pretrained(model_name,
                                                    torch_dtype=torch.float16,
                                                    device_map='auto',
                                                    low_cpu_mem_usage=True,
                                                    config=config)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name,
                                                     torch_dtype=torch.float16,
                                                     device_map='auto',
                                                     low_cpu_mem_usage=True,)

    test_filepath = os.path.join(data_root, "mt_bench.jsonl")
    print(f"Loading data from {test_filepath} ...")

    if not os.path.exists(test_filepath):
        download_url(
            "https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl",
            data_root,
        )
        os.rename(os.path.join(data_root, "question.jsonl"), test_filepath)

    list_data = load_jsonl(test_filepath)
    prompts = []
    for sample in list_data:
        prompts += sample["turns"]

    streaming_inference_h2o(model, tokenizer, config, prompts, enable_h2o_generation=args.enable_h2o_generation)

if __name__ == "__main__":
    main()

Lines changed: 147 additions & 0 deletions

@@ -0,0 +1,147 @@

import os
import tqdm
import json
import copy
import math

import torch
import logging
import argparse

import numpy as np
from rouge import Rouge

import dataclasses
from xopen import xopen

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from utils.llama import H2OLlamaForCausalLM

def set_seed(args):
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed_all(args.seed)

if __name__ == '__main__':

    parser = argparse.ArgumentParser()

    parser.add_argument("--input-path", type=str, default="")
    parser.add_argument("--output-path", type=str, default="")

    parser.add_argument("--model-name", type=str, default="")

    parser.add_argument("--enable_h2o_generation", action='store_true')
    parser.add_argument("--num_heavy_hitter_tokens", type=int, default=-1)
    parser.add_argument("--num_window_length", type=int, default=256)

    parser.add_argument("--enable_position_rolling", action='store_true')

    parser.add_argument("--sample_num", type=int, default=500)
    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")

    args = parser.parse_args()

    set_seed(args)

    model_name = args.model_name
    input_path = args.input_path
    output_path = args.output_path
    os.makedirs(os.path.dirname(output_path), exist_ok=True)

    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if args.num_heavy_hitter_tokens == -1:
        # Default: reserve half of the KV cache window for heavy hitters.
        print('number of heavy hitter tokens not assigned, using half of the cache size: {}'.format(args.num_window_length // 2))
        args.num_heavy_hitter_tokens = args.num_window_length // 2

    if args.enable_h2o_generation:
        config.num_heavy_hitter_tokens = args.num_heavy_hitter_tokens
        config.num_window_length = args.num_window_length
        config.enable_position_rolling = args.enable_position_rolling
        model = H2OLlamaForCausalLM.from_pretrained(model_name,
                                                    torch_dtype=torch.float16,
                                                    device_map='auto',
                                                    low_cpu_mem_usage=True,
                                                    config=config)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name,
                                                     torch_dtype=torch.float16,
                                                     device_map='auto',
                                                     low_cpu_mem_usage=True,)

    # loading inference data
    requests = []
    with open(input_path, 'r') as f:
        for line in f:
            if line.strip() != '':
                requests.append(json.loads(line))

    if args.sample_num < len(requests):
        print('Sample {} Examples from {} samples'.format(args.sample_num, len(requests)))
    requests = requests[:args.sample_num]

    results = []
    rouge = Rouge()
    rouge1_score_list = []
    rouge2_score_list = []
    rougel_score_list = []

    # Generate a summary for each article and score it against the reference with ROUGE.
    with torch.no_grad():
        for request in tqdm.tqdm(requests):
            result = {'request': request, 'result': {}}
            prompt = request['article']
            label = request['summary_gt']
            temperature = request['temperature']
            stop = request['stop']

            input_ids = tokenizer(prompt, add_special_tokens=False, return_tensors='pt').input_ids.to(model.device)

            output_sequences = model.generate(
                input_ids=input_ids,
                max_length=request['max_tokens'] + len(input_ids[0]),
                temperature=temperature,
                top_p=request['top_p'],
                do_sample=True,
                num_return_sequences=request['n'],
                return_dict_in_generate=True, output_scores=True,
                pad_token_id=tokenizer.eos_token_id
            )

            tokens = tokenizer.convert_ids_to_tokens(output_sequences['sequences'].squeeze(0))[len(input_ids[0]):]
            logprobs = [logits.log_softmax(dim=-1).max().item() for logits in output_sequences['scores']]
            top_logprobs = [{i: v for i, v in zip(tokens, logprobs)}]

            generate_text = tokenizer.decode(output_sequences['sequences'].squeeze(0)[len(input_ids[0]):])
            # Truncate the generation at the first stop string.
            generate_text = generate_text[: generate_text.find(stop[0])]

            scores = rouge.get_scores(generate_text, label)[0]
            rouge1_score_list.append(scores['rouge-1']['f'])
            rouge2_score_list.append(scores['rouge-2']['f'])
            rougel_score_list.append(scores['rouge-l']['f'])

            result['result'] = {
                "choices": [
                    {
                        "text": generate_text,
                        "logprobs": {
                            "tokens": tokens,
                            "token_logprobs": logprobs,
                            "top_logprobs": top_logprobs,
                            "text_offset": []
                        },
                        "finish_reason": "length"
                    }
                ],
                "request_time": {
                    "batch_time": 0,
                    "batch_size": 1}
            }

            results.append(result)

    print('Average Rouge1: {:.6f}, Rouge-2: {:.6f}, Rouge-l: {:.6f}'.format(np.mean(rouge1_score_list), np.mean(rouge2_score_list), np.mean(rougel_score_list)))
    with open(output_path, 'w') as f:
        for result in results:
            f.write(json.dumps(result) + '\n')

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@

method=$1
if [[ ${method} == 'h2o' ]]; then
    python -u run_streaming.py \
        --input-path data \
        --model-name lmsys/vicuna-13b-v1.5 \
        --enable_h2o_generation \
        --num_heavy_hitter_tokens 2048 \
        --num_window_length 4096 \
        --enable_position_rolling
elif [[ ${method} == 'full' ]]; then
    python -u run_streaming.py \
        --input-path data \
        --model-name lmsys/vicuna-13b-v1.5
else
    echo 'unknown argument for method'
fi

0 commit comments
