
Commit 82d3587

[refactor] Unify name of NGram speculative decoding (#5937)

Signed-off-by: wili-65535 <wili-65535@users.noreply.github.com>
Co-authored-by: wili-65535 <wili-65535@users.noreply.github.com>

1 parent 152e2df

File tree

15 files changed: +140 −143 lines changed

docs/source/advanced/speculative-decoding.md

Lines changed: 6 additions & 6 deletions

```diff
@@ -3,7 +3,7 @@
 - [About Speculative Sampling](#about-speculative-sampling)
 - [Performance Improvements](#Performance-improvements)
 - [Draft-Target-Model](#Draft-Target-Model)
-- [Prompt-Lookup-Decoding](#prompt-lookup-decoding)
+- [NGram](#ngram)
 - [Medusa](#medusa)
 - [Medusa Tree](#medusa-tree)
 - [Using Medusa with TensorRT-LLM](#using-medusa-with-tensorrt-llm)
@@ -36,7 +36,7 @@ TensorRT-LLM supports several approaches for generating draft tokens, including:
 1. [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads paper](https://arxiv.org/abs/2401.10774).
 2. [Recurrent Drafter for Fast Speculative Decoding in Large Language Models](https://arxiv.org/html/2403.09919v1).
 3. [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/pdf/2401.15077).
-3. Utilizing prompt tokens as draft tokens. For more information, refer to [Prompt Lookup Decoding](https://github.com/apoorvumang/prompt-lookup-decoding/).
+3. Utilizing prompt tokens as draft tokens. For more information, refer to [NGram](https://github.com/apoorvumang/prompt-lookup-decoding/).
 4. Utilizing Jacobi-like decoding to predict and verify draft tokens using the same model which does not need additional fine-tuning. Refer to [Break the Sequential Dependency of LLM Inference Using Lookahead Decoding](https://arxiv.org/pdf/2402.02057).
 
@@ -62,13 +62,13 @@ Subsequently, the prompt, now updated with the accepted tokens, is sent back to
 This iterative process continues until a predefined stop conditions are met.
 An example of this orchestration process can be found in the [TensorRT-LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/inflight_batcher_llm/client/e2e_grpc_speculative_decoding_client.py).
 
-We provide two styles of running Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
+We provide two styles of running Draft-Target-Model now: using TensorRT-LLM-BLS in Triton Inference Server, or using TensorRT-LLM directly. Detailed steps of running can be found in [examples/draft_target_model/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/draft_target_model/README.md) and the code can be found in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
 
-## Prompt-Lookup-Decoding
+## NGram
 
-The Prompt-Lookup speculative decoding directly copies from the input prompt and previous generated output as draft tokens while generating the later output. It works like Draft-Target-Model but involves only one Target LLM model without further fine-tuning. The Prompt-Lookup profit from the scenarios which have high n-gram overlap between input prompt and output, such as summarization, document QA, multi-turn chat, code editing, etc.
+The NGram speculative decoding directly copies from the input prompt and previous generated output as draft tokens while generating the later output. It works like Draft-Target-Model but involves only one Target LLM model without further fine-tuning. The NGram profit from the scenarios which have high n-gram overlap between input prompt and output, such as summarization, document QA, multi-turn chat, code editing, etc.
 
-See document in [examples/prompt_lookup/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/README.md) and the code can be found in [examples/prompt_lookup/run_dtm_pld.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/prompt_lookup/run_dtm_pld.py).
+See document in [examples/ngram/README.md](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/README.md) and the code can be found in [examples/ngram/run_dtm_ngram.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/ngram/run_dtm_ngram.py).
 
 ## Medusa
```

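The NGram scheme the updated docs describe — drafting tokens by matching the tail of the current sequence against earlier text — can be sketched roughly as follows. This is a minimal stand-alone illustration, not the repository's implementation; the function name and defaults are assumptions for the sketch:

```python
def ngram_draft(tokens, max_matching_ngram_size=2, max_draft_len=4):
    """Propose draft tokens by matching the tail n-gram against earlier text.

    Tries the longest tail pattern first; on a hit, returns up to
    `max_draft_len` tokens that followed the earlier occurrence.
    """
    for size in range(max_matching_ngram_size, 0, -1):
        if len(tokens) <= size:
            continue
        pattern = tuple(tokens[-size:])
        # Scan earlier positions for the same pattern (excluding the tail itself).
        for start in range(len(tokens) - size):
            if tuple(tokens[start:start + size]) == pattern:
                continuation = tokens[start + size:start + size + max_draft_len]
                if continuation:
                    return continuation
    return []  # no match: fall back to normal one-token-per-step decoding


# The tail (1, 2) also occurred at the start, so its continuation is drafted.
ids = [1, 2, 3, 4, 1, 5, 6, 1, 2]
print(ngram_draft(ids))  # [3, 4, 1, 5]
```

When no n-gram of any size repeats, the function returns an empty draft and decoding proceeds one token per iteration, which is the fallback behavior the docs mention.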
examples/llm-api/README.md

Lines changed: 5 additions & 4 deletions

````diff
@@ -40,9 +40,10 @@ python3 quickstart_multimodal.py --model_dir Efficient-Large-Model/NVILA-8B --mo
 python3 quickstart_advanced.py \
     --model_dir meta-llama/Llama-3.1-8B-Instruct \
     --spec_decode_algo NGRAM \
-    --max_matching_ngram_size=2 \
-    --spec_decode_nextn=4 \
-    --disable_overlap_scheduler
+    --spec_decode_nextn 4 \
+    --max_matching_ngram_size 2 \
+    --disable_overlap_scheduler \
+    --disable_kv_cache_reuse
 ```
 
 ```bash
@@ -52,6 +53,6 @@ python3 quickstart_advanced.py \
     --spec_decode_algo draft_target \
     --spec_decode_nextn 5 \
     --draft_model_dir meta-llama/Llama-3.2-1B-Instruct \
-    --disable_overlap_scheduler
+    --disable_overlap_scheduler \
     --disable_kv_cache_reuse
 ```
````
Lines changed: 30 additions & 28 deletions

````diff
@@ -1,17 +1,17 @@
-# Prompt-Lookup Speculative Decoding
+# NGram Speculative Decoding
 
-This document shows how to build and run a model using Prompt-Lookup speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on single GPU, or single node multiple GPU.
+This document shows how to build and run a model using NGram speculative decoding (supported as `ASSISTED_GENERATION` in transformers and vLLM, source: [GitHub](https://github.com/apoorvumang/prompt-lookup-decoding/tree/main)) in TensorRT-LLM on single GPU, or single node multiple GPU.
 
 ## Overview
 
-We provide two styles of workflow to run Prompt-Lookup (named V1 and V2 respectively) now. V1 is in TRT workflow and similar to the Draft-Target-Model workflow, running in orchestrator mode and calling `runner.generate()` multiple times to get outputs, which is more flexible for customizing but slightly more overhead. V2 is in pytorch workflow and similar to the Look-Ahead workflow, running in leader mode and calling `runner.generate()` only one time to get outputs, which provides higher performance but fixed process.
+We provide two styles of workflow to run NGram (named V1 and V2 respectively) now. V1 is in TRT workflow and similar to the Draft-Target-Model workflow, running in orchestrator mode and calling `runner.generate()` multiple times to get outputs, which is more flexible for customizing but slightly more overhead. V2 is in pytorch workflow and similar to the Look-Ahead workflow, running in leader mode and calling `runner.generate()` only one time to get outputs, which provides higher performance but fixed process.
 
-The Prompt-Lookup has 3 additional hyperparameters that you need to specify to control the process of generation:
-- `prompt_lookup_num_tokens`: the maximum number of tokens provided as draft tokens in one iteration, which is usually from 4 to 10 in common usage (default value: 4). Empirically, the larger the value is, the higher acceptance rate but higher overhead is expected at the same time, so the right balance based on the models and application scenarios needs to be found.
+The NGram has 3 additional hyperparameters that you need to specify to control the process of generation:
+- `max_draft_len`: the maximum number of tokens provided as draft tokens in one iteration, which is usually from 4 to 10 in common usage (default value: 4). Empirically, the larger the value is, the higher acceptance rate but higher overhead is expected at the same time, so the right balance based on the models and application scenarios needs to be found.
 - `max_matching_ngram_size`: the maximum number of tokens extracted from the tail of the input prompt or generated output as a pattern, which is used to search corresponding draft tokens (default value: 2). Empirically, the larger the value is, the more precise context can be matched from the existed sequence, indicating higher acceptance rate, but the higher probability of miss-match and higher overhead appear, which fall back to normal generation (one token per iteration).
 - `device_list`: the index list of device(s) to run the model in V1 workflow. The length of it must be the same as the TP size of the draft model engine. For instances, `device_list=[0]` means using tp_size=1 and GPU 0 for the model, `device_list=[4,5,6,7]` means using tp=4 and GPU from 4 to 7 for the model. This parameter is neddless in V2 workflow.
 
-+ For example, the process of getting draft tokens using `prompt_lookup_num_tokens=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is like below:
++ For example, the process of getting draft tokens using `max_draft_len=2` and `max_matching_ngram_size=4` with a sentence `prefix=[..., t1, t2, t3, t4]` is like below:
 
 ```Python
 pattern = prefix[:-2] # pattern=[t3, t4] (length=2)
@@ -40,39 +40,39 @@ return None # No any candidate exists
 + We use an open-source `llama-v2-13B` models in this example.
 + `--use_paged_context_fmha=enable` must be specified since we need KVcache reuse in this approach.
 + `--speculative_decoding_mode=draft_tokens_external` must be specified.
-+ `--max_draft_len` must be specified larger or equal to `prompt_lookup_num_tokens`.
-+ `---prompt_lookup_config` is corresponding configuration of Prompt-Lookup, we can see its usage in [util.py](../util.py).
-+ As an example, `[10,2,[0]]` means `prompt_lookup_num_tokens=10`, `max_matching_ngram_size=2`, and device of target model is `GPU0`.
++ `--max_draft_len` must be specified as the length maximum of the draft tokens.
++ `--ngram_config` is corresponding configuration of NGram, we can see its usage in [util.py](../util.py).
++ As an example, `[10,2,[0]]` means `max_draft_len=10`, `max_matching_ngram_size=2`, and device of target model is `GPU0`.
 + `--kv_cache_enable_block_reuse` must be specified for this approach.
 + Only CPP session is supported, so `--use_py_session` must not be specified.
 + `--num_beams` can not be specified as larger than 1 since beam search is not supported in this approach yet.
 
 ```bash
 # Build engine
 python3 examples/models/core/llama/convert_checkpoint.py \
-    --model_dir=<Path To Llama-v2-13B repo> \
-    --output_dir=./ckpt-target \
-    --dtype=float16
+    --model_dir <Path To Llama-v2-13B repo> \
+    --output_dir ./ckpt-target \
+    --dtype float16
 
 trtllm-build \
-    --checkpoint_dir=./ckpt-target \
-    --output_dir=./target-engine \
-    --gemm_plugin=float16 \
-    --use_paged_context_fmha=enable \
-    --speculative_decoding_mode=draft_tokens_external \
-    --max_draft_len=10 \
-    --max_batch_size=4 \
-    --max_input_len=3200 \
-    --max_seq_len=4800
+    --checkpoint_dir ./ckpt-target \
+    --output_dir ./target-engine \
+    --gemm_plugin float16 \
+    --use_paged_context_fmha enable \
+    --speculative_decoding_mode draft_tokens_external \
+    --max_draft_len 10 \
+    --max_batch_size 4 \
+    --max_input_len 3200 \
+    --max_seq_len 4800
 
 # Run decoding
 python3 examples/run.py \
     --tokenizer_dir <Path To Llama-v2-7B repo> \
     --engine_dir ./target-engine \
-    --prompt_lookup_config="[10,2,[0]]" \
-    --max_output_len=256 \
+    --ngram_config "[10,2,[0]]" \
+    --max_output_len 256 \
     --kv_cache_enable_block_reuse \
-    --input_text="How does Draft-Sampling work?"
+    --input_text "How does Draft-Sampling work?"
 
 # Run summarization tasks
 python examples/summarize.py \
@@ -81,15 +81,17 @@ python examples/summarize.py \
     --check_accuracy \
     --hf_model_dir <Path To Llama-v2-7B repo> \
     --engine_dir ./target-engine \
-    --batch_size=1 \
-    --prompt_lookup_config="[10,2,[0]]" \
+    --batch_size 1 \
+    --ngram_config "[10,2,[0]]" \
     --kv_cache_enable_block_reuse
 ```
 
 ### V2 workflow
 
 ```bash
 python3 examples/llm-api/quickstart_advanced.py \
-    --max_matching_ngram_size=2 \
-    --spec_decode_nextn=4
+    --spec_decode_nextn 4 \
+    --max_matching_ngram_size 2 \
+    --disable_overlap_scheduler \
+    --disable_kv_cache_reuse
 ```
````
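The key/value pool this README describes — each n-gram key mapping to the tokens that followed it, with the most recent continuation preferred — can be sketched as follows. This is a simplified stand-alone illustration, not the `NgramPool` class itself; function names are assumptions for the sketch:

```python
def build_pool(sequence, max_matching_ngram_size=2, max_draft_len=4):
    """Map each n-gram (tuple key) to the tokens that followed it.

    Later occurrences overwrite earlier ones, so lookups prefer the most
    recent continuation, loosely mirroring MRU behavior.
    """
    pool = {}
    for size in range(1, max_matching_ngram_size + 1):
        for l in range(len(sequence) - size):
            r = min(l + size + max_draft_len, len(sequence))
            pool[tuple(sequence[l:l + size])] = sequence[l + size:r]
    return pool


def lookup_draft(pool, sequence, max_matching_ngram_size=2):
    """Return draft tokens for the longest matching tail pattern, else []."""
    for size in range(max_matching_ngram_size, 0, -1):
        value = pool.get(tuple(sequence[-size:]))
        if value:
            return value
    return []


seq = [10, 11, 12, 10, 11]
pool = build_pool(seq)
print(lookup_draft(pool, seq))  # [12, 10, 11]
```

In the sequence above the tail bigram `(10, 11)` already occurred at the start, so the tokens that followed it, `[12, 10, 11]`, are proposed as the draft for the next iteration.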
Lines changed: 34 additions & 36 deletions

```diff
@@ -23,20 +23,20 @@
 from tensorrt_llm.runtime import ModelRunnerCpp
 
 
-class PLDPool:  # Ngrams pool for Prompt-Lookup-Decoding
+class NgramPool:  # Ngrams pool for Ngram
 
     def __init__(
         self,
         input_batch_size: int,
-        prompt_lookup_num_tokens: int,
+        max_draft_len: int,
         max_matching_ngram_size: int,
         end_id: int,
         max_seq_len: list[int],
         is_keep_all: bool = True,
         is_use_oldest: bool = True,
     ):
         self.input_batch_size = input_batch_size
-        self.prompt_lookup_num_tokens = prompt_lookup_num_tokens
+        self.max_draft_len = max_draft_len
         self.max_matching_ngram_size = max_matching_ngram_size
         self.end_id = end_id
         self.max_seq_len = max_seq_len
@@ -45,7 +45,7 @@ def __init__(
         self.pool = [{} for _ in range(input_batch_size)]
         self.start_index = [0 for _ in range(input_batch_size)]
 
-        assert self.prompt_lookup_num_tokens > 0, f"prompt_lookup_num_tokens must be greater than 0, but got {self.prompt_lookup_num_tokens}"
+        assert self.max_draft_len > 0, f"max_draft_len must be greater than 0, but got {self.max_draft_len}"
         assert self.max_matching_ngram_size > 0, f"max_matching_ngram_size must be greater than 0, but got {self.max_matching_ngram_size}"
 
     def print_pool(self):
@@ -82,16 +82,15 @@ def get_draft_tokens(self, prefix: list[torch.Tensor],
                              -1):
             # Find each possible key-value combination, and use tuple for hash
             for l in range(len(sequence) - size):
-                r = min(l + size + self.prompt_lookup_num_tokens,
-                        len(sequence))
+                r = min(l + size + self.max_draft_len, len(sequence))
                 key = tuple(sequence[l:l + size])
                 value = tuple(sequence[l + size:r])
                 if key not in self.pool[gbi] or not self.is_keep_all or \
-                    len(self.pool[gbi][key][0]) < self.prompt_lookup_num_tokens:
+                    len(self.pool[gbi][key][0]) < self.max_draft_len:
                     # Update the value if
                     # 1. the key does not exist
                     # 2. we only keep the newest one value for each key (MRU)
-                    # 3. the length of the value saved before is less than `prompt_lookup_num_tokens`
+                    # 3. the length of the value saved before is less than `max_draft_len`
                     self.pool[gbi][key] = OrderedSet((value, ))
                 elif value not in self.pool[gbi][key]:
                     # Extend the value if the key is already existed but count of values is not enough
@@ -113,26 +112,26 @@ def get_draft_tokens(self, prefix: list[torch.Tensor],
                     break
             draft_tokens.append(chosen_ids)
             self.start_index[gbi] = max(
-                0, prefix_len[bi] - (self.prompt_lookup_num_tokens +
-                                     self.max_matching_ngram_size - 1))
+                0, prefix_len[bi] -
+                (self.max_draft_len + self.max_matching_ngram_size - 1))
 
         return draft_tokens, None
 
 
-def run_dtm_pld(batch_input_ids,
-                args,
-                runtime_rank,
-                end_id,
-                pad_id,
-                stop_words_list,
-                bad_words_list,
-                vocab_size,
-                *,
-                target_runner=None):
-    # `dtm` for Draft-Target-Model, `pld` for Prompt-Lookup-Decoding
+def run_dtm_ngram(batch_input_ids,
+                  args,
+                  runtime_rank,
+                  end_id,
+                  pad_id,
+                  stop_words_list,
+                  bad_words_list,
+                  vocab_size,
+                  *,
+                  target_runner=None):
+    # `dtm` for Draft-Target-Model, `ngram` for NGram
     is_dtm = (args.draft_target_model_config is not None)
-    is_pld = (args.prompt_lookup_config is not None)
-    assert is_dtm ^ is_pld, "`--draft_target_model_config` and `--prompt_lookup_config` can not be specified at the same time."
+    is_ngram = (args.ngram_config is not None)
+    assert is_dtm ^ is_ngram, "`--draft_target_model_config` and `--ngram_config` can not be specified at the same time."
     if is_dtm:
         assert args.draft_engine_dir is not None, "`--draft_engine_dir` must be specified in Draft-Target-Model."
         draft_len, draft_device_list, target_device_list, use_logits = ast.literal_eval(
@@ -142,12 +141,11 @@ def run_dtm_pld(batch_input_ids,
         logger.info(f"Device(s) for draft model: {draft_device_list}")
         logger.info(f"Device(s) for target model: {target_device_list}")
         logger.info(f"Use logits to accept tokens: {use_logits}")
-    if is_pld:
-        logger.info(
-            f"Using Prompt-Lookup-Decoding speculative decoding V1 workflow")
-        prompt_lookup_num_tokens, max_matching_ngram_size, target_device_list = ast.literal_eval(
-            args.prompt_lookup_config)
-        logger.info(f"prompt_lookup_num_tokens: {prompt_lookup_num_tokens}")
+    if is_ngram:
+        logger.info(f"Using NGram speculative decoding V1 workflow")
+        max_draft_len, max_matching_ngram_size, target_device_list = ast.literal_eval(
+            args.ngram_config)
+        logger.info(f"max_draft_len: {max_draft_len}")
         logger.info(f"max_matching_ngram_size: {max_matching_ngram_size}")
         logger.info(f"Device(s) for the model: {target_device_list}")
         use_logits = False  # `logits` is useless in this approach yet
@@ -166,9 +164,9 @@ def run_dtm_pld(batch_input_ids,
         n_draft_token = [0 for _ in range(input_batch_size)]
         n_accept_token = [0 for _ in range(input_batch_size)]
 
-    if is_pld:
-        pld_pool = PLDPool(input_batch_size, prompt_lookup_num_tokens,
-                           max_matching_ngram_size, end_id, max_seq_len)
+    if is_ngram:
+        ngram_pool = NgramPool(input_batch_size, max_draft_len,
+                               max_matching_ngram_size, end_id, max_seq_len)
 
     # Repack the output like the output of function `generate`
     outputs = {}
@@ -297,8 +295,8 @@ def run_dtm_pld(batch_input_ids,
             if use_logits:
                 d_logits[bi] = draft["generation_logits"][bi, 0,
                                                           -d_len[bi]:, :]
-        if is_pld:
-            d_ids, d_logits = pld_pool.get_draft_tokens(prefix, batch_slot)
+        if is_ngram:
+            d_ids, d_logits = ngram_pool.get_draft_tokens(prefix, batch_slot)
         d_len = [len(i) for i in d_ids]
 
         # Run target model
@@ -310,8 +308,8 @@ def run_dtm_pld(batch_input_ids,
                     draft_logits_list=d_logits)
         if is_dtm:
             max_new_tokens = draft_len + 1
-        if is_pld:
-            max_new_tokens = prompt_lookup_num_tokens + 1
+        if is_ngram:
+            max_new_tokens = max_draft_len + 1
         target_generation_kwargs.update(max_new_tokens=max_new_tokens)
         target = target_runner.generate(**target_generation_kwargs)
         torch.cuda.synchronize()
```

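The V1 orchestration in the script above follows the standard draft-then-verify loop: the target model generates `max_draft_len + 1` tokens over the drafted positions, and the longest agreeing prefix is accepted plus one target token. A sketch of that acceptance rule (illustrative only, not the script's exact code):

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy acceptance: keep the longest prefix of the draft that the
    target model reproduced, then append the target's own next token.

    `target_tokens` holds the target model's output for the same positions
    plus one extra token (the correction/bonus token).
    """
    n = 0
    while n < len(draft_tokens) and draft_tokens[n] == target_tokens[n]:
        n += 1
    return target_tokens[:n + 1]


# Target agrees on [5, 6], rejects 7, and supplies 9 as the correction.
print(accept_draft([5, 6, 7], [5, 6, 9, 8]))  # [5, 6, 9]
```

Each iteration therefore advances by at least one token (the correction) and at most `max_draft_len + 1` tokens, which is where the speedup comes from when the n-gram drafts match well.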