
Commit ddc56ee

Set TP argument correctly when instantiating PagedKVCacheManager (IBM#94)
#### Motivation

Users are seeing runtime errors when trying to use TP>1 with speculative decoding.

#### Modifications

Set the tensor parallel argument correctly when instantiating the PagedKVCacheManager.

#### Result

I have verified that this change resolves the reported issue.

#### Related Issues

https://huggingface.co/ibm-fms/llama3-8b-accelerator/discussions/1

Signed-off-by: Thomas Parnell <[email protected]>
1 parent e87d462 commit ddc56ee

1 file changed (+1, -1)

server/text_generation_server/models/paged_causal_lm.py

Lines changed: 1 addition & 1 deletion

@@ -327,7 +327,7 @@ def __init__(
     model_config.num_attention_heads,
     model_config.hidden_size,
     kv_heads=model_config.num_key_value_heads,
-    tensor_parallel_size=1,
+    tensor_parallel_size=self.engine.world_size,
     dtype=dtype,
     device=self.device,
     total_num_gpu_blocks=total_num_gpu_blocks,
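For context, here is a minimal sketch of the corrected construction. The argument names are taken from the diff above; the import path, the `build_kv_cache_manager` helper, and the way the engine and config objects are passed around are assumptions for illustration, not the actual TGIS implementation.

```python
# Sketch only: keyword arguments mirror the diff above; the import path and the
# shape of this helper are assumptions about the fms-extras / TGIS code.
from fms_extras.utils.cache.paged import PagedKVCacheManager  # assumed module path


def build_kv_cache_manager(model_config, engine, dtype, device, total_num_gpu_blocks):
    """Hypothetical helper mirroring the __init__ call patched in this commit."""
    return PagedKVCacheManager(
        model_config.num_attention_heads,
        model_config.hidden_size,
        kv_heads=model_config.num_key_value_heads,
        # Previously hard-coded to 1, which broke TP>1 with speculative decoding;
        # it must match the tensor-parallel world size so each rank sizes its
        # paged KV cache for its own shard of the attention heads.
        tensor_parallel_size=engine.world_size,
        dtype=dtype,
        device=device,
        total_num_gpu_blocks=total_num_gpu_blocks,
    )
```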
