
Solar TRT-LLM Truss (#214)
Commit 304413d (1 parent 6957617)
File tree: 13 files changed, +1579 −0 lines changed
Lines changed: 91 additions & 0 deletions
# TRTLLM

### Overview

This Truss adds support for TRT-LLM engines via Triton Inference Server. TRT-LLM is a highly performant language model runtime. We leverage the C++ runtime to take advantage of in-flight batching (also known as continuous batching).
### Prerequisites

To use this Truss, your engine must be built with in-flight batching support. Refer to your architecture-specific `build.py` for instructions on building with in-flight batching support.
### Config

This Truss is primarily config driven, meaning that most of the settings you'll need to edit live in `config.yaml`, underneath the `model_metadata` key.

- `tensor_parallelism` (int): If you built your model with tensor parallelism support, set this to the same value used during the engine build step. It should also match the number of GPUs in the `resources` section.
*Pipeline parallelism is not supported in this version but will be added later. As noted by Nvidia, pipeline parallelism reduces the need for high-bandwidth communication but may incur load-balancing issues and may be less efficient in terms of GPU utilization.*

- `engine_repository` (str): We expect engines to be uploaded to Hugging Face with a flat directory structure (i.e. the engine and associated files are not nested inside a folder). This value is the full `{org_name}/{repo_name}` string. Engines can be private or public.

- `tokenizer_repository` (str): Engines do not come bundled with their own tokenizer. This is the Hugging Face repository where we can find a tokenizer. Tokenizers can be private or public.

If the engine and tokenizer repositories are private, you'll need to update the `secrets` section of the `config.yaml` as follows:
```
secrets:
  hf_access_token: "my_hf_api_key"
```
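
Under the hood, this token is used to authenticate downloads from the Hugging Face Hub. As a rough illustration of what that access looks like (the repository names below are placeholders, and the exact download helper used by this Truss may differ):

```python
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

hf_access_token = "my_hf_api_key"  # value supplied via the `hf_access_token` secret

# Engine files live in a flat HF repo; the tokenizer comes from its own repo.
engine_dir = snapshot_download("my-org/my-engine-repo", token=hf_access_token)
tokenizer = AutoTokenizer.from_pretrained("my-org/my-tokenizer-repo", token=hf_access_token)
```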
### Performance

TRT-LLM engines are designed to be highly performant. Once your Truss has been deployed, you may find that you're not fully utilizing the GPU. The following are levers to improve performance, but they require trial and error to identify appropriate values. All of these values live inside the `config.pbtxt` for a given ensemble model.
#### Preprocessing / Postprocessing

```
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```

By default, we load 1 instance of the pre/post models. If you find that the tokenizer is a bottleneck, increasing the `count` variable here will load more replicas of these models and Triton will automatically load balance across model instances.
### TensorRT-LLM

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "10000"
  }
}
```

By default, we set `max_tokens_in_paged_kv_cache` to 10000. For a 13B model on 1 A100 with a batch size of 8, we have over 60GB of GPU memory left over, so we can comfortably increase this value to 100k and allow for more tokens in the KV cache. Your mileage will vary based on the size of your model and the hardware you're running on.
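
To pick a value for your own deployment, it helps to estimate the per-token KV-cache footprint. Below is a rough back-of-envelope sketch, assuming a Llama-style 13B model (40 layers, hidden size 5120) with an fp16 KV cache; the exact numbers depend on your architecture and any KV-cache quantization:

```python
# Back-of-envelope KV-cache sizing (illustrative architecture numbers).
num_layers = 40        # assumed for a Llama-style 13B model
hidden_size = 5120     # num_heads * head_dim
bytes_per_value = 2    # fp16

# Each token stores one K and one V vector per layer.
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value   # ~0.82 MB

max_tokens_in_paged_kv_cache = 10_000
total_gb = bytes_per_token * max_tokens_in_paged_kv_cache / 1e9
print(f"KV cache for {max_tokens_in_paged_kv_cache} tokens: ~{total_gb:.1f} GB")
```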
```
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.1"
  }
}
```

If `max_tokens_in_paged_kv_cache` is unset, Triton Inference Server will instead attempt to preallocate this fraction of free GPU memory for the KV cache.
```
parameters: {
  key: "max_num_sequences"
  value: {
    string_value: "64"
  }
}
```

The `max_num_sequences` param is the maximum number of requests that the inference server can maintain state for at a given time (state = KV cache + decoder state). See this [comment](https://github.com/NVIDIA/TensorRT-LLM/issues/65#issuecomment-1774332446) for more details. Setting this value higher allows for more parallel processing but uses more GPU memory.
### API

We expect requests with the following fields:

- `prompt` (str): The prompt you'd like to complete.
- `max_tokens` (int, default: 50): The max token count. This includes the number of tokens in your prompt, so if this value is smaller than your prompt, you'll just receive a truncated version of the prompt.
- `beam_width` (int, default: 1): The number of beams to compute. This must be 1 for this version of TRT-LLM; in-flight batching does not support beams > 1.
- `bad_words_list` (list, default: []): A list of words to exclude from the generated output.
- `stop_words_list` (list, default: []): A list of words that stop generation when encountered.
- `repetition_penalty` (float, default: 1.0): A repetition penalty to discourage repeating tokens.

This Truss will stream responses back. Responses will be buffered chunks of text.
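
For example, a streaming request against a deployed Truss might look like the sketch below (the model ID, API key, and endpoint URL are placeholders for your own Baseten deployment):

```python
import requests

# Placeholder endpoint and API key; substitute your deployment's values.
resp = requests.post(
    "https://model-<MODEL_ID>.api.baseten.co/production/predict",
    headers={"Authorization": "Api-Key <YOUR_API_KEY>"},
    json={
        "prompt": "What is in-flight batching?",
        "max_tokens": 256,
        "beam_width": 1,
        "repetition_penalty": 1.0,
    },
    stream=True,
)

# Responses stream back as buffered chunks of text.
for chunk in resp.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="", flush=True)
```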
Lines changed: 28 additions & 0 deletions
base_image:
  image: docker.io/baseten/trtllm-server:r23.12_baseten_v0.9.0.dev2024022000
  python_executable_path: /usr/bin/python3
description: Generate text from a prompt with this 10.7 billion parameter language
  model.
environment_variables:
  HF_HUB_ENABLE_HF_TRANSFER: true
external_package_dirs: []
model_metadata:
  avatar_url: https://cdn.baseten.co/production/static/explore/meta.png
  cover_image_url: https://cdn.baseten.co/production/static/explore/llama.png
  engine_repository: baseten/solar10.7
  tags:
  - text-generation
  tensor_parallelism: 1
  tokenizer_repository: upstage/SOLAR-10.7B-Instruct-v1.0
model_name: Solar 10.7B
python_version: py311
requirements:
- tritonclient[all]
- hf_transfer
resources:
  accelerator: H100
  use_gpu: true
runtime:
  predict_concurrency: 256
secrets: {}
system_packages: []
Lines changed: 37 additions & 0 deletions
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
gpt_float16_tp2_rank0.engine filter=lfs diff=lfs merge=lfs -text
gpt_float16_tp2_rank1.engine filter=lfs diff=lfs merge=lfs -text

llama/solar-10b-trt-llm/model/__init__.py

Whitespace-only changes.
Lines changed: 139 additions & 0 deletions
import os
from itertools import count
from pathlib import Path
from threading import Thread

import numpy as np
from client import TritonClient, UserData
from transformers import AutoTokenizer
from utils import download_engine, prepare_grpc_tensor, server_loaded

TRITON_MODEL_REPOSITORY_PATH = Path("/packages/inflight_batcher_llm/")


class Model:
    def __init__(self, **kwargs):
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self._request_id_counter = count(start=1)
        self.triton_client = None
        self.tokenizer = None
        self.uses_openai_api = (
            "openai-compatible" in self._config["model_metadata"]["tags"]
        )

    def load(self):
        tensor_parallel_count = self._config["model_metadata"].get(
            "tensor_parallelism", 1
        )
        pipeline_parallel_count = self._config["model_metadata"].get(
            "pipeline_parallelism", 1
        )
        if "hf_access_token" in self._secrets._base_secrets.keys():
            hf_access_token = self._secrets["hf_access_token"]
        else:
            hf_access_token = None
        is_external_engine_repo = "engine_repository" in self._config["model_metadata"]

        # Instantiate TritonClient
        self.triton_client = TritonClient(
            data_dir=self._data_dir,
            model_repository_dir=TRITON_MODEL_REPOSITORY_PATH,
            parallel_count=tensor_parallel_count * pipeline_parallel_count,
        )

        # Download model from Hugging Face Hub if specified
        if is_external_engine_repo:
            if not server_loaded():
                download_engine(
                    engine_repository=self._config["model_metadata"][
                        "engine_repository"
                    ],
                    fp=self._data_dir,
                    auth_token=hf_access_token,
                )

        # Load Triton Server and model
        tokenizer_repository = self._config["model_metadata"]["tokenizer_repository"]
        env = {"triton_tokenizer_repository": tokenizer_repository}
        if hf_access_token is not None:
            env["HUGGING_FACE_HUB_TOKEN"] = hf_access_token

        self.triton_client.load_server_and_model(env=env)

        # setup eos token
        self.tokenizer = AutoTokenizer.from_pretrained(
            tokenizer_repository, token=hf_access_token
        )
        self.eos_token_id = self.tokenizer.eos_token_id

    def predict(self, model_input):
        user_data = UserData()
        model_name = "ensemble"
        stream_uuid = str(os.getpid()) + str(next(self._request_id_counter))

        if self.uses_openai_api:
            prompt = self.tokenizer.apply_chat_template(
                model_input.get("messages"),
                tokenize=False,
            )
        else:
            prompt = model_input.get("prompt")

        max_tokens = model_input.get("max_tokens", 50)
        beam_width = model_input.get("beam_width", 1)
        bad_words_list = model_input.get("bad_words_list", [""])
        stop_words_list = model_input.get("stop_words_list", [""])
        repetition_penalty = model_input.get("repetition_penalty", 1.0)
        ignore_eos = model_input.get("ignore_eos", False)
        stream = model_input.get("stream", True)

        input0 = [[prompt]]
        input0_data = np.array(input0).astype(object)
        output0_len = np.ones_like(input0).astype(np.uint32) * max_tokens
        bad_words_list = np.array([bad_words_list], dtype=object)
        stop_words_list = np.array([stop_words_list], dtype=object)
        stream_data = np.array([[stream]], dtype=bool)
        beam_width_data = np.array([[beam_width]], dtype=np.uint32)
        repetition_penalty_data = np.array([[repetition_penalty]], dtype=np.float32)

        inputs = [
            prepare_grpc_tensor("text_input", input0_data),
            prepare_grpc_tensor("max_tokens", output0_len),
            prepare_grpc_tensor("bad_words", bad_words_list),
            prepare_grpc_tensor("stop_words", stop_words_list),
            prepare_grpc_tensor("stream", stream_data),
            prepare_grpc_tensor("beam_width", beam_width_data),
            prepare_grpc_tensor("repetition_penalty", repetition_penalty_data),
        ]

        if not ignore_eos:
            end_id_data = np.array([[self.eos_token_id]], dtype=np.uint32)
            inputs.append(prepare_grpc_tensor("end_id", end_id_data))
        else:
            # do nothing, trt-llm by default doesn't stop on `eos`
            pass

        # Start GRPC stream in a separate thread
        stream_thread = Thread(
            target=self.triton_client.start_grpc_stream,
            args=(user_data, model_name, inputs, stream_uuid),
        )
        stream_thread.start()

        def generate():
            # Yield results from the queue
            for i in TritonClient.stream_predict(user_data):
                yield i

            # Clean up GRPC stream and thread
            self.triton_client.stop_grpc_stream(stream_uuid, stream_thread)

        if stream:
            return generate()
        else:
            if self.uses_openai_api:
                return "".join(generate())
            else:
                return {"text": "".join(generate())}
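
For reference, a sketch of the input `predict` accepts and how the return shape varies (values below are illustrative):

```python
# Streaming request: predict(...) returns a generator that yields text chunks.
streaming_input = {
    "prompt": "What is in-flight batching?",
    "max_tokens": 128,
    "beam_width": 1,
    "stream": True,
}

# Non-streaming request: predict(...) returns {"text": "..."}, or a plain string
# when the "openai-compatible" tag is set in config.yaml (which also switches the
# input to an OpenAI-style "messages" list).
non_streaming_input = {
    "prompt": "What is in-flight batching?",
    "max_tokens": 128,
    "stream": False,
}
```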
