Commit cd732f5

tuhins and Tuhin Srivastava authored

mixtral-8x7b-instruct-vllm-a100-t-tp2 (#243)

Mixtral 8x7B — VLLM TP2 — A100:2

Co-authored-by: Tuhin Srivastava <[email protected]>

1 parent 1743dce commit cd732f5

File tree: 4 files changed, +124 -0 lines changed

mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2/README.md
Lines changed: 62 additions & 0 deletions

# Mixtral 8x7B Instruct Truss

This is a [Truss](https://truss.baseten.co/) for Mixtral 8x7B Instruct, a mixture-of-experts (MoE) language model released by [Mistral AI](https://mistral.ai/). This README will walk you through how to deploy this Truss on Baseten to get your own instance of it.
## Deployment

First, clone this repository:

```sh
git clone https://github.com/basetenlabs/truss-examples/
cd truss-examples/mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2
```

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`

With `truss-examples/mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2` as your working directory, you can deploy the model with:

```sh
truss push --publish
```

Paste your Baseten API key if prompted.

For more information, see [Truss documentation](https://truss.baseten.co).

### Hardware notes

You need two A100s to run Mixtral at `fp16`. If you need access to A100s, please [contact us](mailto:[email protected]).
## Mixtral 8x7B Instruct API documentation

This section provides an overview of the Mixtral 8x7B Instruct API, its parameters, and how to use it. The API consists of a single route named `predict`, which you can invoke to generate text based on the provided prompt.

### API route: `predict`

The `predict` route is the primary method for generating text completions based on a given prompt. It takes several parameters:

- __prompt__: The input text that you want the model to generate a response for.
- __stream__ (optional, default=`True`): A boolean determining whether the model should stream a response back. When `True`, the API returns generated text as it becomes available.
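
Any other top-level keys in the request body are passed through to vLLM's `SamplingParams` (see `model/model.py` below), so you can tune sampling directly from the client. For illustration, `temperature` and `max_tokens` are two standard vLLM sampling parameters you could pass this way:

```sh
truss predict -d '{"prompt": "What is the Mistral wind?", "temperature": 0.7, "max_tokens": 512}'
```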

## Example usage

```sh
truss predict -d '{"prompt": "What is the Mistral wind?"}'
```

You can also invoke your model via a REST API:

```sh
curl -X POST "https://app.baseten.co/model_versions/YOUR_MODEL_VERSION_ID/predict" \
  -H "Content-Type: application/json" \
  -H 'Authorization: Api-Key {YOUR_API_KEY}' \
  -d '{
    "prompt": "What is the meaning of life? Answer in substantial detail with multiple examples from famous philosophies, religions, and schools of thought.",
    "stream": true,
    "max_tokens": 4096
  }' --no-buffer
```
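
To consume the streamed response from Python, here is a minimal client sketch using the `requests` library; it assumes the same endpoint and `Api-Key` header shown in the curl call above, with placeholder values you would replace:

```python
# Minimal streaming client sketch; the endpoint and auth mirror the curl
# example above. YOUR_MODEL_VERSION_ID and YOUR_API_KEY are placeholders.
import requests

resp = requests.post(
    "https://app.baseten.co/model_versions/YOUR_MODEL_VERSION_ID/predict",
    headers={"Authorization": "Api-Key YOUR_API_KEY"},
    json={"prompt": "What is the Mistral wind?", "stream": True, "max_tokens": 512},
    stream=True,  # keep the connection open and read chunks as they arrive
)
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
```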

mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2/config.yaml
Lines changed: 13 additions & 0 deletions

```yaml
environment_variables: {}
external_package_dirs: []
model_name: Mixtral 8x7B — VLLM TP2 — A100:2
python_version: py310
requirements:
  - vllm
resources:
  accelerator: A100:2
  use_gpu: true
runtime:
  predict_concurrency: 128
secrets: {}
system_packages: []
```

mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2/model/__init__.py

Whitespace-only changes.

mistral/mixtral-8x7b-instruct-vllm-a100-t-tp2/model/model.py
Lines changed: 49 additions & 0 deletions
1+
import subprocess
2+
import uuid
3+
4+
from vllm import SamplingParams
5+
from vllm.engine.arg_utils import AsyncEngineArgs
6+
from vllm.engine.async_llm_engine import AsyncLLMEngine
7+
8+
9+
class Model:
10+
def __init__(self, **kwargs):
11+
self.model = None
12+
self.llm_engine = None
13+
self.model_args = None
14+
15+
command = "ray start --head"
16+
subprocess.check_output(command, shell=True, text=True)
17+
18+
def load(self):
19+
self.model_args = AsyncEngineArgs(
20+
model="mistralai/Mixtral-8x7B-Instruct-v0.1",
21+
tensor_parallel_size=2,
22+
gpu_memory_utilization=0.95,
23+
max_model_len=4096,
24+
)
25+
self.llm_engine = AsyncLLMEngine.from_engine_args(self.model_args)
26+
27+
async def predict(self, model_input):
28+
prompt = model_input.pop("prompt")
29+
stream = model_input.pop("stream", True)
30+
31+
sampling_params = SamplingParams(**model_input)
32+
idx = str(uuid.uuid4().hex)
33+
vllm_generator = self.llm_engine.generate(prompt, sampling_params, idx)
34+
35+
async def generator():
36+
full_text = ""
37+
async for output in vllm_generator:
38+
text = output.outputs[0].text
39+
delta = text[len(full_text) :]
40+
full_text = text
41+
yield delta
42+
43+
if stream:
44+
return generator()
45+
else:
46+
full_text = ""
47+
async for delta in generator():
48+
full_text += delta
49+
return {"text": full_text}
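
As a quick illustration of how this class is driven, here is a hypothetical local smoke test (not part of the commit; it assumes two A100s and the vLLM dependencies are available, since Baseten's runtime normally constructs the model and calls `load()`/`predict()` itself):

```python
# Hypothetical local driver, for illustration only.
import asyncio


async def main():
    model = Model()
    model.load()
    # predict() returns an async generator when stream is True (the default).
    deltas = await model.predict({"prompt": "What is the Mistral wind?", "max_tokens": 64})
    async for delta in deltas:
        print(delta, end="", flush=True)


asyncio.run(main())
```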
