# TensorRT Torch Backend Briton with deepseek-ai/DeepSeek-R1-Distill-Llama-70B

This is a deployment of deepseek-ai/DeepSeek-R1-Distill-Llama-70B using Briton with the TensorRT-LLM Torch backend. Briton is Baseten's solution for production-grade deployments of causal language models (e.g. Llama, Qwen, Mistral) via TensorRT-LLM.

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and in-flight batching
- *Distributed inference*: run large models (such as Llama-405B) tensor-parallel across multiple GPUs
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long-context generation tasks

Optionally, you can also enable:
- *Speculative decoding*, using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200, and L4 GPUs

With the V2 config, you can now also quantize models straight from Hugging Face to FP8 and FP4, and make use of KV caching.
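
The quantization format is controlled by `quantization_type` under `trt_llm.build`, and the V2 stack is selected with `inference_stack: v2` in the full config.yaml in this directory. A minimal sketch of the relevant fields (the alternative values are illustrative, not exhaustive):

```yaml
trt_llm:
  inference_stack: v2       # V2 (Torch backend) engine
  build:
    quantization_type: fp4_kv   # used in this deployment; fp8_kv or no_quant (bf16/fp16) are alternatives
```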


# Examples:
This deployment is specifically designed for the Hugging Face model [deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B).
Suitable models can be identified by the `ForCausalLM` suffix in the model's architecture name. We currently support models such as Llama, Qwen, and Mistral.

deepseek-ai/DeepSeek-R1-Distill-Llama-70B is a text-generation model: it generates text given a prompt. It is frequently used for chatbots, text completion, structured output, and more.


## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`


First, clone this repository:
```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-deepseek-ai-deepseek-r1-distill-llama-70b-EngineV2-fp4
```

With `11-embeddings-reranker-classification-tensorrt/Briton-deepseek-ai-deepseek-r1-distill-llama-70b-EngineV2-fp4` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-deepseek-ai-deepseek-r1-distill-llama-70b-EngineV2-fp4-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```

## Call your model

### OpenAI compatible inference
This solution is OpenAI compatible, which means you can use the OpenAI client library to interact with the model.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool-calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools,
)

print(completion.choices[0].message.tool_calls)
```
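
Streaming also works through the same OpenAI-compatible endpoint (the example input in `config.yaml` sets `stream: true`). A minimal sketch reusing the `client` from above, with an illustrative prompt:

```python
# Streaming chat completion: tokens are printed as they arrive.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Tell me everything you know about optimized inference."}],
    temperature=0.5,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; skip empty keep-alive chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```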


## Config.yaml
By default, the following configuration is used for this deployment. This config uses `quantization_type: fp4_kv`. Quantization is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16.

```yaml
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: Briton-deepseek-ai-deepseek-r1-distill-llama-70b-EngineV2-fp4-truss-example
python_version: py39
resources:
  accelerator: B200
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
      revision: main
      source: HF
    quantization_type: fp4_kv
  runtime:
    max_batch_size: 32
    max_num_tokens: 32768
    max_seq_len: 32768
  version_overrides:
    briton_version: null
    engine_builder_version: 0.20.0.post13.dev3

```

## Support
If you have any questions or need assistance, please open an issue in this repository or contact our support team.
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: Briton-deepseek-ai-deepseek-r1-distill-llama-70b-EngineV2-fp4-truss-example
python_version: py39
resources:
  accelerator: B200
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  inference_stack: v2
  build:
    base_model: decoder
    checkpoint_repository:
      repo: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
      revision: main
      source: HF
    quantization_type: fp4_kv
  runtime:
    max_batch_size: 32
    max_num_tokens: 32768
    max_seq_len: 32768
  version_overrides:
    briton_version: null
    engine_builder_version: 0.20.0.post13.dev3
# TensorRT Torch Backend Briton with meta-llama/Llama-3.2-3B-Instruct

This is a deployment of meta-llama/Llama-3.2-3B-Instruct using Briton with the TensorRT-LLM Torch backend. Briton is Baseten's solution for production-grade deployments of causal language models (e.g. Llama, Qwen, Mistral) via TensorRT-LLM.

With Briton you get the following benefits by default:
- *Lowest-latency* inference, beating frameworks such as vLLM
- *Highest-throughput* inference, automatically using XQA kernels, paged KV caching, and in-flight batching
- *Distributed inference*: run large models (such as Llama-405B) tensor-parallel across multiple GPUs
- *JSON-schema-based structured output* for any model
- *Chunked prefill* for long-context generation tasks

Optionally, you can also enable:
- *Speculative decoding*, using an external draft model or self-speculative decoding
- *FP8 quantization* for deployments on H100, H200, and L4 GPUs

With the V2 config, you can now also quantize models straight from Hugging Face to FP8 and FP4, and make use of KV caching.
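
The quantization format is controlled by `quantization_type` under `trt_llm.build`. A minimal sketch of the relevant fields from the `config.yaml` shown further below (the `no_quant` alternative is the one described in the Config.yaml section):

```yaml
trt_llm:
  build:
    quantization_type: fp8_kv   # used in this deployment; set to no_quant for bf16/fp16
```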


# Examples:
This deployment is specifically designed for the Hugging Face model [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).
Suitable models can be identified by the `ForCausalLM` suffix in the model's architecture name. We currently support models such as Llama, Qwen, and Mistral.

meta-llama/Llama-3.2-3B-Instruct is a text-generation model: it generates text given a prompt. It is frequently used for chatbots, text completion, structured output, and more.

This model is quantized to FP8 for deployment, which is supported by NVIDIA's recent GPUs, e.g. H100, H100_40GB, or L4. Quantization is optional but leads to higher efficiency.

## Deployment with Truss

Before deployment:

1. Make sure you have a [Baseten account](https://app.baseten.co/signup) and [API key](https://app.baseten.co/settings/account/api_keys).
2. Install the latest version of Truss: `pip install --upgrade truss`
Note: this is a gated/private model. Retrieve your Hugging Face token from your [settings](https://huggingface.co/settings/tokens) and set it as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_token`. Do not put the actual token value in the config.yaml; `hf_access_token: null` is fine, as the true value will be fetched from the secret store (see the sketch below).
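
A minimal sketch of how the secret is typically declared in a Truss `config.yaml` (this `secrets` stanza is an assumption for illustration; it is not part of the config shown below):

```yaml
secrets:
  hf_access_token: null   # placeholder only; the real token is read from Baseten's secret store
```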

First, clone this repository:
```sh
git clone https://github.com/basetenlabs/truss-examples.git
cd 11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct-EngineV2-fp8
```

With `11-embeddings-reranker-classification-tensorrt/Briton-meta-llama-llama-3.2-3b-instruct-EngineV2-fp8` as your working directory, you can deploy the model with the following command. Paste your Baseten API key if prompted.

```sh
truss push --publish
# prints:
# ✨ Model Briton-meta-llama-llama-3.2-3b-instruct-EngineV2-fp8-truss-example was successfully pushed ✨
# 🪵 View logs for your deployment at https://app.baseten.co/models/yyyyyy/logs/xxxxxx
```

## Call your model

### OpenAI compatible inference
This solution is OpenAI compatible, which means you can use the OpenAI client library to interact with the model.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ['BASETEN_API_KEY'],
    base_url="https://model-xxxxxx.api.baseten.co/environments/production/sync/v1",
)

# Default completion
response_completion = client.completions.create(
    model="not_required",
    prompt="Q: Tell me everything about Baseten.co! A:",
    temperature=0.3,
    max_tokens=100,
)

# Chat completion
response_chat = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "Tell me everything about Baseten.co!"}
    ],
    temperature=0.3,
    max_tokens=100,
)

# Structured output
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="not_required",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed

# If your model supports tool-calling, you can use the following example:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country e.g. Bogotá, Colombia"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        },
        "strict": True
    }
}]

completion = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "What is the weather like in Paris today?"}],
    tools=tools,
)

print(completion.choices[0].message.tool_calls)
```
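
Streaming also works through the same OpenAI-compatible endpoint (the example input in `config.yaml` sets `stream: true`). A minimal sketch reusing the `client` from above, with an illustrative prompt:

```python
# Streaming chat completion: tokens are printed as they arrive.
stream = client.chat.completions.create(
    model="not_required",
    messages=[{"role": "user", "content": "Tell me everything you know about optimized inference."}],
    temperature=0.5,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; skip empty keep-alive chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```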


## Config.yaml
By default, the following configuration is used for this deployment. This config uses `quantization_type: fp8_kv`. Quantization is optional: remove the `quantization_type` field or set it to `no_quant` to run in float16/bfloat16.
Note: this is a gated/private model. Retrieve your Hugging Face token from your [settings](https://huggingface.co/settings/tokens) and set it as a Baseten secret [here](https://app.baseten.co/settings/secrets) with the key `hf_access_token`. Do not put the actual token value in the config.yaml; `hf_access_token: null` is fine, as the true value will be fetched from the secret store.
```yaml
model_metadata:
  example_model_input:
    max_tokens: 512
    messages:
      - content: Tell me everything you know about optimized inference.
        role: user
    stream: true
    temperature: 0.5
  tags:
    - openai-compatible
model_name: Briton-meta-llama-llama-3.2-3b-instruct-EngineV2-fp8-truss-example
python_version: py39
resources:
  accelerator: H100_40GB
  cpu: '1'
  memory: 10Gi
  use_gpu: true
trt_llm:
  build:
    checkpoint_repository:
      repo: meta-llama/Llama-3.2-3B-Instruct
      revision: main
      source: HF
    quantization_type: fp8_kv
  runtime:
    max_batch_size: 32
    max_num_tokens: 32768
    max_seq_len: 32768
  version_overrides:
    briton_version: null
    engine_builder_version: 0.20.0.post13.dev3

```

## Support
If you have any questions or need assistance, please open an issue in this repository or contact our support team.