
triton with vllm chat completions #8475


Description

@geraldstanje

Hi,

How can I do the following with Triton and the vLLM backend? I only found a generate endpoint and can't find /chat/completions. Also, how do I set max_num_tokens?

import requests
import json

# Your local vLLM server (OpenAI-compatible)
API_URL = "http://localhost:8000/v1/chat/completions"

# Chat-style messages
messages = [
    {"role": "system", "content": "You are a friendly assistant that writes in haiku form."},
    {"role": "user", "content": "Write a haiku about coding on a Mac."}
]

# Request payload
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": messages,
    "max_tokens": 50,
    "temperature": 0.7
}

# Send request
response = requests.post(API_URL, headers={"Content-Type": "application/json"}, data=json.dumps(payload))

# Print nicely formatted output
if response.ok:
    data = response.json()
    print("Assistant:", data["choices"][0]["message"]["content"])
else:
    print("Error:", response.text)

cc @pskiran1 @dinhanhx @amit-timalsina @ahakanbaba
