Hi,
How can I do the following with Triton and the vLLM backend? I only found a generate endpoint - why can't I find /chat/completions? And how do I set max_num_tokens?
import requests
import json

# Your local vLLM server (OpenAI-compatible)
API_URL = "http://localhost:8000/v1/chat/completions"

# Chat-style messages
messages = [
    {"role": "system", "content": "You are a friendly assistant that writes in haiku form."},
    {"role": "user", "content": "Write a haiku about coding on a Mac."}
]

# Request payload
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "messages": messages,
    "max_tokens": 50,
    "temperature": 0.7
}

# Send request
response = requests.post(API_URL, headers={"Content-Type": "application/json"}, data=json.dumps(payload))

# Print nicely formatted output
if response.ok:
    data = response.json()
    print("Assistant:", data["choices"][0]["message"]["content"])
else:
    print("Error:", response.text)