---
title: Using structured outputs in vLLM
date: 2025-03-16T19:28:00.657Z
author: Ismael Delgado Muñoz
authorimage: /img/Avatar6.svg
thumbnailimage: ""
disable: false
tags:
  - AI
  - LLM
---

## Using structured outputs in vLLM

Generating predictable and reliable outputs from large language models (LLMs) can be challenging, especially when those outputs need to integrate seamlessly with downstream systems. Structured outputs solve this problem by enforcing specific formats, such as JSON, regex patterns, or even grammars. vLLM has supported this for some time, but there was no documentation on how to use it, which is why I decided to contribute the Structured Outputs documentation page (https://docs.vllm.ai/en/latest/usage/structured_outputs.html).

### Why Structured Outputs?

LLMs are incredibly powerful, but their outputs can be inconsistent when a specific format is required. Structured outputs address this issue by restricting the model’s generated text to adhere to predefined rules or formats, ensuring:

1. **Reliability:** Outputs are predictable and machine-readable.
2. **Compatibility:** Seamless integration with APIs, databases, or other systems.
3. **Efficiency:** No need for extensive post-processing to validate or fix outputs.

Imagine we have an external system that receives a JSON with all the details needed to trigger an alert, and we want our LLM-based system to be able to use it. Of course, we can try to explain to the LLM what the output format should be and that it must be valid JSON, but LLMs are not deterministic, so we may still end up with invalid JSON. If you have tried something like this before, you have probably found yourself in this situation.
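To make the failure mode concrete, here is a minimal sketch of what often happens when we only ask nicely for JSON (the truncated reply string is a hypothetical example, not a recorded model output):

```python
import json

# A typical free-form reply: the model wrapped the JSON in prose
# and dropped the closing brace, so parsing fails downstream.
llm_reply = 'Sure! Here is your alert: {"severity": "high", "source": "disk-monitor"'

try:
    alert = json.loads(llm_reply)
except json.JSONDecodeError:
    alert = None  # the alerting system never gets triggered

print(alert)  # -> None
```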

How do these tools work? The idea is to filter the list of possible next tokens so that we only ever generate a token that is valid for the desired output format.

![Structured outputs using vLLM](/img/structured_outputs_thumbnail.png "Structured outputs using vLLM")
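The mechanism can be sketched in a few lines of plain Python (a toy vocabulary and scores, not vLLM's actual implementation): at each decoding step, tokens that cannot extend a valid output get their scores masked out before sampling.

```python
import math

# Toy vocabulary and unnormalized scores (logits) for the next token.
vocab = ["positive", "negative", "banana", "{", "hello"]
logits = [2.1, 1.7, 3.5, 0.2, 2.9]

# Constraint: the output must be one of two choices (like guided_choice).
allowed = {"positive", "negative"}

# Mask: disallowed tokens get -inf, so their probability becomes zero.
masked = [score if tok in allowed else -math.inf
          for tok, score in zip(vocab, logits)]

# Greedy pick over the masked scores: only valid tokens can win,
# even though "banana" had the highest raw score.
best = max(range(len(vocab)), key=lambda i: masked[i])
print(vocab[best])  # -> "positive"
```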

### What is vLLM?

vLLM is a state-of-the-art, open-source inference and serving engine for LLMs. It’s built for performance and simplicity, offering:

* **PagedAttention:** An innovative memory management mechanism for efficient attention key-value handling.
* **Continuous Batching:** Supports concurrent requests dynamically.
* **Advanced Optimizations:** Includes features like quantization, speculative decoding, and CUDA graphs.

These optimizations make vLLM one of the fastest and most versatile engines for production environments.

### Structured outputs on vLLM

vLLM extends the OpenAI API with additional parameters to enable structured outputs. These include:

* **`guided_choice`:** Restricts output to a set of predefined choices.
* **`guided_regex`:** Ensures outputs match a given regex pattern.
* **`guided_json`:** Validates outputs against a JSON schema.
* **`guided_grammar`:** Enforces structure using context-free grammars.

Here’s how each works, along with example outputs:
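All of the examples below assume a vLLM OpenAI-compatible server is already running on `localhost:8000`; one way to start it with the model used in the snippets is:

```shell
vllm serve Qwen/Qwen2.5-3B-Instruct
```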

#### **1. Guided Choice**

Simplest form of structured output, ensuring the response is one of a set of predefined options.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```

**Example Output:**

```
positive
```
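Because the reply is guaranteed to be one of the listed strings, downstream code can dispatch on it directly, with no normalization or fuzzy matching. A small sketch, where the `sentiment` value stands in for the API response:

```python
# With guided_choice, the response is exactly one of these keys,
# so a plain dictionary lookup is safe -- no .lower(), no stripping.
actions = {
    "positive": "thank the user",
    "negative": "open a support ticket",
}

sentiment = "positive"  # stand-in for completion.choices[0].message.content
print(actions[sentiment])  # -> "thank the user"
```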

#### **2. Guided Regex**

Constrains output to match a regex pattern, useful for formats like email addresses.

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Generate an example email address for Alan Turing at Enigma. End in .com.",
        }
    ],
    extra_body={"guided_regex": r"\w+@\w+\.com\n", "stop": ["\n"]},
)
print(completion.choices[0].message.content)
```

**Example Output:**

```
```
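The server enforces the pattern, but you can sanity-check it offline with Python's `re` module (the sample address below is hypothetical, not a recorded model output):

```python
import re

# Same pattern passed as guided_regex in the request above.
pattern = r"\w+@\w+\.com\n"

# A plausible constrained completion; the trailing newline matches
# the pattern and doubles as the stop sequence.
sample = "alanturing@enigma.com\n"

print(bool(re.fullmatch(pattern, sample)))  # -> True
```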

#### **3. Guided JSON**

Enforces a valid JSON format based on a schema, simplifying integration with other systems.

```python
from pydantic import BaseModel
from enum import Enum

class CarType(str, Enum):
    sedan = "sedan"
    suv = "SUV"
    truck = "Truck"
    coupe = "Coupe"

class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: CarType

json_schema = CarDescription.model_json_schema()

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Generate a JSON for the most iconic car from the 90s."}
    ],
    extra_body={"guided_json": json_schema},
)
print(completion.choices[0].message.content)
```

**Example Output:**

```json
{
    "brand": "Toyota",
    "model": "Supra",
    "car_type": "Coupe"
}
```
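The generated document can be parsed and checked like any other JSON. A minimal stdlib-only check against the fields and enum values defined above (in practice you would simply call `CarDescription.model_validate_json` on the response):

```python
import json

# The example output from the guided_json request above.
raw = '{"brand": "Toyota", "model": "Supra", "car_type": "Coupe"}'

car = json.loads(raw)

# The schema guarantees these keys exist and car_type is a valid enum value.
assert set(car) == {"brand", "model", "car_type"}
assert car["car_type"] in {"sedan", "SUV", "Truck", "Coupe"}
print(car["brand"], car["model"])  # -> Toyota Supra
```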

#### **4. Guided Grammar**

Uses an EBNF grammar to define complex output structures, such as SQL queries.

```python
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Generate a SQL query to find all users older than 30."}
    ],
    extra_body={
        "guided_grammar": """
            query ::= "SELECT" fields "FROM users WHERE" condition;
            fields ::= "name, age" | "*";
            condition ::= "age >" number;
            number ::= [0-9]+;
        """
    },
)
print(completion.choices[0].message.content)
```

**Example Output:**

```sql
SELECT * FROM users WHERE age > 30;
```
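As a quick offline check, this tiny grammar only admits a narrow family of queries, which can be mirrored by a hand-written regular expression for illustration:

```python
import re

# Hand-translated from the EBNF above:
# SELECT ("name, age" | "*") FROM users WHERE age > <number>
query_shape = re.compile(r"SELECT (\*|name, age) FROM users WHERE age > [0-9]+;?")

print(bool(query_shape.fullmatch("SELECT * FROM users WHERE age > 30;")))  # -> True
print(bool(query_shape.fullmatch("DROP TABLE users;")))                    # -> False
```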

### **Next Steps**

To start integrating structured outputs into your projects:

1. **Explore the Documentation:** Check out the official documentation for more examples and detailed explanations.
2. **Install vLLM Locally:** Set up the inference server on your local machine using the vLLM GitHub repository.
3. **Experiment with Structured Outputs:** Try out different formats (choice, regex, JSON, grammar) and observe how they can simplify your workflow.
4. **Deploy in Production:** Once comfortable, deploy vLLM to your production environment and integrate it with your applications.

Structured outputs make LLMs not only powerful but also practical for real-world applications. Dive in and see what you can build!