Commit 833f68f

Add abridged version of the .txt article on Coding For Structured Generation (#1012)
I've added an abridged version of this [post on the .txt blog](https://blog.dottxt.co/coding-for-structured-generation.html) to the cookbook. It should provide a good overview of a basic workflow for developing code when working with structured generation.
1 parent a643cb0 commit 833f68f

2 files changed: 215 additions, 0 deletions

# Structured Generation Workflow: Generating Synthetic Phone Numbers

This is a condensed version of [Coding for Structured Generation with LLMs](https://blog.dottxt.co/coding-for-structured-generation.html).

For this example we're going to build an LLM program that generates **synthetic data** in the form of realistic-looking phone numbers for Washington State. Using an LLM for this task *is a bit overkill*, since we could just as easily accomplish it with a tool like [Faker](https://fakerjs.dev/), but the example still serves as a useful way to demonstrate a workflow for using structured generation.

## Unstructured approach

Before diving into how to use structured generation for this task, let's start with an unstructured example. We begin by loading our model:

```python
import outlines

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
model = outlines.models.transformers(model_name)
```

Next we need a prompt for this model. Since we're focusing on structured generation, we won't be engaging in any form of "prompt hacking" and will leave this prompt untouched for the rest of this example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

messages_phone = [
    {"role": "user", "content": """
Please generate a realistic phone number for Washington State in the following format

(555) 555-5555

"""}
]

# This allows us to properly format our prompt for
# Mistral's 'Instruct' interface.
prompt_phone = tokenizer.apply_chat_template(messages_phone, tokenize=False)
```

With our prompt ready we can now generate 10 example phone numbers:

```python
phone_generator_unstruct = outlines.generate.text(model)
for _ in range(10):
    print(phone_generator_unstruct(prompt_phone, max_tokens=12))
```

> I'd be happy to help you generate a realistic phone\
I cannot generate a real phone number as I'm just\
I'm an AI and don't have the ability\
Sure! Here is a randomly generated phone number in the format\
Here's a phone number that fits the format for a\
In Washington State, phone numbers typically have a three-dig\
Here are a few examples of phone numbers that could be considered\
I'd be happy to help generate a realistic phone number\
I'd be happy to help you generate a random phone\
Based on the format you provided, a realistic phone number for

As we can see, none of these outputs are even phone numbers!

Let's see if we can improve this using structured generation.
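
The observation that none of the outputs contains a phone number can also be checked mechanically. The following is a minimal sketch rather than part of the original workflow: it copies the ten outputs shown above into a list and searches each one for the `(555) 555-5555` shape requested in the prompt. The names `unstructured_outputs` and `phone_shape` are introduced here purely for illustration.

```python
import re

# The ten truncated outputs printed by the unstructured generator above.
unstructured_outputs = [
    "I'd be happy to help you generate a realistic phone",
    "I cannot generate a real phone number as I'm just",
    "I'm an AI and don't have the ability",
    "Sure! Here is a randomly generated phone number in the format",
    "Here's a phone number that fits the format for a",
    "In Washington State, phone numbers typically have a three-dig",
    "Here are a few examples of phone numbers that could be considered",
    "I'd be happy to help generate a realistic phone number",
    "I'd be happy to help you generate a random phone",
    "Based on the format you provided, a realistic phone number for",
]

# Illustrative pattern for the shape requested in the prompt: (555) 555-5555.
phone_shape = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

# re.search finds the pattern anywhere in the text, so a number buried
# inside a longer sentence would still be counted.
n_with_phone = sum(1 for text in unstructured_outputs if phone_shape.search(text))
print(f"{n_with_phone} of {len(unstructured_outputs)} outputs contain a phone number")
# 0 of 10 outputs contain a phone number
```
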
60+
61+
## The Structured Generation Workflow
62+
63+
In order to solve this problem we're going to introduce a *Structured Generation Workflow* outlined in this image:
64+
65+
!["Visual of Structured Generation Workflow"](./images/coding_structure_diagram.png)
66+
67+
Let's step through this:
68+
69+
### Real example

We start with a real example phone number, in this case for the Seattle Public Library, that we can use to verify the structure we are creating.

```python
phone_number = "(206) 386-4636"
```

For a simple example like this, we'll use just a single phone number; for more complex tasks it can be helpful to have several examples.

### Draft Structure

The next step in the process is for us to define a simple regex that we feel correctly models our real data.

```python
phone_regex_1 = r'\([0-9]{3}\) [0-9]{3}-[0-9]{4}'
```

Next we need to validate this regex against our real data.

### Validate by matching examples

Whenever writing non-trivial code with structured generation it is *essential* that you first validate the code against your real data example(s).

We'll start with a simple method of validation: just checking that our regex matches the data.

```python
import re

re.match(phone_regex_1, phone_number)
# <re.Match object; span=(0, 14), match='(206) 386-4636'>
```

Now that we have a match, we can move on to generating structured output!

### Generate Structure

We're ready to see if structured generation can make an improvement over our initial unstructured approach:

```python
phone_generator_v1 = outlines.generate.regex(model, phone_regex_1)
for _ in range(10):
    print(phone_generator_v1(prompt_phone))
```

> (206) 555-1234\
(206) 555-1234\
(206) 555-1234\
(206) 555-1234\
(206) 555-1234\
(206) 555-1234\
(206) 123-4567\
(206) 555-1234\
(206) 555-1234\
(206) 555-1234

At least we have phone numbers! But I think we can do better!

### Inspect output

In this case the model *did* create phone numbers and, impressively, got the area code correct. So using structured generation did improve things. However, these numbers are pretty boring: nine of the ten are the same placeholder-style number. Let's improve that structure!
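
One way to put a number on "boring" is to tally how often each output appears. The sketch below is an illustration, not part of the original article: it hard-codes the ten outputs listed above (order does not matter for counting) and uses `collections.Counter` from the standard library. The variable names are hypothetical.

```python
from collections import Counter

# The ten outputs from the first structured run, copied from above.
outputs_v1 = ["(206) 555-1234"] * 9 + ["(206) 123-4567"]

counts = Counter(outputs_v1)

# most_common() lists each distinct number with its frequency, highest first.
for number, n in counts.most_common():
    print(number, n)
# (206) 555-1234 9
# (206) 123-4567 1

print(f"{len(counts)} distinct numbers out of {len(outputs_v1)}")
# 2 distinct numbers out of 10
```
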
130+
131+
## Iteration
132+
133+
We've walked through the loop once, so we can go quickly now through each iteration.
134+
135+
We start by improving our structure:
136+
137+
```python
138+
phone_regex_2 = r'\([0-9]{3}\) [2-46-9]{3}-[02-9]{4}'
139+
```
140+
141+
Before rushing to another round of generation, let's validate this new regex. We'll add just a bit more sophistication over our last check:
142+
143+
```python
144+
re.match(phone_regex_2, phone_number)[0] == phone_number
145+
# True
146+
```
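
The reason for comparing the matched text with the whole string is that `re.match` only anchors at the start of the input, so a pattern that covers just a prefix still counts as a match. The sketch below illustrates this with a deliberately truncated pattern, `too_short`, which is a hypothetical name introduced here and not one of the article's regexes. Python's `re.fullmatch` is shown as an alternative that requires the entire string to match.

```python
import re

phone_number = "(206) 386-4636"

# Hypothetical pattern for illustration: it omits the final digit.
too_short = r'\([0-9]{3}\) [0-9]{3}-[0-9]{3}'

# re.match succeeds because the pattern matches a prefix of the string.
prefix_match = re.match(too_short, phone_number)
print(prefix_match is not None)          # True
print(prefix_match[0] == phone_number)   # False: only '(206) 386-463' matched

# re.fullmatch requires the whole string to match, so it returns None here.
whole_match = re.fullmatch(too_short, phone_number)
print(whole_match is None)               # True
```
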
147+
Now that we've validated, let's generate with this new regex!
148+
149+
```python
150+
phone_generator_v2 = outlines.generate.regex(model,
151+
phone_regex_2)
152+
for _ in range(10):
153+
print(phone_generator_v2(prompt_phone))
154+
```
155+
156+
> (206) 867-5309\
157+
(206) 666-7777\
158+
(206) 444-3333\
159+
(206) 444-3333\
160+
(206) 943-2222\
161+
(206) 323-6789\
162+
(206) 444-3333\
163+
(206) 867-5309\
164+
(206) 466-2255\
165+
(206) 222-3333
166+
167+
Better, but I don't like those repeated sequences. Like good software developers, let's iterate again!
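
Inspection like this can also be scripted. As a rough sketch, and assuming that "repeated sequence" means the same digit appearing three or more times in a row (an interpretation chosen here, not a definition from the article), a backreference regex can flag the offending outputs. The names `outputs_v2` and `repeated_run` are illustrative.

```python
import re

# The ten outputs from the second structured run, copied from above.
outputs_v2 = [
    "(206) 867-5309",
    "(206) 666-7777",
    "(206) 444-3333",
    "(206) 444-3333",
    "(206) 943-2222",
    "(206) 323-6789",
    "(206) 444-3333",
    "(206) 867-5309",
    "(206) 466-2255",
    "(206) 222-3333",
]

# (\d) captures one digit; \1{2,} requires that same digit at least twice more,
# i.e. a run of three or more identical digits with nothing in between.
repeated_run = re.compile(r'(\d)\1{2,}')

flagged = [number for number in outputs_v2 if repeated_run.search(number)]
print(f"{len(flagged)} of {len(outputs_v2)} numbers contain a repeated run")
# 6 of 10 numbers contain a repeated run
```
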
168+
169+
## Reiteration - with debugging
170+
171+
Here's a fancier regex that should give us more interesting results:
172+
173+
```python
174+
phone_regex_3_error = r'\([0-9]{3}\) [2-4][7-9][4-6]-[3-6][2-8][1-4]'
175+
```
176+
177+
This looks good to me, but there's a subtle bug, that's why we *always* need to validate our structure against real data. This time we'll make our validator do a bit more work to verify the correct string is matched:
178+
179+
```python
match = re.match(phone_regex_3_error, phone_number)
if not match:
    print("Regex fails match")
else:
    # Compare the matched text with the full example to catch partial matches.
    matched_string = match[0]
    if matched_string == phone_number:
        print("Successful match")
    else:
        print(f"Error {matched_string} != {phone_number}")
```

This prints out:

> Error (206) 386-463 != (206) 386-4636

Ah! We were missing the last digit. Let's fix that and regenerate:

```python
phone_regex_3_fixed = r'\([0-9]{3}\) [2-4][7-9][4-6]-[3-6][2-8][1-4][6-9]'
phone_generator_v3 = outlines.generate.regex(model, phone_regex_3_fixed)
for _ in range(10):
    print(phone_generator_v3(prompt_phone))
```

> (206) 494-3216\
(206) 374-6218\
(206) 494-3337\
(206) 476-3216\
(206) 484-3548\
(206) 495-3218\
(206) 494-5517\
(206) 375-4636\
(206) 384-6216\
(206) 385-6218

Much better!
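
As a final sanity check, the generated numbers themselves can be validated against the structure that produced them. This is a small sketch under the assumption that the ten outputs above are copied verbatim into a list; `outputs_v3`, `all_valid`, and `n_distinct` are names introduced here for illustration. It uses `re.fullmatch` so that each output must match the regex in its entirety.

```python
import re

phone_regex_3_fixed = r'\([0-9]{3}\) [2-4][7-9][4-6]-[3-6][2-8][1-4][6-9]'

# The ten outputs from the third structured run, copied from above.
outputs_v3 = [
    "(206) 494-3216",
    "(206) 374-6218",
    "(206) 494-3337",
    "(206) 476-3216",
    "(206) 484-3548",
    "(206) 495-3218",
    "(206) 494-5517",
    "(206) 375-4636",
    "(206) 384-6216",
    "(206) 385-6218",
]

# Every output should match the full pattern, not just a prefix of it.
all_valid = all(re.fullmatch(phone_regex_3_fixed, n) for n in outputs_v3)

# A set drops duplicates, so its size is the number of distinct outputs.
n_distinct = len(set(outputs_v3))

print(all_valid, n_distinct)
# True 10
```
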
214+
215+
Now you've seen a quick example of the structured generation workflow that can be used at the basis for building and iteration on much larger structured generation tasks!
