
Commit 433c117: "merged"
2 parents: 01f96e4 + 56f72c5

6 files changed (+536, -74 lines)

authors.yaml

Lines changed: 5 additions & 0 deletions
@@ -63,6 +63,11 @@ ibigio:
   website: "https://twitter.com/ilanbigio"
   avatar: "https://pbs.twimg.com/profile_images/1841544725654077440/DR3b8DMr_400x400.jpg"
 
+willhath-openai:
+  name: "Will Hathaway"
+  website: "https://www.willhath.com"
+  avatar: "https://media.licdn.com/dms/image/v2/D4E03AQEHOtMrHtww4Q/profile-displayphoto-shrink_200_200/B4EZRR64p9HgAc-/0/1736541178829?e=2147483647&v=beta&t=w1rX0KhLZaK5qBkVLkJjmYmfNMbsV2Bcn8InFVX9lwI"
+
 jhills20:
   name: "James Hills"
   website: "https://twitter.com/jamesmhills"
Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluating a new model on existing responses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following eval, we are going to compare how a new model (gpt-4.1-mini) performs against our old model (gpt-4o-mini) by evaluating it on some stored responses. The benefit of this approach is that most developers won't have to spend any time putting together a whole eval -- all of their data will already be stored in their [logs page](https://platform.openai.com/logs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "import openai\n",
    "import os\n",
    "\n",
    "\n",
    "client = openai.OpenAI()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to see how gpt-4.1-mini compares to gpt-4o-mini at explaining a code base. Since you can only use the responses datasource if you already have user traffic, we're going to generate some example traffic using gpt-4o-mini, and then compare how gpt-4.1-mini does against it.\n",
    "\n",
    "We're going to get some example code files from the OpenAI SDK, and ask gpt-4o-mini to explain them to us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "openai_sdk_file_path = os.path.dirname(openai.__file__)\n",
    "\n",
    "# Get some example code files from the OpenAI SDK\n",
    "file_paths = [\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"evals\", \"evals.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"responses\", \"responses.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"images.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"embeddings.py\"),\n",
    "    os.path.join(openai_sdk_file_path, \"resources\", \"files.py\"),\n",
    "]\n",
    "\n",
    "print(file_paths[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's generate some responses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for file_path in file_paths:\n",
    "    response = client.responses.create(\n",
    "        input=[\n",
    "            {\"role\": \"user\",\n",
    "             \"content\": [\n",
    "                 {\n",
    "                     \"type\": \"input_text\",\n",
    "                     \"text\": \"What does this file do?\"\n",
    "                 },\n",
    "                 {\n",
    "                     \"type\": \"input_text\",\n",
    "                     \"text\": open(file_path, \"r\").read(),\n",
    "                 },\n",
    "             ]},\n",
    "        ],\n",
    "        model=\"gpt-4o-mini\",\n",
    "    )\n",
    "    print(response.output_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that for this to work, you'll have to run it in an org where data logging isn't disabled (through ZDR, etc.). If you aren't sure whether this is the case for you, go to https://platform.openai.com/logs?api=responses and check whether you can see the responses you just generated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "grader_system_prompt = \"\"\"\n",
    "You are **Code-Explanation Grader**, an expert software engineer and technical writer.\n",
    "Your job is to score how well *Model A* explained the purpose and behaviour of a given source-code file.\n",
    "\n",
    "### What you receive\n",
    "1. **File contents** – the full text of the code file (or a representative excerpt).\n",
    "2. **Candidate explanation** – the answer produced by Model A that tries to describe what the file does.\n",
    "\n",
    "### What to produce\n",
    "Return a single JSON object that can be parsed by `json.loads`, containing:\n",
    "```json\n",
    "{\n",
    "  \"steps\": [\n",
    "    { \"description\": \"...\", \"result\": \"float\" },\n",
    "    { \"description\": \"...\", \"result\": \"float\" },\n",
    "    { \"description\": \"...\", \"result\": \"float\" }\n",
    "  ],\n",
    "  \"result\": \"float\"\n",
    "}\n",
    "```\n",
    "• Each object in `steps` documents your reasoning for one category listed under “Scoring dimensions”.\n",
    "• Place your final 1 – 7 quality score (inclusive) in the top-level `result` key as a **string** (e.g. `\"5.5\"`).\n",
    "\n",
    "### Scoring dimensions (evaluate in this order)\n",
    "\n",
    "1. **Correctness & Accuracy ≈ 45 %**\n",
    "   • Does the explanation match the actual code behaviour, interfaces, edge cases, and side effects?\n",
    "   • Fact-check every technical claim; penalise hallucinations or missed key functionality.\n",
    "\n",
    "2. **Completeness & Depth ≈ 25 %**\n",
    "   • Are all major components, classes, functions, data flows, and external dependencies covered?\n",
    "   • Depth should be appropriate to the file’s size/complexity; superficial glosses lose points.\n",
    "\n",
    "3. **Clarity & Organization ≈ 20 %**\n",
    "   • Is the explanation well-structured, logically ordered, and easy for a competent developer to follow?\n",
    "   • Good use of headings, bullet lists, and concise language is rewarded.\n",
    "\n",
    "4. **Insight & Usefulness ≈ 10 %**\n",
    "   • Does the answer add valuable context (e.g., typical use cases, performance notes, risks) beyond line-by-line paraphrase?\n",
    "   • Highlighting **why** design choices matter is a plus.\n",
    "\n",
    "### Error taxonomy\n",
    "• **Major error** – Any statement that materially misrepresents the file (e.g., wrong API purpose, inventing non-existent behaviour).\n",
    "• **Minor error** – Small omission or wording that slightly reduces clarity but doesn’t mislead.\n",
    "List all found errors in your `steps` reasoning.\n",
    "\n",
    "### Numeric rubric\n",
    "1 Catastrophically wrong; mostly hallucination or irrelevant.\n",
    "2 Many major errors, few correct points.\n",
    "3 Several major errors OR pervasive minor mistakes; unreliable.\n",
    "4 Mostly correct but with at least one major gap or multiple minors; usable only with caution.\n",
    "5 Solid, generally correct; minor issues possible but no major flaws.\n",
    "6 Comprehensive, accurate, and clear; only very small nit-picks.\n",
    "7 Exceptional: precise, thorough, insightful, and elegantly presented; hard to improve.\n",
    "\n",
    "Use the full scale. Reserve 6.5 – 7 only when you are almost certain the explanation is outstanding.\n",
    "\n",
    "Then set `\"result\": \"4.0\"` (example).\n",
    "\n",
    "Be rigorous and unbiased.\n",
    "\"\"\"\n",
    "user_input_message = \"\"\"**User input**\n",
    "\n",
    "{{item.input}}\n",
    "\n",
    "**Response to evaluate**\n",
    "\n",
    "{{sample.output_text}}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "logs_eval = client.evals.create(\n",
    "    name=\"Code QA Eval\",\n",
    "    data_source_config={\n",
    "        \"type\": \"logs\",\n",
    "    },\n",
    "    testing_criteria=[\n",
    "        {\n",
    "            \"type\": \"score_model\",\n",
    "            \"name\": \"General Evaluator\",\n",
    "            \"model\": \"o3\",\n",
    "            \"input\": [{\n",
    "                \"role\": \"system\",\n",
    "                \"content\": grader_system_prompt,\n",
    "            }, {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": user_input_message,\n",
    "            },\n",
    "            ],\n",
    "            \"range\": [1, 7],\n",
    "            \"pass_threshold\": 5.5,\n",
    "        }\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's kick off a run to evaluate how good the original responses were. To do this, we just set the filters for the responses we want to evaluate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_4o_mini_run = client.evals.runs.create(\n",
    "    name=\"gpt-4o-mini\",\n",
    "    eval_id=logs_eval.id,\n",
    "    data_source={\n",
    "        \"type\": \"responses\",\n",
    "        \"source\": {\"type\": \"responses\", \"limit\": len(file_paths)},  # just grab the most recent responses\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see how gpt-4.1-mini does!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_41_mini_run = client.evals.runs.create(\n",
    "    name=\"gpt-4.1-mini\",\n",
    "    eval_id=logs_eval.id,\n",
    "    data_source={\n",
    "        \"type\": \"responses\",\n",
    "        \"source\": {\"type\": \"responses\", \"limit\": len(file_paths)},\n",
    "        \"input_messages\": {\n",
    "            \"type\": \"item_reference\",\n",
    "            \"item_reference\": \"item.input\",\n",
    "        },\n",
    "        \"model\": \"gpt-4.1-mini\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's go to the dashboard to see how we did!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gpt_4o_mini_run.report_url"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
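
If you'd rather check the eval results from code than from the dashboard, a minimal sketch like the one below could be added as a final cell. It reuses the `logs_eval`, `gpt_4o_mini_run`, and `gpt_41_mini_run` variables defined above, and it assumes the run objects returned by `client.evals.runs.retrieve` expose `status`, `report_url`, and `result_counts` as described in the Evals API reference; treat the field names as assumptions to verify against your SDK version.

```python
import time

# Poll both runs until they reach a terminal state, then print a short summary.
# Assumes the Evals API run object exposes `status`, `report_url`, and
# `result_counts` (passed / failed / errored / total).
for run in (gpt_4o_mini_run, gpt_41_mini_run):
    while True:
        current = client.evals.runs.retrieve(run_id=run.id, eval_id=logs_eval.id)
        if current.status in ("completed", "failed", "canceled"):
            break
        time.sleep(5)  # wait a few seconds between polls

    print(f"{current.name}: {current.status}")
    print(f"  report:  {current.report_url}")
    print(f"  results: {current.result_counts}")
```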

examples/gpt4-1_prompting_guide.ipynb

Lines changed: 2 additions & 2 deletions
@@ -577,11 +577,11 @@
 "Guidance specifically for adding a large number of documents or files to input context:\n",
 "\n",
 "* XML performed well in our long context testing. \n",
-" * Example: `<doc id="1" title="The Fox">The quick brown fox jumps over the lazy dog</doc>` \n",
+" * Example: `<doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>` \n",
 "* This format, proposed by Lee et al. ([ref](https://arxiv.org/pdf/2406.13121)), also performed well in our long context testing. \n",
 " * Example: `ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog` \n",
 "* JSON performed particularly poorly. \n",
-" * Example: `[{"id": 1, "title": "The Fox", "content": "The quick brown fox jumped over the lazy dog"}]`\n",
+" * Example: `[{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]`\n",
 "\n",
 "The model is trained to robustly understand structure in a variety of formats. Generally, use your judgement and think about what will provide clear information and “stand out” to the model. For example, if you’re retrieving documents that contain lots of XML, an XML-based delimiter will likely be less effective. \n",
 "\n",

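To make the document-formatting guidance in the hunk above concrete, here is a small hypothetical helper (the function name and input shape are assumptions for illustration, not part of the guide) that wraps retrieved documents in the XML style the guide says performed well in long-context testing.

```python
from xml.sax.saxutils import escape, quoteattr

def format_docs_as_xml(docs: list[dict]) -> str:
    """Wrap retrieved documents in <doc id=... title=...> tags
    (illustrative helper for the long-context formatting guidance)."""
    return "\n".join(
        f"<doc id={quoteattr(str(d['id']))} title={quoteattr(d['title'])}>"
        f"{escape(d['content'])}</doc>"
        for d in docs
    )

# Example usage:
docs = [{"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}]
print(format_docs_as_xml(docs))
# <doc id="1" title="The Fox">The quick brown fox jumps over the lazy dog</doc>
```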