Commit 384d120

committed
eval driven system design cookbook updates
1 parent beffd95 commit 384d120

6 files changed

+41
-11
lines changed

examples/partners/eval_driven_system_design/receipt_inspection.ipynb

Lines changed: 36 additions & 11 deletions
@@ -112,8 +112,15 @@
 "source": [
 "## Project Lifecycle\n",
 "\n",
-"Not every project will proceed in the same way, but projects generally have some common\n",
-"important components.\n",
+"Not every project will proceed in the same way, but projects generally have some \n",
+"important components in common.\n",
+"\n",
+"![Project Lifecycle](../../../images/partner_project_lifecycle.png)\n",
+"\n",
+"The solid arrows show the primary progressions or steps, while the dotted line \n",
+"represents the ongoing nature of problem understanding - uncovering more about\n",
+"the customer domain will influence every step of the process. We will examine \n",
+"several of these iterative cycles of refinement in detail below. \n",
 "\n",
 "### 1. Understand the Problem\n",
 "\n",
@@ -133,10 +140,11 @@
 "It's very rare that a real-world project will start with all the data necessary to get\n",
 "to a satisfactory solution, much less to establish confidence.\n",
 "\n",
-"In our case, we're going to assume that we have a decent sample of system *inputs*\n",
-"(here, photographs of receipts), but start without any fully annotated data. We'll walk\n",
-"through the process of incrementally expanding our test and training sets as we go along\n",
-"and make our evals progressively more comprehensive.\n",
+"In our case, we're going to assume that we have a decent sample of system *inputs*, \n",
+"in the form of receipt images, but start without any fully annotated data. We find \n",
+"this is a not-unusual situation when automating an existing process. Instead, \n",
+"we'll walk through the process of building that out as we go along by collaborating with\n",
+"domain experts, and make our evals progressively more comprehensive.\n",
 "\n",
 "### 3. Build an End-to-End V0 System\n",
 "\n",
@@ -394,7 +402,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"<img src=\"../../../images/Supplies_20240322_220858_Raven_Scan_3_jpeg.rf.50852940734939c8838819d7795e1756.jpg\" alt=\"Walmart_image\" width=\"400\"/>"
+"![Walmart_image](../../../images/Supplies_20240322_220858_Raven_Scan_3_jpeg.rf.50852940734939c8838819d7795e1756.jpg)"
 ]
 },
 {
@@ -497,8 +505,22 @@
 "source": [
 "### Action Decision\n",
 "\n",
-"Next, we need to close the loop and get to an actual decision based on receipts. This\n",
-"looks pretty similar, so we'll present the code without comment."
+"Next, we need to close the loop and get to an actual decision based on receipts. \n",
+"\n",
+"Ordinarily one would start with the most capable model - `o3`, at this time - for a \n",
+"first pass, and then once correctness is established experiment with different models\n",
+"to analyze any tradeoffs for their business impact, and potentially consider whether \n",
+"they are remediable with iteration. A client may be willing to take a certain accuracy \n",
+"hit for lower latency or cost, or it may be more effective to change the architecture\n",
+"to hit cost, latency, and accuracy goals. We'll get into how to make these tradeoffs\n",
+"explicitly and objectively later on. \n",
+"\n",
+"For this cookbook, `o3` might be too good. We'll use `o4-mini` for our first pass, so \n",
+"that we get a few reasoning errors we can use to illustrate the means of addressing\n",
+"them when they occur.\n",
+"\n",
+"Otherwise, this is pretty similar to the last, so we'll present the code without \n",
+"further comment."
 ]
 },
 {
@@ -887,7 +909,10 @@
 "metadata": {},
 "source": [
 "After you run that eval you'll be able to view it in the UI, and should see something\n",
-"like:\n",
+"like the below. \n",
+"\n",
+"(Note, if you have a Zero-Data-Retention agreement, this data is not stored\n",
+"by OpenAI, so will not be available in this interface.)\n",
 "\n",
 "![Summary UI](../../../images/partner_summary_ui.png)\n",
 "\n",
@@ -1617,7 +1642,7 @@
 "ARE NOT TRAVEL-RELATED, THEN IT MUST BE AUDITED.\n",
 "```\n",
 "\n",
-"3. We added three examples, JSON input/output pairs wrapped in XML tags.\n",
+"4. We added three examples, JSON input/output pairs wrapped in XML tags.\n",
 "\n",
 "With our prompt revisions, we'll regenerate the data to evaluate and re-run the same\n",
 "eval to compare our results:"
Four binary files changed (names not captured): -126 KB, -317 KB, 7.02 KB, 161 KB

registry.yaml

Lines changed: 5 additions & 0 deletions
@@ -9,8 +9,13 @@
 date: 2025-06-01
 authors:
 - shikhar-cyber
+- moredatarequired
+- tooluser
+- eddiesiegel
 tags:
 - evals
+- API Flywheel
+- completions
 - responses
 - functions
 - tracing
