Commit 1cde4d0

Evals lab instructions
1 parent b0fa1cb commit 1cde4d0

3 files changed: +99 -0 lines changed

labs/09-eval-harness.md

Lines changed: 14 additions & 0 deletions
@@ -23,6 +23,20 @@ Here are the steps for this lab:
1. Create a `Fixed Description` column with all the revised descriptions
1. Save the dataframe to `eval.output.csv`

Now we are going to do "Open Coding".

1. First, discuss with your partner and agree on a common list of expectations: what do you expect to see in a correct output? Decide on 5 points to look for.
1. Open the CSV in Excel
1. Read each row of input and output
1. In the last column, write your feedback on the output

The last step is "Axial Coding".

1. Look through the feedback in your Excel sheet and check whether the same error is repeated in 2-3 rows
1. This is the failure mode you need to fix in your app

At this step, you will make some changes to your application, experiment with prompt changes to fix the failure mode, and then repeat the process to see what has changed.

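If the sheet grows beyond what you can scan by eye, the axial coding step can be approximated with a small tally. The sketch below uses only the Python standard library and assumes you wrote short, consistent labels in the feedback column; free-form feedback still needs to be grouped by hand first. The function name `find_failure_modes` and the sample notes are hypothetical.

```python
from collections import Counter

def find_failure_modes(feedback_notes, min_rows=2):
    """Return (note, count) pairs for feedback that recurs in at least min_rows rows."""
    # Normalize case and whitespace so "Too long " and "too long" are counted together.
    normalized = [note.strip().lower() for note in feedback_notes if note and note.strip()]
    counts = Counter(normalized)
    return [(note, n) for note, n in counts.most_common() if n >= min_rows]

# Hypothetical feedback values, as they might appear in the last column of the sheet.
notes = [
    "Missing salary range",
    "Too long",
    "missing salary range ",
    "",
    "Biased wording",
    "Too long",
    "Missing salary range",
]

for note, n in find_failure_modes(notes):
    print(f"{n} rows: {note}")
```

Any label that shows up in at least `min_rows` rows is a candidate failure mode to fix first.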
## Hints

### How do I read a csv?

labs/10-synthetic-data.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# Lab 10: Synthetic Data Generation

We have manually selected a few examples for our test samples, but that is not enough. Ideally, we want around 100 test samples.

## Step 1: Scenario Creation

The evaluation samples that we have are mainly developer roles, which is not diverse enough. Our system might have all sorts of jobs, such as teachers or factory supervisors. We need a diverse test set so that we can evaluate the app against many different possibilities.

To do that, we need to come up with the different dimensions that might affect the job descriptions.

For example, the industry might be one dimension: it could be Technology, Education, Manufacturing, Marketing, etc.

Another dimension might be the length of the description: less than 100 words, 100 to 500 words, or more than 500 words.

In this way, we end up with several dimensions, each with a set of possible values.

1. Normally, we would talk to the customer, user, or domain expert to understand this better. For this lab, discuss with your partner and come up with three more dimensions, with 3-5 values per dimension
1. Then randomly select about 50 combinations of values. For example, one combination may be: (Marketing, less than 100 words, X, Y, Z). You can prompt ChatGPT to generate these combinations if you want

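If you would rather pick the combinations in code than prompt for them, the random selection can be sketched as follows. This is a minimal example using only the standard library; the Industry and Length values come from the text above, while the other three dimensions are placeholders for whatever you and your partner decide on. The function name `sample_combinations` and the fixed seed are assumptions made for illustration.

```python
import itertools
import random

# Industry and Length are taken from the lab text; the remaining dimensions
# are placeholders to be replaced with the ones you come up with.
dimensions = {
    "Industry": ["Technology", "Education", "Manufacturing", "Marketing"],
    "Length": ["less than 100 words", "100 to 500 words", "more than 500 words"],
    "Dimension 3": ["X1", "X2", "X3"],
    "Dimension 4": ["Y1", "Y2", "Y3"],
    "Dimension 5": ["Z1", "Z2", "Z3"],
}

def sample_combinations(dimensions, k=50, seed=0):
    """Pick up to k distinct tuples, one value per dimension, uniformly at random."""
    all_combos = list(itertools.product(*dimensions.values()))
    rng = random.Random(seed)  # fixed seed so the same sample can be regenerated
    return rng.sample(all_combos, min(k, len(all_combos)))

combos = sample_combinations(dimensions)
print(len(combos))
print(combos[0])
```

Uniform sampling can produce combinations that are unrealistic for your domain, so it is still worth reading through the list and replacing any that do not make sense.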
## Step 2: Data Generation

Now that we have the list of tuples, we need to ask ChatGPT to generate one job description for each tuple. Each generated job description has to match the values given in its tuple.

By the end, we should have 50 synthetic data samples.

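Pasting all 50 tuples into a single prompt may give uneven results, so one option is to send them in smaller batches. The sketch below only builds the prompt text; it does not call any model. The template is a shortened version of the prompt in the Hints section, and the batch size of 10, the name `build_prompts`, and the example tuples are assumptions.

```python
# Shortened version of the generation prompt from the Hints section.
PROMPT_TEMPLATE = """Convert these dimension combinations into realistic job descriptions.

Include only 1 example per dimension_example.

<dimension_examples>
{tuples}
</dimension_examples>"""

def build_prompts(combos, batch_size=10):
    """Split the tuples into batches and render one prompt string per batch."""
    prompts = []
    for start in range(0, len(combos), batch_size):
        batch = combos[start:start + batch_size]
        # Render each tuple as "(value, value, ...)" on its own line.
        lines = "\n".join("(" + ", ".join(combo) + ")" for combo in batch)
        prompts.append(PROMPT_TEMPLATE.format(tuples=lines))
    return prompts

# Hypothetical tuples in the order (Industry, Length, Language, Type, Seniority).
example_combos = [
    ("Marketing", "Less than 100 words", "Jargon heavy", "Remote", "Mid Level"),
    ("Teaching", "100 to 500 words", "Easy to understand", "Onsite", "Fresher"),
]

print(build_prompts(example_combos)[0])
```

Each returned string can be pasted into ChatGPT as is; collect the responses into one file so that you end up with one job description per tuple.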
## Hints

### What prompt can I give ChatGPT to generate the scenarios?

<details>
<summary>Answer</summary>

```
I am designing an Application Tracking System and want to test it with a diverse set of user scenarios. Please generate 50 unique combinations (tuples) using the following key dimensions and their possible values:

- Industry: Technology, Marketing, Manufacturing, Teaching, Medicine, Shipping
- Length: Less than 100 words, 100 to 500 words, more than 500 words
- Language: Easy to understand, biased language, confusing, jargon heavy
- Type: Onsite, Remote, Hybrid
- Seniority: Fresher, Mid Level, Executive

Each combination should select one value from each dimension. Present the results as a list of tuples, where each tuple contains one value for each dimension in the following order: (Industry, Length, Language, Type, Seniority). Ensure that the combinations are varied and realistic.
```
</details>

### What prompt can I give ChatGPT to generate the synthetic data?

<details>
<summary>Answer</summary>

```
Convert these dimension combinations into realistic job descriptions for an application tracking system.

Include variations in:
- Structuring (Free text vs structured with headings)
- Common typos
- Natural language patterns
- Realistic context and urgency

Include only 1 example per dimension_example.

<dimension_examples>
{put the tuples here}
</dimension_examples>
```
</details>

labs/11-vibe-coding.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Lab 11: Vibe Coding an Annotation Tool

Did you find it tiring to do the open coding exercise with the Excel sheet?

Developers and domain experts will be repeating this exercise over and over again, across 100 or more samples, so it makes sense to have an interface that makes it easy.

In this lab, we will build a tool to make it easier to do annotations. Because this is an internal tool that won't go to production, we will just vibe code it!

It needs to:

- Read the CSV of eval inputs and outputs
- Display each sample on the screen in a way that is easy to read and evaluate
- Allow the annotator to mark "pass" or "fail". If they mark "fail", they can give a reason
- Show on the UI which samples have been annotated and how many are remaining
- Automatically move to the next sample after a sample has been annotated

This is a free-form lab. Just open the vibe coding tool that you like, and have fun!

At the end, we will all show 5-minute demos of the apps that we built.
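
If you want something to check the generated app against, the bookkeeping behind the requirements above (record a verdict, count progress, move to the next sample) can be sketched without any UI. This is only one possible shape, using the standard library. The `Description` column name, the `verdict` and `reason` fields, and all function names are assumptions; only `Fixed Description` comes from the earlier lab.

```python
import csv
import io

def load_samples(csv_text):
    """Parse CSV text into a list of row dicts, adding empty annotation fields."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row.setdefault("verdict", "")  # "pass", "fail", or "" if not annotated yet
        row.setdefault("reason", "")
    return rows

def next_unannotated(rows, start=0):
    """Return the index of the first unannotated row at or after start, wrapping around once."""
    order = list(range(start, len(rows))) + list(range(0, start))
    for i in order:
        if not rows[i]["verdict"]:
            return i
    return None  # everything has been annotated

def annotate(rows, index, verdict, reason=""):
    """Record a verdict for rows[index] and return the next row to show, or None."""
    if verdict not in ("pass", "fail"):
        raise ValueError("verdict must be 'pass' or 'fail'")
    rows[index]["verdict"] = verdict
    rows[index]["reason"] = reason if verdict == "fail" else ""
    return next_unannotated(rows, start=index + 1)

def progress(rows):
    """Return (annotated, remaining) counts for the progress display."""
    done = sum(1 for row in rows if row["verdict"])
    return done, len(rows) - done

# Hypothetical three-row eval file.
csv_text = "Description,Fixed Description\nold a,new a\nold b,new b\nold c,new c\n"
rows = load_samples(csv_text)
current = annotate(rows, 0, "pass")
current = annotate(rows, current, "fail", reason="Too long")
print(progress(rows), "next sample:", current)
```

A real tool would also need to save the annotations back to a file, which is left out here.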
