Evals lab instructions

siddhi · siddhi · commit 1cde4d066e17 · 2025-12-05T00:48:57.000+05:30
diff --git a/labs/09-eval-harness.md b/labs/09-eval-harness.md
@@ -23,6 +23,20 @@ Here are the steps for this lab:
 1. Create a `Fixed Description` column with all the revised descriptions
 1. Save the dataframe to `eval.output.csv`
 
+Now we are going to do "Open Coding"
+
+1. First discuss with your partner and come up with a common list of expectations -- what are you expecting in a correct output? Decide on 5 points to look for.
+1. Open the CSV in Excel
+1. Read each row of input and output
+1. In the last column, write your feedback on the output
+
+The last step is "Axial Coding"
+
+1. Look through the feedback in your Excel sheet and see if there is an error that is repeated in 2-3 rows
+1. This is the failure mode you need to fix in your app
+
+At this step, make some changes to your application: experiment with prompt changes to fix the failure mode, then repeat the process to see what has changed.
+
 ## Hints
 
 ### How do I read a csv?
diff --git a/labs/10-synthetic-data.md b/labs/10-synthetic-data.md
@@ -0,0 +1,66 @@
+# Lab 10: Synthetic Data Generation
+
+We have manually selected a few examples for our test samples, but that is not enough. Ideally we want around 100 test samples.
+
+## Step 1: Scenario Creation
+
+The evaluation samples that we have are mainly developer roles. This is not diverse enough. We might have all sorts of jobs on our system, like teachers or factory supervisors. We need to ensure our test sample set is diverse so we can evaluate with many different possibilities.
+
+To do that, we need to come up with the different dimensions that might affect the job descriptions.
+
+For example, Industry might be one dimension: it could be Technology, Education, Manufacturing, Marketing, etc.
+
+Another dimension might be the length of the description: less than 100 words, 100 to 500 words, or more than 500 words.
+
+In this way, there will be many dimensions, and each dimension will have some possible values.
+
+1. Normally, we will talk to the customer / user / domain expert to understand this better. For this lab, discuss with your partner and come up with three more dimensions, with 3-5 values per dimension.
+1. Then randomly select about 50 combinations. For example, one combination may be: (Marketing, less than 100 words, X, Y, Z). You can prompt ChatGPT to generate these combinations if you want.
+
+## Step 2: Data Generation
+
+Now that we have the list of tuples, we need to ask ChatGPT to generate one job description for each tuple. Each job description has to follow the style given in its tuple.
+
+By the end, we should have 50 synthetic data samples.
+
+## Hints
+
+### What prompt can I give ChatGPT to generate the scenarios?
+
+<details>
+<summary>Answer</summary>
+
+```
+I am designing an Applicant Tracking System and want to test it with a diverse set of user scenarios. Please generate 50 unique combinations (tuples) using the following key dimensions and their possible values:
+
+- Industry: Technology, Marketing, Manufacturing, Teaching, Medicine, Shipping 
+- Length: Less than 100 words, 100 to 500 words, more than 500 words 
+- Language: Easy to understand, biased language, confusing, jargon heavy 
+- Type: Onsite, Remote, Hybrid 
+- Seniority: Fresher, Mid Level, Executive 
+
+Each combination should select one value from each dimension. Present the results as a list of tuples, where each tuple contains one value for each dimension in the following order: (Industry, Length, Language, Type, Seniority). Ensure that the combinations are varied and realistic.
+```
+</details>
+
+### What prompt can I give ChatGPT to generate the synthetic data?
+
+<details>
+<summary>Answer</summary>
+
+```
+Convert these dimension combinations into realistic job descriptions for an applicant tracking system.
+
+Include variations in: 
+- Structuring (Free text vs structured with headings) 
+- Common typos 
+- Natural language patterns 
+- Realistic context and urgency 
+
+Generate exactly 1 job description per tuple in dimension_examples.
+
+<dimension_examples>
+{put the tuples here}
+</dimension_examples>
+```
+</details>
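
For Step 1 of Lab 10, the random selection of combinations can also be done locally rather than by prompting ChatGPT. The following is a minimal sketch, assuming the dimension values listed in the scenario prompt above. The function name `sample_combinations` and the fixed seed are illustrative choices and are not part of the lab.

```python
import itertools
import random

# Dimensions and values taken from the scenario prompt in the Lab 10 hints.
DIMENSIONS = {
    "Industry": ["Technology", "Marketing", "Manufacturing", "Teaching", "Medicine", "Shipping"],
    "Length": ["Less than 100 words", "100 to 500 words", "More than 500 words"],
    "Language": ["Easy to understand", "Biased language", "Confusing", "Jargon heavy"],
    "Type": ["Onsite", "Remote", "Hybrid"],
    "Seniority": ["Fresher", "Mid Level", "Executive"],
}


def sample_combinations(dimensions, k, seed=None):
    """Return k distinct tuples, one value per dimension, in dimension order.

    Hypothetical helper: it enumerates the full Cartesian product and samples
    from it without replacement, so no combination is repeated.
    """
    all_combos = list(itertools.product(*dimensions.values()))
    if k > len(all_combos):
        raise ValueError(f"Only {len(all_combos)} combinations exist, cannot sample {k}")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(all_combos, k)


combos = sample_combinations(DIMENSIONS, k=50, seed=42)
for combo in combos[:3]:
    print(combo)
```

Uniform sampling does not check that a combination is realistic (the prompt asks ChatGPT for "varied and realistic" tuples), so it may still be worth skimming the sampled list before pasting it into the Step 2 prompt.
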
diff --git a/labs/11-vibe-coding.md b/labs/11-vibe-coding.md
@@ -0,0 +1,19 @@
+# Lab 11: Vibe Coding an Annotation Tool
+
+Did you find it tiring to do the open coding exercise with the Excel sheet?
+
+Given that developers and domain experts will repeat this exercise many times, across 100 or more samples, it makes sense to have an interface that makes it easy.
+
+In this lab, we will build a tool to make it easier to do annotations. Because this is an internal tool that won't go to production, we will just vibe code it!
+
+It needs to:
+
+- Read the CSV of eval inputs and outputs
+- Display it on the screen in a way that is easy to read and evaluate
+- Allow the annotator to mark "pass" or "fail". If they mark "fail", they can give a reason
+- Show on the UI which samples have been annotated and how many are remaining
+- Automatically move to the next sample after a sample has been annotated
+
+This is a free-form lab. Just open the vibe coding tool that you like, and have fun!
+
+At the end, we will each give a 5-minute demo of the app that we built.
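
The requirements listed for Lab 11 boil down to a small amount of state tracking, whichever UI the vibe coding tool produces. The following is a rough sketch of that core logic only, with no UI. The names `AnnotationSession` and `load_session` and the `Description` column are hypothetical; only the `Fixed Description` column comes from the earlier lab.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class AnnotationSession:
    """Hypothetical in-memory state for one annotation pass over the eval CSV."""
    rows: list                                    # one dict per CSV row
    verdicts: dict = field(default_factory=dict)  # row index -> (verdict, reason)

    def next_unannotated(self):
        """Return the index of the first row without a verdict, or None if done."""
        for i in range(len(self.rows)):
            if i not in self.verdicts:
                return i
        return None

    def annotate(self, index, verdict, reason=""):
        """Record "pass" or "fail" for a row and return the next row to show."""
        if verdict not in ("pass", "fail"):
            raise ValueError("verdict must be 'pass' or 'fail'")
        if not 0 <= index < len(self.rows):
            raise IndexError("row index out of range")
        # A reason is only kept for failures.
        self.verdicts[index] = (verdict, reason if verdict == "fail" else "")
        return self.next_unannotated()

    def progress(self):
        """Return (annotated_count, remaining_count) for the progress display."""
        done = len(self.verdicts)
        return done, len(self.rows) - done


def load_session(csv_file):
    """Build a session from an open CSV file object with a header row."""
    return AnnotationSession(rows=list(csv.DictReader(csv_file)))


# Tiny in-memory CSV standing in for eval.output.csv (contents are made up).
sample_csv = io.StringIO(
    "Description,Fixed Description\n"
    "old text A,new text A\n"
    "old text B,new text B\n"
    "old text C,new text C\n"
)
session = load_session(sample_csv)
next_index = session.annotate(0, "pass")
next_index = session.annotate(next_index, "fail", reason="dropped the salary range")
print(session.progress())
```

Having `annotate` return the next unannotated index is one way to cover the "automatically move to the next sample" requirement; a real tool would also need to write the verdicts back to disk.
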