| 1 | +# Lab 10: Synthetic Data Generation |
| 2 | + |
| 3 | +We have manually selected a few examples for our test samples, but that is not enough. Ideally we want around 100 test samples. |
| 4 | + |
| 5 | +## Step 1: Scenario Creation |
| 6 | + |
| 7 | +The evaluation samples that we have are mainly developer roles, which is not diverse enough. Our system might host all sorts of jobs, such as teachers or factory supervisors. We need a diverse test sample set so that the evaluation covers many different kinds of job descriptions. |
| 8 | + |
| 9 | +To do that, we need to come up with the different dimensions that might affect the job descriptions. |
| 10 | + |
| 11 | +For example, the industry might be one dimension: it could be Technology, Education, Manufacturing, Marketing, etc. |
| 12 | + |
| 13 | +Another dimension might be the length of the description: less than 100 words, 100 to 500 words, or more than 500 words. |
| 14 | + |
| 15 | +In the same way, there will be many dimensions, each with its own set of possible values. |
| 16 | + |
| 17 | +1. Normally, we would talk to the customer / user / domain expert to understand this better. For this lab, discuss with your partner and come up with three more dimensions, with 3-5 values per dimension. |
| 18 | +1. Then randomly select about 50 combinations of values. For example, one combination may be (Marketing, less than 100 words, X, Y, Z). You can prompt ChatGPT to generate these combinations if you want. |
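If you would rather sample the combinations locally than prompt ChatGPT, the selection step can be sketched in Python as below. This is only an illustration: the dimension values are copied from the example prompt in the Hints section, and the names `DIMENSIONS` and `sample_combinations` as well as the fixed seed are assumptions, not part of the lab.

```python
import itertools
import random

# Example dimensions taken from the Hints section; replace them with the
# dimensions and values you agreed on with your partner.
DIMENSIONS = {
    "Industry": ["Technology", "Marketing", "Manufacturing", "Teaching", "Medicine", "Shipping"],
    "Length": ["Less than 100 words", "100 to 500 words", "More than 500 words"],
    "Language": ["Easy to understand", "Biased language", "Confusing", "Jargon heavy"],
    "Type": ["Onsite", "Remote", "Hybrid"],
    "Seniority": ["Fresher", "Mid Level", "Executive"],
}


def sample_combinations(dimensions, k=50, seed=0):
    """Return k distinct tuples, each holding one value per dimension."""
    # Enumerate every possible combination (the Cartesian product) ...
    all_combinations = list(itertools.product(*dimensions.values()))
    # ... then draw k of them without replacement, reproducibly.
    rng = random.Random(seed)
    return rng.sample(all_combinations, k)


combinations = sample_combinations(DIMENSIONS)
for combo in combinations[:3]:
    print(combo)
```

Sampling uniformly from the full product guarantees unique tuples, but it does not check whether a combination is realistic, which is one reason you might still prefer to let ChatGPT pick them.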
| 19 | + |
| 20 | +## Step 2: Data Generation |
| 21 | + |
| 22 | +Now that we have the list of tuples, we need to ask ChatGPT to generate one job description for each tuple. Each job description must match the values given in its tuple. |
| 23 | + |
| 24 | +By the end, we should have 50 synthetic data samples. |
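If your tuples live in a Python list, pasting them into the generation prompt can be sketched as follows. The template is an abbreviated version of the prompt in the Hints section; `PROMPT_TEMPLATE`, `build_prompt`, and the two sample tuples are hypothetical names and values chosen for illustration. The sketch only builds the prompt text, which you would then paste into ChatGPT yourself.

```python
# Abbreviated from the generation prompt in the Hints section.
PROMPT_TEMPLATE = """Convert these dimension combinations into realistic job descriptions.

Include variations in:
- Structuring (Free text vs structured with headings)
- Common typos
- Natural language patterns
- Realistic context and urgency

<dimension_examples>
{tuples}
</dimension_examples>"""


def build_prompt(combinations):
    """Format each tuple as '(a, b, c, ...)' on its own line and fill the template."""
    lines = ["(" + ", ".join(combo) + ")" for combo in combinations]
    return PROMPT_TEMPLATE.format(tuples="\n".join(lines))


# Two made-up tuples, in the order (Industry, Length, Language, Type, Seniority).
example_combinations = [
    ("Marketing", "Less than 100 words", "Jargon heavy", "Remote", "Fresher"),
    ("Teaching", "100 to 500 words", "Easy to understand", "Onsite", "Mid Level"),
]

prompt = build_prompt(example_combinations)
print(prompt)
```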
| 25 | + |
| 26 | +## Hints |
| 27 | + |
| 28 | +### What prompt can I give ChatGPT to generate the scenarios? |
| 29 | + |
| 30 | +<details> |
| 31 | +<summary>Answer</summary> |
| 32 | + |
| 33 | +``` |
| 34 | +I am designing an Applicant Tracking System and want to test it with a diverse set of user scenarios. Please generate 50 unique combinations (tuples) using the following key dimensions and their possible values: |
| 35 | + |
| 36 | +- Industry: Technology, Marketing, Manufacturing, Teaching, Medicine, Shipping |
| 37 | +- Length: Less than 100 words, 100 to 500 words, more than 500 words |
| 38 | +- Language: Easy to understand, biased language, confusing, jargon heavy |
| 39 | +- Type: Onsite, Remote, Hybrid |
| 40 | +- Seniority: Fresher, Mid Level, Executive |
| 41 | + |
| 42 | +Each combination should select one value from each dimension. Present the results as a list of tuples, where each tuple contains one value for each dimension in the following order: (Industry, Length, Language, Type, Seniority). Ensure that the combinations are varied and realistic. |
| 43 | +``` |
| 44 | +</details> |
| 45 | + |
| 46 | +### What prompt can I give ChatGPT to generate the synthetic data? |
| 47 | + |
| 48 | +<details> |
| 49 | +<summary>Answer</summary> |
| 50 | + |
| 51 | +``` |
| 52 | +Convert these dimension combinations into realistic job descriptions for an applicant tracking system. |
| 53 | + |
| 54 | +Include variations in: |
| 55 | +- Structuring (Free text vs structured with headings) |
| 56 | +- Common typos |
| 57 | +- Natural language patterns |
| 58 | +- Realistic context and urgency |
| 59 | + |
| 60 | +Generate exactly one job description for each tuple in dimension_examples. |
| 61 | + |
| 62 | +<dimension_examples> |
| 63 | +{put the tuples here} |
| 64 | +</dimension_examples> |
| 65 | +``` |
| 66 | +</details> |