|
599 | 599 | "\n", |
600 | 600 | "#### 📚 What you'll learn\n", |
601 | 601 | "\n", |
602 | | - "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic dataset of W-2 forms (US Wage & Tax Statements).\n", |
| 602 | + "The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n", |
| 603 | + " dataset of W-2 forms (US Wage & Tax Statements).\n", |
603 | 604 | "\n", |
604 | 605 | "- We will generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n", |
605 | 606 | "\n", |
606 | | - "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world census data.\n", |
| 607 | + "- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n", |
| 608 | + " for generated persons reflect real-world census data.\n", |
607 | 609 | "\n", |
608 | 610 | "\n", |
609 | 611 | "<br>\n", |
610 | 612 | "\n", |
611 | 613 | "> 👋 **IMPORTANT** – Environment Setup\n", |
612 | 614 | ">\n", |
613 | | - "> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n", |
| 615 | + "> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n", |
614 | 616 | ">\n", |
615 | 617 | "> - You may need to restart your notebook's kernel after setting up the environment.\n", |
616 | 618 | "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n", |
|
755 | 757 | "id": "bbcb3538", |
756 | 758 | "metadata": {}, |
757 | 759 | "source": [ |
758 | | - "### 🎲 Setting Up Taxpayer and Employer Sampling\n", |
| 760 | + "## 🎲 Setting Up Taxpayer and Employer Sampling\n", |
759 | 761 | "\n", |
760 | 762 | "- Sampler columns offer non-LLM based generation of synthetic data.\n", |
761 | 763 | "\n", |
762 | 764 | "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n", |
763 | 765 | "\n", |
764 | | - "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census. If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" |
| 766 | + "- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n", |
| 767 | + " If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker" |
765 | 768 | ] |
766 | 769 | }, |
767 | 770 | { |
|
800 | 803 | "id": "28397d74", |
801 | 804 | "metadata": {}, |
802 | 805 | "source": [ |
803 | | - "### ⚡️ Defining the Fields\n", |
| 806 | + "## ⚡️ Defining the Fields\n", |
804 | 807 | "\n", |
805 | 808 | "We will focus on the following:\n", |
806 | 809 | "- Box 1 (Wages, tips, and other compensation)\n", |
|
819 | 822 | "\n", |
820 | 823 | "### Numerical fields\n", |
821 | 824 | "\n", |
822 | | - "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). We'll use the W-2 statistics from the IRS linked above to generate realistic samples." |
| 825 | + "Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n", |
| 826 | + "We'll use the W-2 statistics from the IRS linked above to generate realistic samples." |
823 | 827 | ] |
824 | 828 | }, |
825 | 829 | { |
|
999 | 1003 | "source": [ |
1000 | 1004 | "### 🦜 Non-numerical Fields\n", |
1001 | 1005 | "\n", |
1002 | | - "The remaining fields contain information about the employee (taxpayer) and the employer. We'll use the person sampler in combination with an LLM to generate values here." |
| 1006 | + "The remaining fields contain information about the employee (taxpayer) and the employer. \\\n", |
| 1007 | + "We'll use the person sampler in combination with an LLM to generate values here." |
1003 | 1008 | ] |
1004 | 1009 | }, |
1005 | 1010 | { |
|
1162 | 1167 | "metadata": {}, |
1163 | 1168 | "outputs": [], |
1164 | 1169 | "source": [ |
1165 | | - "job_results = data_designer_client.create(config_builder, num_records=2)\n", |
| 1170 | + "job_results = data_designer_client.create(config_builder, num_records=20)\n", |
1166 | 1171 | "\n", |
1167 | 1172 | "# This will block until the job is complete.\n", |
1168 | 1173 | "job_results.wait_until_done()" |
|
1206 | 1211 | "# Download the job artifacts and save them to disk.\n", |
1207 | 1212 | "job_results.download_artifacts(\n", |
1208 | 1213 | " output_path=TUTORIAL_OUTPUT_PATH,\n", |
1209 | | - " artifacts_folder_name=\"artifacts-community-contributions-w2-dataset\",\n", |
| 1214 | + " artifacts_folder_name=\"artifacts-community-contributions-forms-w2-dataset\",\n", |
1210 | 1215 | ");" |
1211 | 1216 | ] |
1212 | 1217 | } |
|