Commit 0a6dbb1

modified healthcare notebooks for 25.10
1 parent e74b70d commit 0a6dbb1

4 files changed: +2221 -1696 lines changed
nemo/NeMo-Data-Designer/self-hosted-tutorials/community-contributions/forms/w2-dataset.ipynb

Lines changed: 15 additions & 10 deletions
@@ -599,18 +599,20 @@
 "\n",
 "#### 📚 What you'll learn\n",
 "\n",
-"The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic dataset of W-2 forms (US Wage & Tax Statements).\n",
+"The notebook demonstrates how you can combine numerical samplers, the person sampler and LLMs to create a synthetic\\\n",
+" dataset of W-2 forms (US Wage & Tax Statements).\n",
 "\n",
 "- We will use generate numerical fields using [statistics published by the IRS](https://www.irs.gov/pub/irs-pdf/p5385.pdf) for the year 2021:\n",
 "\n",
-"- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world census data.\n",
+"- We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics\\\n",
+" for generated persons reflect real-world census data.\n",
 "\n",
 "\n",
 "<br>\n",
 "\n",
 "> 👋 **IMPORTANT** – Environment Setup\n",
 ">\n",
-"> - If you haven't already, follow the instructions in the [README](../README.md) to install the necessary dependencies.\n",
+"> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.\n",
 ">\n",
 "> - You may need to restart your notebook's kernel after setting up the environment.\n",
 "> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.\n",
@@ -755,13 +757,14 @@
 "id": "bbcb3538",
 "metadata": {},
 "source": [
-"### 🎲 Setting Up Taxpayer and Employer Sampling\n",
+"## 🎲 Setting Up Taxpayer and Employer Sampling\n",
 "\n",
 "- Sampler columns offer non-LLM based generation of synthetic data.\n",
 "\n",
 "- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.\n",
 "\n",
-"- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census. If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker"
+"- The persona samplers allow you to sample realistic details of individuals using a model trained on the US Census.\\\n",
+" If the locale of the persona you are generating is anything other than `en_US`, then the personas will be generated using Faker"
 ]
 },
 {
@@ -800,7 +803,7 @@
 "id": "28397d74",
 "metadata": {},
 "source": [
-"### ⚡️ Defining the Fields\n",
+"## ⚡️ Defining the Fields\n",
 "\n",
 "We will focus on the following:\n",
 "- Box 1 (Wages, tips, and other compensation)\n",
@@ -819,7 +822,8 @@
 "\n",
 "### Numerical fields\n",
 "\n",
-"Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). We'll use the W-2 statistics from the IRS linked above to generate realistic samples."
+"Here, we'll define how to generate numerical samples for the currency fields of the W-2 (Boxes 1-7). \\\n",
+"We'll use the W-2 statistics from the IRS linked above to generate realistic samples."
 ]
 },
 {
@@ -999,7 +1003,8 @@
 "source": [
 "### 🦜 Non-numerical Fields\n",
 "\n",
-"The remaining fields contain information about the employee (taxpayer) and the employer. We'll use the person sampler in combination with an LLM to generate values here."
+"The remaining fields contain information about the employee (taxpayer) and the employer. \\\n",
+"We'll use the person sampler in combination with an LLM to generate values here."
 ]
 },
 {
@@ -1162,7 +1167,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"job_results = data_designer_client.create(config_builder, num_records=2)\n",
+"job_results = data_designer_client.create(config_builder, num_records=20)\n",
 "\n",
 "# This will block until the job is complete.\n",
 "job_results.wait_until_done()"
@@ -1206,7 +1211,7 @@
 "# Download the job artifacts and save them to disk.\n",
 "job_results.download_artifacts(\n",
 " output_path=TUTORIAL_OUTPUT_PATH,\n",
-" artifacts_folder_name=\"artifacts-community-contributions-w2-dataset\",\n",
+" artifacts_folder_name=\"artifacts-community-contributions-forms-w2-dataset\",\n",
 ");"
 ]
 }
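
For readers of this diff who want a feel for what the notebook's "Numerical fields" section is describing, here is a minimal, hypothetical sketch of sampling the W-2 currency boxes (Boxes 1-7) in plain Python. It does not use the Data Designer API or the IRS statistics the notebook relies on: the log-normal wage parameters and the withholding range are placeholder assumptions, and `sample_w2_boxes` is a made-up helper name. The 2021 Social Security wage base ($142,800) and the employee tax rates (6.2% and 1.45%) are the standard figures for that tax year. The sketch also assumes no pre-tax deferrals and no tips, and it ignores the Additional Medicare Tax.

```python
import random

SS_WAGE_BASE_2021 = 142_800.00  # Social Security wage base for tax year 2021
SS_TAX_RATE = 0.062             # employee share of Social Security tax
MEDICARE_TAX_RATE = 0.0145      # employee share of Medicare tax


def sample_w2_boxes(rng: random.Random) -> dict[str, float]:
    """Sample one set of W-2 currency fields (hypothetical helper)."""
    # ASSUMPTION: wages are log-normal with illustrative parameters,
    # not fitted to the IRS statistics referenced in the notebook.
    wages = round(rng.lognormvariate(10.7, 0.8), 2)
    # ASSUMPTION: federal withholding is a uniform 5-25% of wages.
    withholding_rate = rng.uniform(0.05, 0.25)
    # Social Security wages are capped at the annual wage base.
    ss_wages = min(wages, SS_WAGE_BASE_2021)
    return {
        "box_1_wages": wages,
        "box_2_federal_income_tax_withheld": round(wages * withholding_rate, 2),
        "box_3_social_security_wages": ss_wages,
        "box_4_social_security_tax_withheld": round(ss_wages * SS_TAX_RATE, 2),
        "box_5_medicare_wages": wages,  # no cap on Medicare wages
        "box_6_medicare_tax_withheld": round(wages * MEDICARE_TAX_RATE, 2),
        "box_7_social_security_tips": 0.0,  # ASSUMPTION: no tipped income
    }


rng = random.Random(0)  # fixed seed so repeated runs give the same records
records = [sample_w2_boxes(rng) for _ in range(20)]
```

The `range(20)` here simply mirrors the `num_records=20` value in the updated cell. Deriving Boxes 3-6 from Box 1 keeps the fields mutually consistent, which independent sampling of each box would not guarantee.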
