Commit a65f266

Updated person sampling docs based on feedback
1 parent d2715c5 commit a65f266

1 file changed: +24 -16 lines changed

docs/concepts/person_sampling.md

Lines changed: 24 additions & 16 deletions
@@ -6,20 +6,21 @@ Person sampling in Data Designer allows you to generate synthetic person data fo
 
 Data Designer provides two ways to generate synthetic people:
 
-1. **Faker-based sampling** - Quick, basic PII generation for testing
+1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
 2. **Nemotron Personas datasets** - Demographically accurate, rich persona data
 
 ---
 
 ## Approach 1: Faker-Based Sampling
 
 ### What It Does
-Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing and prototyping.
+Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.
 
 ### Features
-- Leverages all PII data features that Faker exposes
+- Gives you access to person attributes that Faker exposes
 - Quick to set up with no additional downloads
 - Generates random names, emails, addresses, phone numbers, etc.
+- Supports [all Faker-supported locales](https://faker.readthedocs.io/en/master/locales.html)
 - **Not demographically grounded** - data patterns don't reflect real-world demographics
 
 ### Usage Example
@@ -35,21 +36,25 @@ config_builder.add_column(
         name="customer",
         sampler_type=SamplerType.PERSON_FROM_FAKER,
         params=PersonFromFakerSamplerParams(
-            locale="en_US", # Any Faker-supported locale
-            age_range=[25, 65], # Optional: filter by age range
-            sex="Female", # Optional: filter by sex ("Male" or "Female")
+            locale="en_US",
+            age_range=[25, 65],
+            sex="Female",
         ),
     )
 )
 ```
 
+See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+
 ---
 
 ## Approach 2: Nemotron Personas Datasets
 
 ### What It Does
 Uses curated Nemotron Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
 
+The NGC datasets are extended versions of the [open-source Nemotron Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
+
 ### Features
 - **Demographically accurate personal details**: Names, ages, sex, marital status, education, occupation based on census data
 - **Rich persona details**: Comprehensive behavioral profiles including:
@@ -73,14 +78,15 @@ export NGC_API_KEY="your-ngc-api-key-here"
 ```
 
 #### Step 2: Download Nemotron Personas Datasets
-Use the Data Designer CLI to download the datasets:
+Use the NGC CLI to download the datasets:
 ```bash
-ngc registry resource download-version "nvidia/nemo-microservices/nemotron-personas-dataset-en_us:0.0.6"
+ngc registry resource download-version "nvidia/nemo-microservices/nemotron-personas-dataset-en_us"
 ```
 
-This will save the datasets to:
-```
-~/.data-designer/managed-assets/datasets/
+Then move the downloaded dataset to the Data Designer managed assets directory:
+```bash
+mkdir -p ~/.data-designer/managed-assets/datasets/
+mv nemotron-personas-dataset-en_us_* ~/.data-designer/managed-assets/datasets/
 ```
 
 #### Step 3: Use PersonSampler in Your Code
@@ -96,15 +102,17 @@ config_builder.add_column(
         name="customer",
         sampler_type=SamplerType.PERSON,
         params=PersonSamplerParams(
-            locale="en_US", # Required: must be one of the managed dataset locales
-            sex="Female", # Optional: filter by sex ("Male" or "Female")
-            age_range=[25, 45], # Optional: filter by age range
-            with_synthetic_personas=True, # Optional: enable rich persona details
+            locale="en_US",
+            sex="Female",
+            age_range=[25, 45],
+            with_synthetic_personas=True,
         ),
     )
 )
 ```
 
+See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+
 ### Available Data Fields
 
 **Core Fields (all locales):**
@@ -131,7 +139,7 @@ config_builder.add_column(
 | `bachelors_field` | string or None | |
 | `occupation` | string or None | |
 | `email_address` | string | |
-| `national_id` | string | SSN for US locale |
+| `national_id` | string |
 
 **Japan-Specific Fields (`ja_JP`):**
 - `area`
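
The revised Step 2 in this diff replaces an automatic save location with a manual `mkdir -p` and `mv`. For readers who would rather script that step, here is a minimal Python sketch of the same move. The helper name `move_downloaded_datasets` is hypothetical and not part of Data Designer or the NGC CLI. The default directory and glob pattern are copied from the shell commands in the diff. Skipping targets that already exist is an added assumption, not behavior described in the docs.

```python
from __future__ import annotations

import shutil
from pathlib import Path

# Defaults copied from the shell commands shown in the diff.
DEFAULT_MANAGED_DIR = Path.home() / ".data-designer" / "managed-assets" / "datasets"
DEFAULT_PATTERN = "nemotron-personas-dataset-en_us_*"


def move_downloaded_datasets(
    download_dir: Path,
    managed_dir: Path = DEFAULT_MANAGED_DIR,
    pattern: str = DEFAULT_PATTERN,
) -> list[Path]:
    """Hypothetical helper mirroring `mkdir -p <managed_dir>` and `mv <pattern> <managed_dir>`."""
    # Same effect as `mkdir -p`: create parents, tolerate an existing directory.
    managed_dir.mkdir(parents=True, exist_ok=True)

    moved: list[Path] = []
    for source in sorted(download_dir.glob(pattern)):
        target = managed_dir / source.name
        if target.exists():
            # Assumption: leave an existing dataset untouched rather than
            # overwrite it or nest the new copy inside it.
            continue
        shutil.move(str(source), str(target))
        moved.append(target)
    return moved
```

Calling `move_downloaded_datasets(Path.cwd())` from the directory where `ngc` placed the download should have the same effect as the two shell commands. It returns the new paths so the caller can confirm what was moved.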
