
Commit 8d7a073

kirit93 and johnnygreco authored
docs: Updated Person Sampling docs (#120)
* Updated Person Sampling docs
* Updated mv command
* Removed versions
* Updated mv command

---------

Co-authored-by: Johnny Greco <[email protected]>
1 parent 48fdc8c commit 8d7a073

1 file changed, +151 -2 lines changed


docs/concepts/person_sampling.md

Lines changed: 151 additions & 2 deletions
@@ -1,8 +1,17 @@
11
# Person Sampling in Data Designer
22

3-
Person sampling in Data Designer allows you to generate synthetic person data for your datasets using the Faker library.
3+
Person sampling in Data Designer allows you to generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases.
44

5-
## Faker-Based Sampling
5+
## Overview
6+
7+
Data Designer provides two ways to generate synthetic people:
8+
9+
1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
2. **Nemotron Personas datasets** - Demographically accurate, rich persona data
11+
12+
---
13+
14+
## Approach 1: Faker-Based Sampling
615

716
### What It Does
817
Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.
@@ -34,3 +43,143 @@ config_builder.add_column(
3443
)
3544
)
3645
```
46+
47+
See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
48+
49+
---
50+
51+
## Approach 2: Nemotron Personas Datasets
52+
53+
### What It Does
54+
Uses curated Nemotron Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
55+
56+
The NGC datasets are extended versions of the [open-source Nemotron Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
57+
58+
### Features
59+
- **Demographically accurate personal details**: Names, ages, sex, marital status, education, occupation based on census data
- **Rich persona details**: Comprehensive behavioral profiles including:
    - Big Five personality traits with scores
    - Cultural backgrounds and narratives
    - Skills and hobbies
    - Career goals and aspirations
    - Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.)
- Consistent, referenceable attributes across your dataset
- Grounded in real-world demographic distributions
68+
69+
### Prerequisites
70+
71+
Download the Nemotron Personas datasets you want to use from the [NGC catalog](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas). To do so, you need:
72+
73+
1. **NGC API Key**: Obtain one from [NVIDIA GPU Cloud](https://ngc.nvidia.com/)
2. **NGC CLI**: Install it from the [NGC CLI installer page](https://org.ngc.nvidia.com/setup/installers/cli)
75+
76+
### Setup Instructions
77+
78+
#### Step 1: Set Your NGC API Key
79+
```bash
export NGC_API_KEY="your-ngc-api-key-here"
```
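
If you drive the download from a script, it can help to confirm the key is actually exported before calling the NGC CLI. The following is a minimal sketch, assuming the key is read from the `NGC_API_KEY` environment variable set above; `require_ngc_api_key` is a hypothetical helper for illustration and is not part of Data Designer or the NGC CLI.

```python
import os
from collections.abc import Mapping


def require_ngc_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the NGC API key from `env`, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of Data Designer or the NGC CLI.
    """
    # Treat a whitespace-only value the same as an unset variable.
    key = env.get("NGC_API_KEY", "").strip()
    if not key:
        raise RuntimeError("NGC_API_KEY is not set; export it before running the NGC CLI.")
    return key
```

Passing the environment as an argument keeps the check easy to exercise without touching the real shell environment.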
82+
83+
#### Step 2: Download Nemotron Personas Datasets
84+
Use the NGC CLI to download the datasets:
85+
```bash
# For Nemotron Personas USA
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"

# For Nemotron Personas IN
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"

# For Nemotron Personas JP
ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"
```
97+
98+
Then move the downloaded Parquet files to the Data Designer managed assets directory:
99+
```bash
mkdir -p ~/.data-designer/managed-assets/datasets/
mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/
```
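
To confirm the move worked, you can list the Parquet files in the managed assets directory. This is a minimal sketch using only the standard library; the directory is the one from the `mv` command above, and `list_managed_datasets` is a hypothetical helper, not a Data Designer function.

```python
from pathlib import Path

# Directory used by the `mv` command above.
DEFAULT_DATASETS_DIR = Path.home() / ".data-designer" / "managed-assets" / "datasets"


def list_managed_datasets(datasets_dir: Path = DEFAULT_DATASETS_DIR) -> list[str]:
    """Return the sorted names of the .parquet files in `datasets_dir`.

    Returns an empty list if the directory does not exist.
    Hypothetical helper for illustration; not part of Data Designer.
    """
    if not datasets_dir.is_dir():
        return []
    return sorted(p.name for p in datasets_dir.glob("*.parquet"))
```

An empty result suggests the download or the move did not complete, which is worth checking before using the sampler.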
103+
104+
#### Step 3: Use PersonSampler in Your Code
105+
```python
from data_designer.essentials import (
    SamplerColumnConfig,
    SamplerType,
    PersonSamplerParams,
)

# `config_builder` is assumed to have been created earlier in your script.
config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(
            locale="en_US",
            sex="Female",
            age_range=[25, 45],
            with_synthetic_personas=True,
        ),
    )
)
```
125+
126+
See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
127+
128+
### Available Data Fields
129+
130+
**Core Fields (all locales):**
131+
132+
| Field | Type | Notes |
|-------|------|-------|
| `uuid` | UUID | Unique identifier |
| `first_name` | string | |
| `middle_name` | string | |
| `last_name` | string | |
| `sex` | enum | "Male" or "Female" |
| `birth_date` | date | Derived: year, month, day |
| `street_number` | int | |
| `street_name` | string | |
| `unit` | string | Address line 2 |
| `city` | string | |
| `region` | string | Alias: state |
| `district` | string | Alias: county |
| `postcode` | string | Alias: zipcode |
| `country` | string | |
| `phone_number` | PhoneNumber | Derived: area_code, country_code, prefix, line_number |
| `marital_status` | string | Values: never_married, married_present, separated, widowed, divorced |
| `education_level` | string or None | |
| `bachelors_field` | string or None | |
| `occupation` | string or None | |
| `email_address` | string | |
| `national_id` | string | |
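
Records carry a `birth_date`, while filtering (as in the `age_range=[25, 45]` example above) is expressed in ages. The following is a minimal sketch of how the two relate, assuming ages are counted in completed years and that both bounds are inclusive; `age_on` and `in_age_range` are hypothetical helpers, and Data Designer's own filtering may differ.

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Return the age in completed years on `today`."""
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - int(before_birthday)


def in_age_range(birth_date: date, age_range: list[int], today: date) -> bool:
    """Return True if the age falls within [min_age, max_age] (bounds assumed inclusive)."""
    min_age, max_age = age_range
    return min_age <= age_on(birth_date, today) <= max_age
```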
155+
156+
**Japan-Specific Fields (`ja_JP`):**
157+
- `area`
158+
159+
**India-Specific Fields (`en_IN`, `hi_IN`):**
160+
- `religion` - Census-reported religion
- `education_degree` - Census-reported education degree
- `first_language` - Native language
- `second_language` - Second language (if applicable)
- `third_language` - Third language (if applicable)
- `zone` - Urban vs rural
166+
167+
**With Synthetic Personas Enabled:**
168+
- Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
- Cultural background narratives
- Skills and competencies
- Hobbies and interests
- Career goals
- Context-specific personas (professional, financial, healthcare, sports, arts & entertainment, travel, culinary, etc.)
174+
175+
### Configuration Parameters
176+
177+
| Parameter | Type | Description |
|-----------|------|-------------|
| `locale` | str | Language/region code - must be one of: "en_US", "ja_JP", "en_IN", "hi_IN" |
| `sex` | str (optional) | Filter by "Male" or "Female" |
| `city` | str or list[str] (optional) | Filter by specific city or cities within locale |
| `age_range` | list[int] (optional) | Two-element list [min_age, max_age] (default: [18, 114]) |
| `with_synthetic_personas` | bool (optional) | Include rich personality profiles (default: False) |
| `select_field_values` | dict (optional) | Custom field-based filtering (e.g., {"state": ["NY", "CA"], "education_level": ["bachelors"]}) |
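
The constraints in the table can be checked before building a config. This is a minimal sketch in plain Python that mirrors the documented values for `locale`, `sex`, and `age_range`; `check_person_params` is a hypothetical helper, not part of Data Designer, and the `min_age <= max_age` rule is an assumption rather than something the table states.

```python
# Allowed values copied from the Configuration Parameters table.
ALLOWED_LOCALES = {"en_US", "ja_JP", "en_IN", "hi_IN"}
ALLOWED_SEXES = {"Male", "Female"}


def check_person_params(params: dict) -> list[str]:
    """Return a list of problems found in `params`; an empty list means none were found.

    Hypothetical helper for illustration; not part of Data Designer.
    """
    problems = []
    if params.get("locale") not in ALLOWED_LOCALES:
        problems.append(f"locale must be one of {sorted(ALLOWED_LOCALES)}")
    sex = params.get("sex")
    if sex is not None and sex not in ALLOWED_SEXES:
        problems.append('sex must be "Male" or "Female"')
    age_range = params.get("age_range")
    if age_range is not None:
        # Assumption: the lower bound should not exceed the upper bound.
        if len(age_range) != 2 or age_range[0] > age_range[1]:
            problems.append("age_range must be [min_age, max_age] with min_age <= max_age")
    return problems
```

For example, `check_person_params({"locale": "en_US", "sex": "Female", "age_range": [25, 45]})` returns an empty list, while a lowercase `"female"` is reported because the table lists the values as "Male" and "Female".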
185+
