Commit 362ec51

docs: sampler params code ref and more (#50)
* add sampler params code ref
* add persons section
* add person from faker sampler
1 parent 01fbf4d commit 362ec51

4 files changed, +469 -2 lines changed

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
# Sampler Parameters

The `sampler_params` module defines parameter configuration objects for all Data Designer sampler types. Sampler parameters are used within the [SamplerColumnConfig](column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) to specify how values should be generated for sampled columns.

!!! tip "Displaying available samplers and their parameters"
    The config builder has an `info` attribute that can be used to display the
    available sampler types and their parameters:
    ```python
    config_builder.info.display("samplers")
    ```

::: data_designer.config.sampler_params

docs/concepts/persons.md

Lines changed: 240 additions & 0 deletions
@@ -0,0 +1,240 @@
# Generate Realistic Persons

Data Designer's [SamplerColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) can be used to sample realistic person data and synthetic personas. The sampled datasets were generated using Data Designer itself together with a probabilistic graphical model trained on census data, so they are grounded in real-world demographic, geographic, and personality trait distributions and capture the diversity and richness of the population.

## Person Objects in Data Designer

### Creating Person Samplers

Person samplers generate realistic person entities with configurable attributes and optional synthetic persona data. Each sampler column creates a separate person object that you can reference throughout your data design.

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    PersonSamplerParams,
    SamplerColumnConfig,
)

config_builder = DataDesignerConfigBuilder()

# US male customer with synthetic persona fields attached
config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type="person",
        params=PersonSamplerParams(
            locale="en_US",
            sex="Male",
            with_synthetic_personas=True
        ),
    )
)


# Japanese-locale female employee without synthetic personas
config_builder.add_column(
    SamplerColumnConfig(
        name="employee",
        column_type="sampler",
        sampler_type="person",
        params=PersonSamplerParams(
            locale="ja_JP",
            sex="Female",
            with_synthetic_personas=False
        ),
    )
)

# No filters: sample a person using the default parameters
config_builder.add_column(
    SamplerColumnConfig(
        name="random_person",
        sampler_type="person",
        params=PersonSamplerParams(),
    )
)
```

### Configuration Options

Person samplers accept these configuration parameters:

**Basic Configuration:**

* `sex`: Specify "Male" or "Female" (optional)
* `locale`: Language and region code (optional, e.g., "en\_US", "ja\_JP", "hi\_IN", "en\_IN")
* `city`: Filter on cities within the specified locale (optional)
* `age_range`: Age range for filtering (default: ages above 18 only)
* `state`: Filter on US states, only valid when locale is set to "en\_US" (optional)

**Synthetic Personas Configuration:**

* `with_synthetic_personas` (default: False): When set to True, samples detailed personality profiles, cultural backgrounds, skills, interests, and context-specific personas for comprehensive character modeling. The personas are sampled from NVIDIA's [Nemotron-Personas Collection](https://huggingface.co/collections/nvidia/nemotron-personas), which currently includes [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), [Nemotron-Personas-India](https://huggingface.co/datasets/nvidia/Nemotron-Personas-India) and [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan).

**Filtering Notes:**

* With the US locale ("en\_US"), you can filter on age range, sex, city, and state
* For non-US locales, filtering is limited to age range, sex, and city
* You can filter on either city or state, not both

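The filtering rules above can be checked before a config is built. The following is a minimal sketch in plain Python; `check_person_filters` is a hypothetical helper written for illustration, not part of Data Designer, and the two-letter state value used in the usage note is an assumption about the expected format.

```python
from typing import List, Optional


def check_person_filters(
    locale: str,
    city: Optional[str] = None,
    state: Optional[str] = None,
) -> List[str]:
    """Return rule violations for a filter combination (empty list if valid).

    Hypothetical helper for illustration; not a Data Designer API.
    """
    problems = []
    # Rule from the notes: state filtering is only valid for the US locale.
    if state is not None and locale != "en_US":
        problems.append("state filtering is only valid for locale 'en_US'")
    # Rule from the notes: filter on city or state, not both.
    if city is not None and state is not None:
        problems.append("choose either city or state, not both")
    return problems
```

For example, `check_person_filters("ja_JP", state="CA")` reports one problem, while `check_person_filters("en_US", state="CA")` reports none.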
### Locale Support and Data Quality

**Grounded in real-world demographic data:** The highest-quality person and synthetic persona data is generated by sampling from the Nemotron-Personas collection. This is currently supported for the following locales: "en\_US", "ja\_JP", "hi\_IN", and "en\_IN".

**Synthetic Personas:** Available for the "en\_US", "ja\_JP", "hi\_IN", and "en\_IN" locales when `with_synthetic_personas=True`. Persona generation adapts to cultural context based on the specified locale and demographic information.

### Person Data Structure

#### Core Demographic Fields (Always Available)

| Field Name | Type | Description |
| ----- | ----- | ----- |
| uuid | str | Unique identifier |
| first\_name | str | Person's first name |
| last\_name | str | Person's last name |
| sex | categorical | Person's sex (Male or Female) |
| age | int | Person's age |
| country | str | Country name |
| marital\_status | categorical \| None | Marital status |
| education\_level | categorical \| None | Education level |
| bachelors\_field | categorical \| None | Field of bachelor's degree |
| occupation | str \| None | Occupation |
| birth\_date | date | Calculated birth date based on age |
| email\_address | str | Generated email address (None for age < 18) |
| locale | str | Locale |

#### US-Specific Fields

| Field Name | Type | Description |
| ----- | ----- | ----- |
| unit | str | Unit/apartment number |
| street\_number | int \| str | Street number (numeric or alphanumeric) |
| street\_name | str | Name of the street |
| city | str | City name |
| zipcode | str | Zipcode/Postal Code |
| state | str | State |
| county | str | County |
| bachelors\_field | categorical | Field of bachelor's degree |
| phone\_number | str | Generated phone number based on zipcode (None for age < 18) |
| ssn | str | Social Security Number |

> In addition to the above fields, person objects also contain locale-specific fields for non-US locales such as "area" for "ja\_JP".

#### Personality Traits (Available when `with_synthetic_personas=True`)

Big Five personality model with t-scores and interpretive labels:

| Field Name | Type | Description |
| ----- | ----- | ----- |
| openness | dict | Openness to experience (t\_score, label, description) |
| conscientiousness | dict | Conscientiousness (t\_score, label, description) |
| extraversion | dict | Extraversion (t\_score, label, description) |
| agreeableness | dict | Agreeableness (t\_score, label, description) |
| neuroticism | dict | Neuroticism (t\_score, label, description) |

Each personality trait contains:

* `t_score`: Standardized score (typically 0-100)
* `label`: Interpretive label ("low", "average", "high", "very high")
* `description`: Detailed behavioral description

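To make the shape of a trait entry concrete, here is a small sketch that reads the three keys listed above from a plain dictionary. The sample values are made up for illustration (including the assumption that `t_score` is numeric), and `summarize_trait` is a hypothetical helper, not a Data Designer function.

```python
# Made-up example of one trait entry, using the three documented keys.
openness = {
    "t_score": 62,
    "label": "high",
    "description": "Curious and open to new experiences.",
}


def summarize_trait(name: str, trait: dict) -> str:
    """Format a one-line summary from a trait's label and t-score."""
    return f"{name}: {trait['label']} (t-score {trait['t_score']})"


summary = summarize_trait("openness", openness)
```

The same helper works for each of the five trait fields, since they share the `t_score`, `label`, and `description` keys.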
#### Synthetic Persona Fields (Available when `with_synthetic_personas=True`)

##### Background and Development

| Field Name | Type | Description |
| ----- | ----- | ----- |
| cultural\_background | str | Detailed narrative about cultural influences and upbringing |
| skills\_and\_expertise | str | Comprehensive description of professional and personal capabilities |
| skills\_and\_expertise\_list | str | List format of key skills and competencies |
| hobbies\_and\_interests | str | Detailed description of personal interests and activities |
| hobbies\_and\_interests\_list | str | List format of hobbies and interests |
| career\_goals\_and\_ambitions | str | Professional aspirations and long-term objectives |

##### Persona Profiles

| Field Name | Type | Description |
| ----- | ----- | ----- |
| persona | str | Brief summary personality profile |
| detailed\_persona | str | Comprehensive personality and behavioral description |
| professional\_persona | str | Work environment personality and career approach |
| finance\_persona | str | Financial decision-making style and money management approach |
| healthcare\_persona | str | Health and wellness attitudes and behaviors |
| sports\_persona | str | Sports interests and physical activity preferences |
| arts\_persona | str | Artistic tastes, cultural interests, and creative preferences |
| travel\_persona | str | Travel style, preferences, and exploration approach |
| culinary\_persona | str | Food interests, cooking style, and dining preferences |

---

## Best Practices

### Choosing Configuration Options

* **Use locales that are backed by a Nemotron-Personas dataset** for maximum demographic accuracy and realism
* **Enable `with_synthetic_personas=True`** when you need rich character development, personalized content generation, or comprehensive behavioral modeling
* **Disable synthetic personas** for basic demographic testing or when computational efficiency is prioritized

### Effective Persona Usage

* **Match persona depth to use case**: Use basic personas for simple applications, detailed personas for comprehensive character modeling
* **Leverage context-specific personas**: Use `professional_persona` for workplace scenarios, `culinary_persona` for food-related applications
* **Combine multiple persona fields** in prompts for richer, more nuanced content generation

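As an illustration of combining several persona fields in one prompt, the sketch below assembles a prompt string from a person record held as a plain dictionary. `build_prompt` is a hypothetical helper and the record layout is an assumption based on the field tables; how Data Designer itself passes person fields into prompts is not shown here.

```python
from typing import List


def build_prompt(person: dict, fields: List[str], task: str) -> str:
    """Combine selected persona fields into a single prompt string.

    Hypothetical helper; fields missing from the record are skipped.
    """
    header = f"You are writing as {person['first_name']} {person['last_name']}."
    details = [f"- {name}: {person[name]}" for name in fields if person.get(name)]
    return "\n".join([header, "Persona details:", *details, f"Task: {task}"])
```

Passing `["professional_persona", "culinary_persona"]` as `fields`, for instance, yields a prompt that carries both the workplace and the food-related context for the same individual.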
### Performance Considerations

* **Synthetic personas add processing time**: Only enable when the additional data provides value
* **Cache person objects** when using the same personas across multiple columns
* **Consider batch generation** for large datasets requiring consistent persona quality

### Quality Assurance

* **Validate persona consistency**: Ensure generated content aligns with personality traits and demographic information
* **Test across different locales** to understand quality variations
* **Review persona coherence** when using multiple context-specific personas for the same individual

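A consistency check can start from the rules the field tables already state: `email_address` is None for ages under 18, and `birth_date` is calculated from `age`. The sketch below is a hypothetical validator over a person record held as a plain dictionary. It allows one year of slack between `birth_date` and `age`, because the exact calculation is not documented here.

```python
from datetime import date
from typing import List, Optional


def check_person(person: dict, today: Optional[date] = None) -> List[str]:
    """Return consistency issues found in one person record (hypothetical helper)."""
    today = today or date.today()
    issues = []
    age = person["age"]

    # Documented rule: email_address is None for age < 18.
    if age < 18 and person.get("email_address") is not None:
        issues.append("email_address should be None for age < 18")

    # birth_date is described as derived from age; tolerate a one-year gap
    # since the exact derivation is an unknown.
    birth = person.get("birth_date")
    if birth is not None:
        had_birthday = (today.month, today.day) >= (birth.month, birth.day)
        implied_age = today.year - birth.year - (0 if had_birthday else 1)
        if abs(implied_age - age) > 1:
            issues.append(f"birth_date implies age {implied_age}, record says {age}")

    return issues
```

Running a check like this over a sample of generated records is one inexpensive way to spot inconsistencies before the data is used downstream.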
---

## Person Sampling with Faker

If you do not have access to Data Designer's managed Nemotron-Personas datasets, or you need a locale that is not covered, Data Designer provides a Faker-based person sampler (`sampler_type="person_from_faker"`) that uses the [Faker library](https://faker.readthedocs.io/en/stable/) to generate person data.

**Important:** This sampler generates random personal details that are **not grounded in real-world demographic data**. It's best suited for testing, prototyping, or when you need basic person attributes in locales not yet covered by Nemotron-Personas.

### Usage Example

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    PersonFromFakerSamplerParams,
    SamplerColumnConfig,
)

config_builder = DataDesignerConfigBuilder()

# Use any locale supported by Faker
config_builder.add_column(
    SamplerColumnConfig(
        name="french_customer",
        sampler_type="person_from_faker",
        params=PersonFromFakerSamplerParams(
            locale="fr_FR",
            sex="Male",
            age_range=[25, 65],
        ),
    )
)
```

### Configuration

The Faker person sampler accepts these parameters:

* `locale`: Any locale supported by Faker (default: "en\_US"), e.g., "en\_GB", "fr\_FR", "de\_DE", "es\_ES", "it\_IT", "pt\_BR", or "zh\_CN". See [Faker's locale list](https://faker.readthedocs.io/en/master/locales.html) for all options
* `sex`: Specify "Male" or "Female" (optional)
* `city`: Filter on cities within the specified locale (optional)
* `age_range`: Age range for filtering as `[min_age, max_age]` (default: ages above 18 only)

### Limitations

* **No synthetic personas**: Does not support the `with_synthetic_personas` parameter
* **No demographic accuracy**: Data is randomly generated without realistic demographic distributions or attribute relationships
* **Locale-dependent fields**: Available address and contact fields vary by locale based on Faker's implementation
* **Limited filtering**: Only basic filtering by sex, city, and age range
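
The choice between the two samplers can be reduced to a lookup on the locale. The sketch below is a hypothetical helper, not part of Data Designer; the locale set is taken from the list of Nemotron-Personas-backed locales given earlier, and the `has_nemotron_access` flag is an illustrative stand-in for whether the managed datasets are available.

```python
# Locales listed in this document as backed by Nemotron-Personas datasets.
NEMOTRON_LOCALES = {"en_US", "ja_JP", "hi_IN", "en_IN"}


def pick_sampler_type(locale: str, has_nemotron_access: bool = True) -> str:
    """Pick a sampler_type string for a locale (hypothetical helper)."""
    if has_nemotron_access and locale in NEMOTRON_LOCALES:
        return "person"
    # Fall back to Faker for uncovered locales or when datasets are unavailable.
    return "person_from_faker"
```

The returned string can then be passed as `sampler_type` when building the `SamplerColumnConfig`, paired with the matching params class.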

mkdocs.yml

Lines changed: 4 additions & 2 deletions
@@ -3,14 +3,15 @@ repo_url: https://github.com/NVIDIA-NeMo/DataDesigner

 nav:
   - Getting Started:
-    - Welcome to Data Designer: index.md
+    - Welcome: index.md
     - Installation: installation.md
     - Quick Start: quick-start.md
     - Contributing: CONTRIBUTING.md
   - Concepts:
     - Columns: concepts/columns.md
     - Validators: concepts/validators.md
-    - Plugins: concepts/plugins.md
+    - Persons: concepts/persons.md
+    # - Plugins: concepts/plugins.md
   - Tutorials:
     - Overview: notebooks/intro.md
     - The Basics: notebooks/1-the-basics.ipynb
@@ -25,6 +26,7 @@ nav:
     - column_configs: code_reference/column_configs.md
     - config_builder: code_reference/config_builder.md
     - data_designer_config: code_reference/data_designer_config.md
+    - sampler_params: code_reference/sampler_params.md
     - validator_params: code_reference/validator_params.md

 theme:
