Commit 7b8e131

add persons section
1 parent 01e5002 commit 7b8e131

2 files changed: +195 -1 lines changed

docs/concepts/persons.md

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
# Generate Realistic Persons

Data Designer's [SamplerColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) can sample realistic person data and synthetic personas. The underlying datasets were generated with Data Designer itself together with a Probabilistic Graphical Model trained on census data, so the sampled records are grounded in real-world demographic, geographic, and personality trait distributions and capture the diversity and richness of the population.

## Person Objects in Data Designer

### Creating Person Samplers

Person samplers generate realistic person entities with configurable attributes and optional synthetic persona data. Each sampler column creates its own person object that you can reference throughout your data design.

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    PersonSamplerParams,
    SamplerColumnConfig,
)

config_builder = DataDesignerConfigBuilder()

# A US male customer with synthetic persona fields attached
config_builder.add_column(
    SamplerColumnConfig(
        name="customer",
        sampler_type="person",
        params=PersonSamplerParams(
            locale="en_US",
            sex="Male",
            with_synthetic_personas=True,
        ),
    )
)

# A Japanese female employee with demographic fields only
config_builder.add_column(
    SamplerColumnConfig(
        name="employee",
        column_type="sampler",
        sampler_type="person",
        params=PersonSamplerParams(
            locale="ja_JP",
            sex="Female",
            with_synthetic_personas=False,
        ),
    )
)

# A person sampled with the default parameters
config_builder.add_column(
    SamplerColumnConfig(
        name="random_person",
        sampler_type="person",
        params=PersonSamplerParams(),
    )
)
```

### Configuration Options

Person samplers accept these configuration parameters:

**Basic Configuration:**

* `sex`: Specify "Male" or "Female" (optional)
* `locale`: Language and region code (optional), e.g., "en\_US", "ja\_JP", "hi\_IN", "en\_IN", "fr\_FR", "de\_DE"
* `city`: Filter on cities within the specified locale (optional)
* `age_range`: Age range for filtering (default: ages above 18 only)
* `state`: Filter on US states; only valid when `locale` is set to "en\_US" (optional)

**Synthetic Personas Configuration:**

* `with_synthetic_personas` (default: False): When set to True, samples detailed personality profiles, cultural backgrounds, skills, interests, and context-specific personas for comprehensive character modeling. The personas are sampled from NVIDIA's [Nemotron-Personas Collection](https://huggingface.co/collections/nvidia/nemotron-personas), which currently includes [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), [Nemotron-Personas-India](https://huggingface.co/datasets/nvidia/Nemotron-Personas-India), and [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan).

**Filtering Notes:**

* With the US locale ("en\_US"), you can filter on age range, sex, city, and state
* For non-US locales, filtering is limited to age range, sex, and city
* You can filter on either city or state, not both

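
The filtering rules above can be written down as a small pre-flight check. The sketch below is plain Python and does not call Data Designer; `check_person_filters` and `US_LOCALE` are hypothetical names introduced only for illustration, and the library's own validation may behave differently.

```python
from typing import Optional

US_LOCALE = "en_US"  # hypothetical constant, not part of Data Designer


def check_person_filters(
    locale: str,
    city: Optional[str] = None,
    state: Optional[str] = None,
) -> list[str]:
    """Return a list of rule violations (empty when the filters are valid)."""
    problems = []
    # Rule: state filtering is only valid for the US locale.
    if state is not None and locale != US_LOCALE:
        problems.append(f"state filter requires locale {US_LOCALE!r}, got {locale!r}")
    # Rule: filter on either city or state, not both.
    if city is not None and state is not None:
        problems.append("use either city or state, not both")
    return problems


print(check_person_filters("en_US", state="CA"))               # valid combination
print(check_person_filters("ja_JP", state="CA"))               # state with a non-US locale
print(check_person_filters("en_US", city="Austin", state="TX"))  # city and state together
```

Running a check like this before building a config can surface an invalid combination early, without waiting for a generation job to fail.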
### Locale Support and Data Quality

**Grounded in real-world demographic data:** The highest-quality person and synthetic persona data is generated by sampling from the Nemotron-Personas collection. This is currently supported for the following locales: "en\_US", "ja\_JP", "hi\_IN", and "en\_IN".

**Other Locales:** For all other locales, Data Designer uses the [Faker library](https://faker.readthedocs.io/en/stable/) to generate person data (synthetic personas are not supported). Faker provides basic attributes like names and addresses, but it doesn't maintain the same demographic accuracy or attribute relationships as the Nemotron-Personas datasets.

**Synthetic Personas:** Available for the "en\_US", "ja\_JP", "hi\_IN", and "en\_IN" locales when `with_synthetic_personas=True`. Persona generation adapts to cultural context based on the specified locale and demographic information.

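
The locale split described above can be summarized in a lookup. This is a minimal sketch in plain Python, assuming only the locale list stated in this section; `NEMOTRON_LOCALES` and `describe_locale` are hypothetical helpers, not Data Designer APIs.

```python
# Locales this section lists as backed by Nemotron-Personas datasets.
NEMOTRON_LOCALES = frozenset({"en_US", "ja_JP", "hi_IN", "en_IN"})


def describe_locale(locale: str) -> dict:
    """Summarize which data source and features to expect for a locale."""
    grounded = locale in NEMOTRON_LOCALES
    return {
        "locale": locale,
        "source": "Nemotron-Personas" if grounded else "Faker",
        "synthetic_personas_available": grounded,
    }


for code in ("en_US", "hi_IN", "de_DE"):
    print(describe_locale(code))
```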
### Person Data Structure

#### Core Demographic Fields (Always Available)

| Field Name | Type | Description |
| ----- | ----- | ----- |
| uuid | str | Unique identifier |
| first\_name | str | Person's first name |
| last\_name | str | Person's last name |
| sex | categorical | Person's sex (Male or Female) |
| age | int | Person's age |
| country | str | Country name |
| marital\_status | categorical \| None | Marital status |
| education\_level | categorical \| None | Education level |
| bachelors\_field | categorical \| None | Field of bachelor's degree |
| occupation | str \| None | Occupation |
| birth\_date | date | Calculated birth date based on age |
| email\_address | str | Generated email address (None for age \< 18\) |
| locale | str | Locale |

#### US-Specific Fields

| Field Name | Type | Description |
| ----- | ----- | ----- |
| unit | str | Unit/apartment number |
| street\_number | int \| str | Street number (numeric or alphanumeric) |
| street\_name | str | Name of the street |
| city | str | City name |
| zipcode | str | Zipcode/Postal Code |
| state | str | State |
| county | str | County |
| bachelors\_field | categorical | Field of bachelor's degree |
| phone\_number | str | Generated phone number based on zipcode (None for age \< 18\) |
| ssn | str | Social Security Number |

> In addition to the above fields, person objects also contain locale-specific fields for non-US locales such as "area" for "ja_JP".

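
Because `street_number` may be numeric or alphanumeric, downstream code should not assume an integer. The sketch below shows one way to assemble a mailing address from the US-specific fields, assuming the person object is available as a plain dict and that `unit` can be empty (both are assumptions). `format_us_address` is a hypothetical helper and the record is made up for illustration.

```python
def format_us_address(person: dict) -> str:
    """Build a one-line mailing address from the US-specific fields."""
    # street_number may be an int or a str, so let the f-string convert it.
    street = f"{person['street_number']} {person['street_name']}"
    # unit is treated as optional here (an assumption); skip it when empty.
    if person.get("unit"):
        street = f"{street}, {person['unit']}"
    return f"{street}, {person['city']}, {person['state']} {person['zipcode']}"


# Made-up record for illustration only; not real sampler output.
example = {
    "unit": None,
    "street_number": 221,
    "street_name": "Maple Avenue",
    "city": "Springfield",
    "state": "IL",
    "zipcode": "62704",
}
print(format_us_address(example))
```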
#### Personality Traits (Available when `with_synthetic_personas=True`)

Big Five personality model with t-scores and interpretive labels:

| Field Name | Type | Description |
| ----- | ----- | ----- |
| openness | dict | Openness to experience (t\_score, label, description) |
| conscientiousness | dict | Conscientiousness (t\_score, label, description) |
| extraversion | dict | Extraversion (t\_score, label, description) |
| agreeableness | dict | Agreeableness (t\_score, label, description) |
| neuroticism | dict | Neuroticism (t\_score, label, description) |

Each personality trait contains:

* `t_score`: Standardized score (typically 0-100)
* `label`: Interpretive label ("low", "average", "high", "very high")
* `description`: Detailed behavioral description

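
To make the nested shape concrete, the sketch below walks the five trait fields and prints a compact summary. It assumes each trait is a dict with the three keys listed above; `summarize_traits` is a hypothetical helper, and the scores and descriptions are made-up placeholders rather than sampler output.

```python
TRAITS = ("openness", "conscientiousness", "extraversion", "agreeableness", "neuroticism")


def summarize_traits(person: dict) -> str:
    """Render the Big Five fields as 'trait: label (t_score)' lines."""
    lines = []
    for trait in TRAITS:
        entry = person[trait]  # dict with t_score, label, description
        lines.append(f"{trait}: {entry['label']} ({entry['t_score']})")
    return "\n".join(lines)


# Made-up values for illustration; real values come from the sampler.
example = {
    trait: {"t_score": 50, "label": "average", "description": "placeholder text"}
    for trait in TRAITS
}
example["openness"] = {"t_score": 63, "label": "high", "description": "placeholder text"}
print(summarize_traits(example))
```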
#### Synthetic Persona Fields (Available when `with_synthetic_personas=True`)

##### Background and Development

| Field Name | Type | Description |
| ----- | ----- | ----- |
| cultural\_background | str | Detailed narrative about cultural influences and upbringing |
| skills\_and\_expertise | str | Comprehensive description of professional and personal capabilities |
| skills\_and\_expertise\_list | str | List format of key skills and competencies |
| hobbies\_and\_interests | str | Detailed description of personal interests and activities |
| hobbies\_and\_interests\_list | str | List format of hobbies and interests |
| career\_goals\_and\_ambitions | str | Professional aspirations and long-term objectives |

##### Persona Profiles

| Field Name | Type | Description |
| ----- | ----- | ----- |
| persona | str | Brief summary personality profile |
| detailed\_persona | str | Comprehensive personality and behavioral description |
| professional\_persona | str | Work environment personality and career approach |
| finance\_persona | str | Financial decision-making style and money management approach |
| healthcare\_persona | str | Health and wellness attitudes and behaviors |
| sports\_persona | str | Sports interests and physical activity preferences |
| arts\_persona | str | Artistic tastes, cultural interests, and creative preferences |
| travel\_persona | str | Travel style, preferences, and exploration approach |
| culinary\_persona | str | Food interests, cooking style, and dining preferences |

---

## Best Practices

### Choosing Configuration Options

* **Use locales that are backed by a Nemotron-Personas dataset** for maximum demographic accuracy and realism
* **Enable `with_synthetic_personas=True`** when you need rich character development, personalized content generation, or comprehensive behavioral modeling
* **Disable synthetic personas** for basic demographic testing or when computational efficiency is prioritized

### Effective Persona Usage

* **Match persona depth to use case**: Use basic personas for simple applications, detailed personas for comprehensive character modeling
* **Leverage context-specific personas**: Use `professional_persona` for workplace scenarios, `culinary_persona` for food-related applications
* **Combine multiple persona fields** in prompts for richer, more nuanced content generation

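
As an illustration of combining persona fields, the sketch below assembles a prompt from whichever context-specific personas fit the task. It is plain Python operating on a dict stand-in for a person object; `build_persona_prompt` is a hypothetical helper and the record is made up, so treat it as one possible composition pattern rather than how Data Designer builds prompts.

```python
def build_persona_prompt(person: dict, task: str, persona_fields: list[str]) -> str:
    """Assemble a prompt that combines several persona fields for one person."""
    header = (
        f"You are writing as {person['first_name']} {person['last_name']}, "
        f"age {person['age']}."
    )
    # Include only the requested persona fields that are present and non-empty.
    sections = [
        f"{field.replace('_', ' ').title()}: {person[field]}"
        for field in persona_fields
        if person.get(field)
    ]
    return "\n".join([header, *sections, f"Task: {task}"])


# Made-up record for illustration only; not real sampler output.
person = {
    "first_name": "Ana",
    "last_name": "Silva",
    "age": 34,
    "professional_persona": "Methodical planner who documents every decision.",
    "culinary_persona": "Enjoys home cooking with seasonal produce.",
}
print(
    build_persona_prompt(
        person,
        task="Write a short message proposing a team lunch.",
        persona_fields=["professional_persona", "culinary_persona", "travel_persona"],
    )
)
```

Fields that are missing from the record (here `travel_persona`) are skipped, so the same helper works whether or not synthetic personas were enabled.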
### Performance Considerations

* **Synthetic personas add processing time**: Only enable when the additional data provides value
* **Cache person objects** when using the same personas across multiple columns
* **Consider batch generation** for large datasets requiring consistent persona quality

### Quality Assurance

* **Validate persona consistency**: Ensure generated content aligns with personality traits and demographic information
* **Test across different locales** to understand quality variations
* **Review persona coherence** when using multiple context-specific personas for the same individual
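
Some of these consistency checks can be automated. The sketch below validates a few invariants stated in the field tables (allowed `sex` values, and contact fields being None for ages under 18). It assumes records are available as plain dicts; `check_person_record` is a hypothetical helper, and a real pipeline would likely need more checks than these.

```python
def check_person_record(person: dict) -> list[str]:
    """Return human-readable problems found in one sampled person record."""
    problems = []
    if person.get("sex") not in ("Male", "Female"):
        problems.append(f"unexpected sex value: {person.get('sex')!r}")
    age = person.get("age")
    if not isinstance(age, int) or age < 0:
        problems.append(f"unexpected age value: {age!r}")
    elif age < 18:
        # Contact fields are documented as None for ages under 18.
        for field in ("email_address", "phone_number"):
            if person.get(field) is not None:
                problems.append(f"{field} should be None for age {age}")
    return problems


# Made-up records for illustration only.
adult = {"sex": "Female", "age": 41, "email_address": "a@example.com"}
minor = {"sex": "Male", "age": 16, "email_address": "b@example.com"}
print(check_person_record(adult))
print(check_person_record(minor))
```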

mkdocs.yml

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ nav:
   - Contributing: CONTRIBUTING.md
   - Concepts:
     - Columns: concepts/columns.md
-    - Plugins: concepts/plugins.md
+    - Persons: concepts/persons.md
+    # - Plugins: concepts/plugins.md
   - Tutorials:
     - The Basics: notebooks/1-the-basics.ipynb
     - Structured Outputs and Jinja Expressions: notebooks/2-structured-outputs-and-jinja-expressions.ipynb
