|
1 | | -# Generate Realistic Persons |
| 1 | +# Person Sampling in Data Designer |
2 | 2 |
|
3 | | -Data Designer's [SamplerColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) can be used to sample realistic person data and synthetic personas. Generated using Data Designer itself, as well as a Probabilistic Graphical Model trained on census data, the sampled datasets are grounded in real-world demographic, geographic, and personality trait distributions to capture the diversity and richness of the population. |
| 3 | +Person sampling in Data Designer allows you to generate synthetic person data for your datasets using the Faker library. |
4 | 4 |
|
5 | | -## Person Objects in Data Designer |
| 5 | +## Faker-Based Sampling |
6 | 6 |
|
7 | | -### Creating Person Samplers |
| 7 | +### What It Does |
| 8 | +This sampler uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but it is useful for quick testing, prototyping, or cases where realistic demographic distributions are not relevant to your use case. |
8 | 9 |
|
9 | | -Person samplers generate realistic person entities with configurable attributes and optional synthetic persona data. Each sampler creates a different person object that you can reference throughout your data design. |
10 | | - |
11 | | -```python |
12 | | -from data_designer.essentials import ( |
13 | | - DataDesignerConfigBuilder, |
14 | | - PersonSamplerParams, |
15 | | - SamplerColumnConfig, |
16 | | -) |
17 | | - |
18 | | -config_builder = DataDesignerConfigBuilder() |
19 | | - |
20 | | -config_builder.add_column( |
21 | | - SamplerColumnConfig( |
22 | | - name="customer", |
23 | | - sampler_type="person", |
24 | | - params=PersonSamplerParams( |
25 | | - locale="en_US", |
26 | | - sex="Male", |
27 | | - with_synthetic_personas=True |
28 | | - ), |
29 | | - ) |
30 | | -) |
31 | | - |
32 | | - |
33 | | -config_builder.add_column( |
34 | | - SamplerColumnConfig( |
35 | | - name="employee", |
36 | | - column_type="sampler", |
37 | | - sampler_type="person", |
38 | | - params=PersonSamplerParams( |
39 | | - locale="ja_JP", |
40 | | - sex="Female", |
41 | | - with_synthetic_personas=False |
42 | | - ), |
43 | | - ) |
44 | | -) |
45 | | - |
46 | | -config_builder.add_column( |
47 | | - SamplerColumnConfig( |
48 | | - name="random_person", |
49 | | - sampler_type="person", |
50 | | - params=PersonSamplerParams(), |
51 | | - ) |
52 | | -) |
53 | | -``` |
54 | | - |
55 | | -### Configuration Options |
56 | | - |
57 | | -Person samplers accept these configuration parameters: |
58 | | - |
59 | | -**Basic Configuration:** |
60 | | - |
61 | | -* `sex`: Specify "Male" or "Female" (optional) |
62 | | -* `locale`: Language and region code (optional, e.g., "en\_US", "ja\_JP", "hi\_IN", "en\_IN") |
63 | | -* `city`: Filter on cities within the specified locale (optional) |
64 | | -* `age_range`: Age range for filtering (default: ages above 18 only) |
65 | | -* `select_field_values`: Filter on specific field values (optional) |
66 | | - |
67 | | -**Synthetic Personas Configuration:** |
68 | | - |
69 | | -* `with_synthetic_personas` (default: False): When set to True, samples detailed personality profiles, cultural backgrounds, skills, interests, and context-specific personas for comprehensive character modeling. The personas are sampled from NVIDIA's [Nemotron-Personas Collection](https://huggingface.co/collections/nvidia/nemotron-personas), which currently includes [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), [Nemotron-Personas-India](https://huggingface.co/datasets/nvidia/Nemotron-Personas-India) and [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan). |
70 | | - |
71 | | -**Filtering Notes:** |
72 | | - |
73 | | -* When using US locale ("en\_US"), you can filter on age range, sex, city, and state |
74 | | -* For non-US locales, filtering is limited to age range, sex, and city only |
75 | | -* You can choose either city or state when filtering, not both |
76 | | - |
77 | | -### Locale Support and Data Quality |
78 | | - |
79 | | -**Grounded in real-world demographic data:** The best quality person / synthetic persona data is generated by sampling from the Nemotron-Personas collection. This is currently supported for the following locales: "en_US", "ja_JP", "hi_IN", and "en_IN". |
80 | | - |
81 | | -**Synthetic Personas:** Available for "en\_US", "ja\_JP", "hi\_IN" and "en\_IN" locales when `with_synthetic_personas=True`. Persona generation adapts to cultural context based on the specified locale and demographic information. |
82 | | - |
83 | | -### Person Data Structure |
84 | | - |
85 | | -#### Core Demographic Fields (Always Available) |
86 | | - |
87 | | -| Field Name | Type | Description | |
88 | | -| ----- | ----- | ----- | |
89 | | -| uuid | str | Unique identifier | |
90 | | -| first\_name | str | Person's first name | |
91 | | -| last\_name | str | Person's last name | |
92 | | -| sex | categorical | Person's sex (Male or Female) | |
93 | | -| age | int | Person's age | |
94 | | -| country | str | Country name | |
95 | | -| marital\_status | categorical | None | Marital status | |
96 | | -| education\_level | categorical | None | Education level | |
97 | | -| bachelors\_field | categorical | None | Field of bachelor's degree | |
98 | | -| occupation | str | None | Occupation | |
99 | | -| birth\_date | date | Calculated birth date based on age | |
100 | | -| email\_address | str | Generated email address (None for age \< 18\) | |
101 | | -| locale | str | Locale | |
102 | | - |
103 | | -#### US-Specific Fields |
104 | | - |
105 | | -| Field Name | Type | Description | |
106 | | -| ----- | ----- | ----- | |
107 | | -| unit | str | Unit/apartment number | |
108 | | -| street\_number | int | str | Street number (numeric or alphanumeric) | |
109 | | -| street\_name | str | Name of the street | |
110 | | -| city | str | City name | |
111 | | -| zipcode | str | Zipcode/Postal Code | |
112 | | -| state | str | State | |
113 | | -| county | str | County | |
114 | | -| bachelors\_field | categorical | Field of bachelor's degree | |
115 | | -| phone\_number | str | Generated phone number based on zipcode (None for age \< 18\) | |
116 | | -| ssn | str | Social Security Number | |
117 | | - |
118 | | -> In addition to the above fields, person objects also contain locale-specific fields for non-US locales such as "area" for "ja_JP". |
119 | | -
|
120 | | -#### Personality Traits (Available when `with_synthetic_personas=True`) |
121 | | - |
122 | | -Big Five personality model with t-scores and interpretive labels: |
123 | | - |
124 | | -| Field Name | Type | Description | |
125 | | -| ----- | ----- | ----- | |
126 | | -| openness | dict | Openness to experience (t\_score, label, description) | |
127 | | -| conscientiousness | dict | Conscientiousness (t\_score, label, description) | |
128 | | -| extraversion | dict | Extraversion (t\_score, label, description) | |
129 | | -| agreeableness | dict | Agreeableness (t\_score, label, description) | |
130 | | -| neuroticism | dict | Neuroticism (t\_score, label, description) | |
131 | | - |
132 | | -Each personality trait contains: |
133 | | - |
134 | | -* `t_score`: Standardized score (typically 0-100) |
135 | | -* `label`: Interpretive label ("low", "average", "high", "very high") |
136 | | -* `description`: Detailed behavioral description |
137 | | - |
138 | | -#### Synthetic Persona Fields (Available when `with_synthetic_personas=True`) |
139 | | - |
140 | | -##### Background and Development |
141 | | - |
142 | | -| Field Name | Type | Description | |
143 | | -| ----- | ----- | ----- | |
144 | | -| cultural\_background | str | Detailed narrative about cultural influences and upbringing | |
145 | | -| skills\_and\_expertise | str | Comprehensive description of professional and personal capabilities | |
146 | | -| skills\_and\_expertise\_list | str | List format of key skills and competencies | |
147 | | -| hobbies\_and\_interests | str | Detailed description of personal interests and activities | |
148 | | -| hobbies\_and\_interests\_list | str | List format of hobbies and interests | |
149 | | -| career\_goals\_and\_ambitions | str | Professional aspirations and long-term objectives | |
150 | | - |
151 | | -##### Persona Profiles |
152 | | - |
153 | | -| Field Name | Type | Description | |
154 | | -| ----- | ----- | ----- | |
155 | | -| persona | str | Brief summary personality profile | |
156 | | -| detailed\_persona | str | Comprehensive personality and behavioral description | |
157 | | -| professional\_persona | str | Work environment personality and career approach | |
158 | | -| finance\_persona | str | Financial decision-making style and money management approach | |
159 | | -| healthcare\_persona | str | Health and wellness attitudes and behaviors | |
160 | | -| sports\_persona | str | Sports interests and physical activity preferences | |
161 | | -| arts\_persona | str | Artistic tastes, cultural interests, and creative preferences | |
162 | | -| travel\_persona | str | Travel style, preferences, and exploration approach | |
163 | | -| culinary\_persona | str | Food interests, cooking style, and dining preferences | |
164 | | - |
165 | | ---- |
166 | | - |
167 | | -## Best Practices |
168 | | - |
169 | | -### Choosing Configuration Options |
170 | | - |
171 | | -* **Use locales that are backed by a Nemotron-Personas dataset** for maximum demographic accuracy and realism |
172 | | -* **Enable `with_synthetic_personas=True`** when you need rich character development, personalized content generation, or comprehensive behavioral modeling |
173 | | -* **Disable synthetic personas** for basic demographic testing or when computational efficiency is prioritized |
174 | | - |
175 | | -### Effective Persona Usage |
176 | | - |
177 | | -* **Match persona depth to use case**: Use basic personas for simple applications, detailed personas for comprehensive character modeling |
178 | | -* **Leverage context-specific personas**: Use `professional_persona` for workplace scenarios, `culinary_persona` for food-related applications |
179 | | -* **Combine multiple persona fields** in prompts for richer, more nuanced content generation |
180 | | - |
181 | | -### Performance Considerations |
182 | | - |
183 | | -* **Synthetic personas add processing time**: Only enable when the additional data provides value |
184 | | -* **Cache person objects** when using the same personas across multiple columns |
185 | | -* **Consider batch generation** for large datasets requiring consistent persona quality |
186 | | - |
187 | | -### Quality Assurance |
188 | | - |
189 | | -* **Validate persona consistency**: Ensure generated content aligns with personality traits and demographic information |
190 | | -* **Test across different locales** to understand quality variations |
191 | | -* **Review persona coherence** when using multiple context-specific personas for the same individual |
192 | | - |
193 | | ---- |
194 | | - |
195 | | -## Person Sampling with Faker |
196 | | - |
197 | | -If you do not have access to Data Designer's managed Nemotron-Personas datasets or you need a locale that is not covered by Nemotron-Personas, Data Designer provides a Faker-based person sampler (`sampler_type="person_from_faker"`) that uses the [Faker library](https://faker.readthedocs.io/en/stable/) to generate person data. |
198 | | - |
199 | | -**Important:** This sampler generates random personal details that are **not grounded in real-world demographic data**. It's best suited for testing, prototyping, or when you need basic person attributes in locales not yet covered by Nemotron-Personas. |
| 10 | +### Features |
| 11 | +- Gives you access to person attributes that Faker exposes |
| 12 | +- Quick to set up with no additional downloads |
| 13 | +- Generates random names, emails, addresses, phone numbers, etc. |
| 14 | +- Supports [all Faker-supported locales](https://faker.readthedocs.io/en/master/locales.html) |
| 15 | +- **Not demographically grounded** - data patterns don't reflect real-world demographics |
200 | 16 |
|
201 | 17 | ### Usage Example |
202 | | - |
203 | 18 | ```python |
204 | 19 | from data_designer.essentials import ( |
205 | | - DataDesignerConfigBuilder, |
206 | | - PersonFromFakerSamplerParams, |
207 | 20 | SamplerColumnConfig, |
| 21 | + SamplerType, |
| 22 | + PersonFromFakerSamplerParams, |
208 | 23 | ) |
209 | 24 |
|
210 | | -config_builder = DataDesignerConfigBuilder() |
211 | | - |
212 | | -# Use any locale supported by Faker |
213 | 25 | config_builder.add_column( |
214 | 26 | SamplerColumnConfig( |
215 | | - name="french_customer", |
216 | | - sampler_type="person_from_faker", |
| 27 | + name="customer", |
| 28 | + sampler_type=SamplerType.PERSON_FROM_FAKER, |
217 | 29 | params=PersonFromFakerSamplerParams( |
218 | | - locale="fr_FR", |
219 | | - sex="Male", |
| 30 | + locale="en_US", |
220 | 31 | age_range=[25, 65], |
| 32 | + sex="Female", |
221 | 33 | ), |
222 | 34 | ) |
223 | 35 | ) |
224 | 36 | ``` |
225 | | - |
226 | | -### Configuration |
227 | | - |
228 | | -The Faker person sampler accepts these parameters: |
229 | | - |
230 | | -* `locale`: Any locale supported by Faker (e.g., "en\_GB", "fr\_FR", "de\_DE", "es\_ES", "it\_IT", "pt\_BR", "zh\_CN"). See [Faker's locale list](https://faker.readthedocs.io/en/master/locales.html) for all options (default: "en\_US") |
231 | | -* `sex`: Specify "Male" or "Female" (optional) |
232 | | -* `city`: Filter on cities within the specified locale (optional) |
233 | | -* `age_range`: Age range for filtering as `[min_age, max_age]` (default: ages above 18 only) |
234 | | - |
235 | | -### Limitations |
236 | | - |
237 | | -* **No synthetic personas**: Does not support `with_synthetic_personas` parameter |
238 | | -* **No demographic accuracy**: Data is randomly generated without realistic demographic distributions or attribute relationships |
239 | | -* **Locale-dependent fields**: Available address and contact fields vary by locale based on Faker's implementation |
240 | | -* **Limited filtering**: Only basic filtering by sex, city, and age range |
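
The docs in this diff describe Faker-based person data as not demographically grounded. As a rough illustration of what that means, here is a minimal standard-library sketch. It is not Data Designer's or Faker's implementation: `sample_person`, the name pools, and the `(18, 100)` default age range are hypothetical. Each attribute is drawn independently and uniformly, so the output carries no realistic correlation between age, sex, and name.

```python
import random

# Hypothetical name pools for illustration only; Faker ships much larger,
# locale-specific lists.
FIRST_NAMES = {
    "Female": ["Alice", "Maria", "Yuki"],
    "Male": ["Bob", "Carlos", "Kenji"],
}
LAST_NAMES = ["Smith", "Garcia", "Tanaka"]


def sample_person(rng, age_range=(18, 100), sex=None):
    """Draw each attribute independently and uniformly (no demographic grounding).

    The default age_range is an assumption made for this sketch.
    """
    min_age, max_age = age_range
    if min_age > max_age:
        raise ValueError("age_range must satisfy min_age <= max_age")
    # If no sex is requested, pick one uniformly at random.
    chosen_sex = sex if sex is not None else rng.choice(["Female", "Male"])
    return {
        "first_name": rng.choice(FIRST_NAMES[chosen_sex]),
        "last_name": rng.choice(LAST_NAMES),
        "sex": chosen_sex,
        "age": rng.randint(min_age, max_age),  # inclusive on both ends
    }


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
people = [sample_person(rng, age_range=(25, 65), sex="Female") for _ in range(5)]
for person in people:
    print(person)
```

The `age_range` and `sex` arguments act as plain filters on the uniform draw, mirroring the parameters in the usage example. A demographically grounded sampler would instead draw from joint distributions fitted to real population data.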