| 1 | +# Generate Realistic Persons |
| 2 | + |
| 3 | +Data Designer's [SamplerColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) can sample realistic person data and synthetic personas. The underlying datasets were generated with Data Designer itself together with a probabilistic graphical model trained on census data, so the samples are grounded in real-world demographic, geographic, and personality trait distributions and reflect the diversity of the population. |
| 4 | + |
| 5 | +## Person Objects in Data Designer |
| 6 | + |
| 7 | +### Creating Person Samplers |
| 8 | + |
| 9 | +Person samplers generate realistic person entities with configurable attributes and optional synthetic persona data. Each sampler creates a different person object that you can reference throughout your data design. |
| 10 | + |
| 11 | +```python |
| 12 | +from data_designer.essentials import ( |
| 13 | + DataDesignerConfigBuilder, |
| 14 | + PersonSamplerParams, |
| 15 | + SamplerColumnConfig, |
| 16 | +) |
| 17 | + |
| 18 | +config_builder = DataDesignerConfigBuilder() |
| 19 | + |
| 20 | +config_builder.add_column( |
| 21 | + SamplerColumnConfig( |
| 22 | + name="customer", |
| 23 | + sampler_type="person", |
| 24 | + params=PersonSamplerParams( |
| 25 | + locale="en_US", |
| 26 | + sex="Male", |
| 27 | + with_synthetic_personas=True |
| 28 | + ), |
| 29 | + ) |
| 30 | +) |
| 31 | + |
| 32 | + |
| 33 | +config_builder.add_column( |
| 34 | + SamplerColumnConfig( |
| 35 | + name="employee", |
| 36 | + column_type="sampler", |
| 37 | + sampler_type="person", |
| 38 | + params=PersonSamplerParams( |
| 39 | + locale="ja_JP", |
| 40 | + sex="Female", |
| 41 | + with_synthetic_personas=False |
| 42 | + ), |
| 43 | + ) |
| 44 | +) |
| 45 | + |
| 46 | +config_builder.add_column( |
| 47 | + SamplerColumnConfig( |
| 48 | + name="random_person", |
| 49 | + sampler_type="person", |
| 50 | + params=PersonSamplerParams(), |
| 51 | + ) |
| 52 | +) |
| 53 | +``` |
| 54 | + |
| 55 | +### Configuration Options |
| 56 | + |
| 57 | +Person samplers accept these configuration parameters: |
| 58 | + |
| 59 | +**Basic Configuration:** |
| 60 | + |
| 61 | +* `sex`: Specify "Male" or "Female" (optional) |
| 62 | +* `locale`: Language and region code (optional, e.g., "en\_US", "ja\_JP", "hi\_IN", "en\_IN") |
| 63 | +* `city`: Filter on cities within the specified locale (optional) |
| 64 | +* `age_range`: Age range for filtering as `[min_age, max_age]` (default: ages 18 and older only) |
| 65 | +* `state`: Filter on US states, only valid when locale is set to "en\_US" (optional) |
| 66 | + |
| 67 | +**Synthetic Personas Configuration:** |
| 68 | + |
| 69 | +* `with_synthetic_personas` (default: False): When set to True, samples detailed personality profiles, cultural backgrounds, skills, interests, and context-specific personas for comprehensive character modeling. The personas are sampled from NVIDIA's [Nemotron-Personas Collection](https://huggingface.co/collections/nvidia/nemotron-personas), which currently includes [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), [Nemotron-Personas-India](https://huggingface.co/datasets/nvidia/Nemotron-Personas-India) and [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan). |
| 70 | + |
| 71 | +**Filtering Notes:** |
| 72 | + |
| 73 | +* When using US locale ("en\_US"), you can filter on age range, sex, city, and state |
| 74 | +* For non-US locales, filtering is limited to age range, sex, and city only |
| 75 | +* You can choose either city or state when filtering, not both |
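
The filtering notes above can be expressed as a small pre-flight check. The following is a minimal sketch in plain Python: `check_person_filters` is a hypothetical helper, not part of Data Designer, and the `"en_US"` default locale is an assumption. The library itself may validate these combinations differently.

```python
from typing import Optional


def check_person_filters(
    locale: str = "en_US",  # assumed default, for illustration only
    city: Optional[str] = None,
    state: Optional[str] = None,
) -> list[str]:
    """Return rule violations for a filter combination (empty list if allowed).

    Hypothetical helper that mirrors the filtering notes; it is not part of
    the Data Designer API.
    """
    problems = []
    # State filtering is documented as valid only for the US locale.
    if state is not None and locale != "en_US":
        problems.append('state filtering is only valid for locale "en_US"')
    # City and state are documented as mutually exclusive filters.
    if city is not None and state is not None:
        problems.append("choose either city or state, not both")
    return problems


print(check_person_filters(locale="en_US", state="CA"))
print(check_person_filters(locale="ja_JP", state="CA"))
print(check_person_filters(locale="en_US", city="Austin", state="TX"))
```

Running a check like this before building the config surfaces invalid filter combinations early instead of at sampling time.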
| 76 | + |
| 77 | +### Locale Support and Data Quality |
| 78 | + |
| 79 | +**Grounded in real-world demographic data:** The highest-quality person and synthetic persona data comes from sampling the Nemotron-Personas collection. This is currently supported for the following locales: "en_US", "ja_JP", "hi_IN", and "en_IN". |
| 80 | + |
| 81 | +**Synthetic Personas:** Available for "en\_US", "ja\_JP", "hi\_IN" and "en\_IN" locales when `with_synthetic_personas=True`. Persona generation adapts to cultural context based on the specified locale and demographic information. |
| 82 | + |
| 83 | +### Person Data Structure |
| 84 | + |
| 85 | +#### Core Demographic Fields (Always Available) |
| 86 | + |
| 87 | +| Field Name | Type | Description | |
| 88 | +| ----- | ----- | ----- | |
| 89 | +| uuid | str | Unique identifier | |
| 90 | +| first\_name | str | Person's first name | |
| 91 | +| last\_name | str | Person's last name | |
| 92 | +| sex | categorical | Person's sex (Male or Female) | |
| 93 | +| age | int | Person's age | |
| 94 | +| country | str | Country name | |
| 95 | +| marital\_status | categorical \| None | Marital status | |
| 96 | +| education\_level | categorical \| None | Education level | |
| 97 | +| bachelors\_field | categorical \| None | Field of bachelor's degree | |
| 98 | +| occupation | str \| None | Occupation | |
| 99 | +| birth\_date | date | Calculated birth date based on age | |
| 100 | +| email\_address | str | Generated email address (None for age < 18) | |
| 101 | +| locale | str | Locale | |
| 102 | + |
| 103 | +#### US-Specific Fields |
| 104 | + |
| 105 | +| Field Name | Type | Description | |
| 106 | +| ----- | ----- | ----- | |
| 107 | +| unit | str | Unit/apartment number | |
| 108 | +| street\_number | int \| str | Street number (numeric or alphanumeric) | |
| 109 | +| street\_name | str | Name of the street | |
| 110 | +| city | str | City name | |
| 111 | +| zipcode | str | Zipcode/Postal Code | |
| 112 | +| state | str | State | |
| 113 | +| county | str | County | |
| 114 | +| bachelors\_field | categorical | Field of bachelor's degree | |
| 115 | +| phone\_number | str | Generated phone number based on zipcode (None for age < 18) | |
| 116 | +| ssn | str | Social Security Number | |
| 117 | + |
| 118 | +> In addition to the above fields, person objects also contain locale-specific fields for non-US locales such as "area" for "ja_JP". |
| 119 | +
|
| 120 | +#### Personality Traits (Available when `with_synthetic_personas=True`) |
| 121 | + |
| 122 | +Big Five personality model with t-scores and interpretive labels: |
| 123 | + |
| 124 | +| Field Name | Type | Description | |
| 125 | +| ----- | ----- | ----- | |
| 126 | +| openness | dict | Openness to experience (t\_score, label, description) | |
| 127 | +| conscientiousness | dict | Conscientiousness (t\_score, label, description) | |
| 128 | +| extraversion | dict | Extraversion (t\_score, label, description) | |
| 129 | +| agreeableness | dict | Agreeableness (t\_score, label, description) | |
| 130 | +| neuroticism | dict | Neuroticism (t\_score, label, description) | |
| 131 | + |
| 132 | +Each personality trait contains: |
| 133 | + |
| 134 | +* `t_score`: Standardized score (typically 0-100) |
| 135 | +* `label`: Interpretive label ("low", "average", "high", "very high") |
| 136 | +* `description`: Detailed behavioral description |
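
To make the trait structure concrete, here is a minimal sketch that reads the three keys from each trait dict and builds a one-line summary. The values in `sample_person` are invented for illustration, and `summarize_traits` is a hypothetical helper rather than a Data Designer function.

```python
# The five trait field names listed in the table above.
TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)

# Hypothetical values; real ones come from a sampled person object.
sample_person = {
    "openness": {
        "t_score": 62,
        "label": "high",
        "description": "Curious and open to new ideas.",
    },
    "neuroticism": {
        "t_score": 41,
        "label": "low",
        "description": "Generally calm under pressure.",
    },
}


def summarize_traits(person: dict) -> str:
    """Build a compact 'trait: label (t_score=N)' summary string."""
    parts = []
    for name in TRAITS:
        trait = person.get(name)
        if trait is None:
            continue  # trait absent, e.g. personas were not sampled
        parts.append(f"{name}: {trait['label']} (t_score={trait['t_score']})")
    return "; ".join(parts)


print(summarize_traits(sample_person))
```

A compact summary like this can be easier to drop into a prompt than the full `description` text when only the overall trait profile matters.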
| 137 | + |
| 138 | +#### Synthetic Persona Fields (Available when `with_synthetic_personas=True`) |
| 139 | + |
| 140 | +##### Background and Development |
| 141 | + |
| 142 | +| Field Name | Type | Description | |
| 143 | +| ----- | ----- | ----- | |
| 144 | +| cultural\_background | str | Detailed narrative about cultural influences and upbringing | |
| 145 | +| skills\_and\_expertise | str | Comprehensive description of professional and personal capabilities | |
| 146 | +| skills\_and\_expertise\_list | str | List format of key skills and competencies | |
| 147 | +| hobbies\_and\_interests | str | Detailed description of personal interests and activities | |
| 148 | +| hobbies\_and\_interests\_list | str | List format of hobbies and interests | |
| 149 | +| career\_goals\_and\_ambitions | str | Professional aspirations and long-term objectives | |
| 150 | + |
| 151 | +##### Persona Profiles |
| 152 | + |
| 153 | +| Field Name | Type | Description | |
| 154 | +| ----- | ----- | ----- | |
| 155 | +| persona | str | Brief summary personality profile | |
| 156 | +| detailed\_persona | str | Comprehensive personality and behavioral description | |
| 157 | +| professional\_persona | str | Work environment personality and career approach | |
| 158 | +| finance\_persona | str | Financial decision-making style and money management approach | |
| 159 | +| healthcare\_persona | str | Health and wellness attitudes and behaviors | |
| 160 | +| sports\_persona | str | Sports interests and physical activity preferences | |
| 161 | +| arts\_persona | str | Artistic tastes, cultural interests, and creative preferences | |
| 162 | +| travel\_persona | str | Travel style, preferences, and exploration approach | |
| 163 | +| culinary\_persona | str | Food interests, cooking style, and dining preferences | |
| 164 | + |
| 165 | +--- |
| 166 | + |
| 167 | +## Best Practices |
| 168 | + |
| 169 | +### Choosing Configuration Options |
| 170 | + |
| 171 | +* **Use locales that are backed by a Nemotron-Personas dataset** for maximum demographic accuracy and realism |
| 172 | +* **Enable `with_synthetic_personas=True`** when you need rich character development, personalized content generation, or comprehensive behavioral modeling |
| 173 | +* **Disable synthetic personas** for basic demographic testing or when computational efficiency is prioritized |
| 174 | + |
| 175 | +### Effective Persona Usage |
| 176 | + |
| 177 | +* **Match persona depth to use case**: Use basic personas for simple applications, detailed personas for comprehensive character modeling |
| 178 | +* **Leverage context-specific personas**: Use `professional_persona` for workplace scenarios, `culinary_persona` for food-related applications |
| 179 | +* **Combine multiple persona fields** in prompts for richer, more nuanced content generation |
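
As an illustration of combining persona fields, the sketch below assembles a prompt from whichever context-specific personas are present on a person record. It uses plain Python string formatting rather than Data Designer's own prompt templating, and both `build_persona_prompt` and the sample record are hypothetical.

```python
def build_persona_prompt(person: dict, fields: list[str], task: str) -> str:
    """Combine selected persona fields into a single prompt string.

    Hypothetical helper for illustration; fields that are missing or empty
    on the person record are skipped.
    """
    lines = [f"You are writing as {person['first_name']} {person['last_name']}."]
    for field in fields:
        value = person.get(field)
        if value:
            # Turn e.g. "professional_persona" into "Professional persona".
            label = field.replace("_", " ").capitalize()
            lines.append(f"{label}: {value}")
    lines.append(f"Task: {task}")
    return "\n".join(lines)


# Invented record; real values come from a sampled person object.
person = {
    "first_name": "Maria",
    "last_name": "Lopez",
    "professional_persona": "A methodical project manager who plans ahead.",
    "culinary_persona": "An adventurous home cook who enjoys regional dishes.",
}

print(
    build_persona_prompt(
        person,
        fields=["professional_persona", "culinary_persona"],
        task="Write a short review of a team lunch.",
    )
)
```

Selecting only the fields relevant to the scenario keeps prompts short while still grounding the output in the persona.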
| 180 | + |
| 181 | +### Performance Considerations |
| 182 | + |
| 183 | +* **Synthetic personas add processing time**: Only enable when the additional data provides value |
| 184 | +* **Cache person objects** when using the same personas across multiple columns |
| 185 | +* **Consider batch generation** for large datasets requiring consistent persona quality |
| 186 | + |
| 187 | +### Quality Assurance |
| 188 | + |
| 189 | +* **Validate persona consistency**: Ensure generated content aligns with personality traits and demographic information |
| 190 | +* **Test across different locales** to understand quality variations |
| 191 | +* **Review persona coherence** when using multiple context-specific personas for the same individual |
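
Some of these checks can be automated. The sketch below tests two properties stated in the field tables: that `birth_date` agrees with `age`, and that `email_address` is None for anyone under 18. It assumes `birth_date` is a `datetime.date`, allows a one-year tolerance because the exact birth-date calculation is not documented here, and uses hypothetical helper names and sample data.

```python
from datetime import date


def age_on(birth_date: date, today: date) -> int:
    """Whole years elapsed between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


def check_person_record(person: dict, today: date) -> list[str]:
    """Return consistency issues for one person record (empty list if none).

    Hypothetical helper; the one-year tolerance is an assumption.
    """
    issues = []
    if abs(age_on(person["birth_date"], today) - person["age"]) > 1:
        issues.append("birth_date does not match age")
    if person["age"] < 18 and person.get("email_address") is not None:
        issues.append("email_address should be None for age < 18")
    return issues


# Invented record; real values come from a sampled person object.
record = {
    "age": 34,
    "birth_date": date(1990, 5, 17),
    "email_address": "maria.lopez@example.com",
}
print(check_person_record(record, today=date(2025, 1, 15)))
```

Passing `today` explicitly keeps the check deterministic, which matters when validating datasets generated on an earlier date.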
| 192 | + |
| 193 | +--- |
| 194 | + |
| 195 | +## Person Sampling with Faker |
| 196 | + |
| 197 | +If you do not have access to Data Designer's managed Nemotron-Personas datasets, or you need a locale that is not covered, Data Designer provides a Faker-based person sampler (`sampler_type="person_from_faker"`) that uses the [Faker library](https://faker.readthedocs.io/en/stable/) to generate person data. |
| 198 | + |
| 199 | +**Important:** This sampler generates random personal details that are **not grounded in real-world demographic data**. It's best suited for testing, prototyping, or when you need basic person attributes in locales not yet covered by Nemotron-Personas. |
| 200 | + |
| 201 | +### Usage Example |
| 202 | + |
| 203 | +```python |
| 204 | +from data_designer.essentials import ( |
| 205 | + DataDesignerConfigBuilder, |
| 206 | + PersonFromFakerSamplerParams, |
| 207 | + SamplerColumnConfig, |
| 208 | +) |
| 209 | + |
| 210 | +config_builder = DataDesignerConfigBuilder() |
| 211 | + |
| 212 | +# Use any locale supported by Faker |
| 213 | +config_builder.add_column( |
| 214 | + SamplerColumnConfig( |
| 215 | + name="french_customer", |
| 216 | + sampler_type="person_from_faker", |
| 217 | + params=PersonFromFakerSamplerParams( |
| 218 | + locale="fr_FR", |
| 219 | + sex="Male", |
| 220 | + age_range=[25, 65], |
| 221 | + ), |
| 222 | + ) |
| 223 | +) |
| 224 | +``` |
| 225 | + |
| 226 | +### Configuration |
| 227 | + |
| 228 | +The Faker person sampler accepts these parameters: |
| 229 | + |
| 230 | +* `locale`: Any locale supported by Faker (e.g., "en\_GB", "fr\_FR", "de\_DE", "es\_ES", "it\_IT", "pt\_BR", "zh\_CN"). See [Faker's locale list](https://faker.readthedocs.io/en/master/locales.html) for all options (default: "en\_US") |
| 231 | +* `sex`: Specify "Male" or "Female" (optional) |
| 232 | +* `city`: Filter on cities within the specified locale (optional) |
| 233 | +* `age_range`: Age range for filtering as `[min_age, max_age]` (default: ages 18 and older only) |
| 234 | + |
| 235 | +### Limitations |
| 236 | + |
| 237 | +* **No synthetic personas**: Does not support the `with_synthetic_personas` parameter |
| 238 | +* **No demographic accuracy**: Data is randomly generated without realistic demographic distributions or attribute relationships |
| 239 | +* **Locale-dependent fields**: Available address and contact fields vary by locale based on Faker's implementation |
| 240 | +* **Limited filtering**: Only basic filtering by sex, city, and age range |