
Commit 4bee6d9

kirit93 and johnnygreco authored
docs: remove nemotron personas sampling from docs (for now) (#60)
* Update persona docs
* Updated person sampling docs based on feedback
* remove nemotron personas sampling
* Remove nemotron personas sampling
* Update docs/concepts/person_sampling.md

---------

Co-authored-by: Johnny Greco <[email protected]>
1 parent 585df72 commit 4bee6d9

1 file changed: +17 -221 lines changed


docs/concepts/person_sampling.md

Lines changed: 17 additions & 221 deletions
Original file line number | Diff line number | Diff line change
@@ -1,240 +1,36 @@
1-
# Generate Realistic Persons
1+
# Person Sampling in Data Designer
2 2

3-
Data Designer's [SamplerColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) can be used to sample realistic person data and synthetic personas. Generated using Data Designer itself, as well as a Probabilistic Graphical Model trained on census data, the sampled datasets are grounded in real-world demographic, geographic, and personality trait distributions to capture the diversity and richness of the population.
3+
Person sampling in Data Designer allows you to generate synthetic person data for your datasets using the Faker library.
4 4

5-
## Person Objects in Data Designer
5+
## Faker-Based Sampling
6 6

7-
### Creating Person Samplers
7+
### What It Does
8+
Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case.
8 9

9-
Person samplers generate realistic person entities with configurable attributes and optional synthetic persona data. Each sampler creates a different person object that you can reference throughout your data design.
10-
11-
```python
12-
from data_designer.essentials import (
13-
    DataDesignerConfigBuilder,
14-
    PersonSamplerParams,
15-
    SamplerColumnConfig,
16-
)
17-
18-
config_builder = DataDesignerConfigBuilder()
19-
20-
config_builder.add_column(
21-
    SamplerColumnConfig(
22-
        name="customer",
23-
        sampler_type="person",
24-
        params=PersonSamplerParams(
25-
            locale="en_US",
26-
            sex="Male",
27-
            with_synthetic_personas=True
28-
        ),
29-
    )
30-
)
31-
32-
33-
config_builder.add_column(
34-
    SamplerColumnConfig(
35-
        name="employee",
36-
        column_type="sampler",
37-
        sampler_type="person",
38-
        params=PersonSamplerParams(
39-
            locale="ja_JP",
40-
            sex="Female",
41-
            with_synthetic_personas=False
42-
        ),
43-
    )
44-
)
45-
46-
config_builder.add_column(
47-
    SamplerColumnConfig(
48-
        name="random_person",
49-
        sampler_type="person",
50-
        params=PersonSamplerParams(),
51-
    )
52-
)
53-
```
54-
55-
### Configuration Options
56-
57-
Person samplers accept these configuration parameters:
58-
59-
**Basic Configuration:**
60-
61-
* `sex`: Specify "Male" or "Female" (optional)
62-
* `locale`: Language and region code (optional, e.g., "en\_US", "ja\_JP", "hi\_IN", "en\_IN")
63-
* `city`: Filter on cities within the specified locale (optional)
64-
* `age_range`: Age range for filtering (default: ages above 18 only)
65-
* `select_field_values`: Filter on specific field values (optional)
66-
67-
**Synthetic Personas Configuration:**
68-
69-
* `with_synthetic_personas` (default: False): When set to True, samples detailed personality profiles, cultural backgrounds, skills, interests, and context-specific personas for comprehensive character modeling. The personas are sampled from NVIDIA's [Nemotron-Personas Collection](https://huggingface.co/collections/nvidia/nemotron-personas), which currently includes [Nemotron-Personas-USA](https://huggingface.co/datasets/nvidia/Nemotron-Personas-USA), [Nemotron-Personas-India](https://huggingface.co/datasets/nvidia/Nemotron-Personas-India) and [Nemotron-Personas-Japan](https://huggingface.co/datasets/nvidia/Nemotron-Personas-Japan).
70-
71-
**Filtering Notes:**
72-
73-
* When using US locale ("en\_US"), you can filter on age range, sex, city, and state
74-
* For non-US locales, filtering is limited to age range, sex, and city only
75-
* You can choose either city or state when filtering, not both
76-
77-
### Locale Support and Data Quality
78-
79-
**Grounded in real-world demographic data:** The best quality person / synthetic persona data is generated by sampling from the Nemotron-Personas collection. This is currently supported for the following locales: "en_US", "ja_JP", "hi_IN", and "en_IN".
80-
81-
**Synthetic Personas:** Available for "en\_US", "ja\_JP", "hi\_IN" and "en\_IN" locales when `with_synthetic_personas=True`. Persona generation adapts to cultural context based on the specified locale and demographic information.
82-
83-
### Person Data Structure
84-
85-
#### Core Demographic Fields (Always Available)
86-
87-
| Field Name | Type | Description |
88-
| ----- | ----- | ----- |
89-
| uuid | str | Unique identifier |
90-
| first\_name | str | Person's first name |
91-
| last\_name | str | Person's last name |
92-
| sex | categorical | Person's sex (Male or Female) |
93-
| age | int | Person's age |
94-
| country | str | Country name |
95-
| marital\_status | categorical | None | Marital status |
96-
| education\_level | categorical | None | Education level |
97-
| bachelors\_field | categorical | None | Field of bachelor's degree |
98-
| occupation | str | None | Occupation |
99-
| birth\_date | date | Calculated birth date based on age |
100-
| email\_address | str | Generated email address (None for age \< 18\) |
101-
| locale | str | Locale |
102-
103-
#### US-Specific Fields
104-
105-
| Field Name | Type | Description |
106-
| ----- | ----- | ----- |
107-
| unit | str | Unit/apartment number |
108-
| street\_number | int | str | Street number (numeric or alphanumeric) |
109-
| street\_name | str | Name of the street |
110-
| city | str | City name |
111-
| zipcode | str | Zipcode/Postal Code |
112-
| state | str | State |
113-
| county | str | County |
114-
| bachelors\_field | categorical | Field of bachelor's degree |
115-
| phone\_number | str | Generated phone number based on zipcode (None for age \< 18\) |
116-
| ssn | str | Social Security Number |
117-
118-
> In addition to the above fields, person objects also contain locale-specific fields for non-US locales such as "area" for "ja_JP".
119-
120-
#### Personality Traits (Available when `with_synthetic_personas=True`)
121-
122-
Big Five personality model with t-scores and interpretive labels:
123-
124-
| Field Name | Type | Description |
125-
| ----- | ----- | ----- |
126-
| openness | dict | Openness to experience (t\_score, label, description) |
127-
| conscientiousness | dict | Conscientiousness (t\_score, label, description) |
128-
| extraversion | dict | Extraversion (t\_score, label, description) |
129-
| agreeableness | dict | Agreeableness (t\_score, label, description) |
130-
| neuroticism | dict | Neuroticism (t\_score, label, description) |
131-
132-
Each personality trait contains:
133-
134-
* `t_score`: Standardized score (typically 0-100)
135-
* `label`: Interpretive label ("low", "average", "high", "very high")
136-
* `description`: Detailed behavioral description
137-
138-
#### Synthetic Persona Fields (Available when `with_synthetic_personas=True`)
139-
140-
##### Background and Development
141-
142-
| Field Name | Type | Description |
143-
| ----- | ----- | ----- |
144-
| cultural\_background | str | Detailed narrative about cultural influences and upbringing |
145-
| skills\_and\_expertise | str | Comprehensive description of professional and personal capabilities |
146-
| skills\_and\_expertise\_list | str | List format of key skills and competencies |
147-
| hobbies\_and\_interests | str | Detailed description of personal interests and activities |
148-
| hobbies\_and\_interests\_list | str | List format of hobbies and interests |
149-
| career\_goals\_and\_ambitions | str | Professional aspirations and long-term objectives |
150-
151-
##### Persona Profiles
152-
153-
| Field Name | Type | Description |
154-
| ----- | ----- | ----- |
155-
| persona | str | Brief summary personality profile |
156-
| detailed\_persona | str | Comprehensive personality and behavioral description |
157-
| professional\_persona | str | Work environment personality and career approach |
158-
| finance\_persona | str | Financial decision-making style and money management approach |
159-
| healthcare\_persona | str | Health and wellness attitudes and behaviors |
160-
| sports\_persona | str | Sports interests and physical activity preferences |
161-
| arts\_persona | str | Artistic tastes, cultural interests, and creative preferences |
162-
| travel\_persona | str | Travel style, preferences, and exploration approach |
163-
| culinary\_persona | str | Food interests, cooking style, and dining preferences |
164-
165-
---
166-
167-
## Best Practices
168-
169-
### Choosing Configuration Options
170-
171-
* **Use locales that are backed by a Nemotron-Personas dataset** for maximum demographic accuracy and realism
172-
* **Enable `with_synthetic_personas=True`** when you need rich character development, personalized content generation, or comprehensive behavioral modeling
173-
* **Disable synthetic personas** for basic demographic testing or when computational efficiency is prioritized
174-
175-
### Effective Persona Usage
176-
177-
* **Match persona depth to use case**: Use basic personas for simple applications, detailed personas for comprehensive character modeling
178-
* **Leverage context-specific personas**: Use `professional_persona` for workplace scenarios, `culinary_persona` for food-related applications
179-
* **Combine multiple persona fields** in prompts for richer, more nuanced content generation
180-
181-
### Performance Considerations
182-
183-
* **Synthetic personas add processing time**: Only enable when the additional data provides value
184-
* **Cache person objects** when using the same personas across multiple columns
185-
* **Consider batch generation** for large datasets requiring consistent persona quality
186-
187-
### Quality Assurance
188-
189-
* **Validate persona consistency**: Ensure generated content aligns with personality traits and demographic information
190-
* **Test across different locales** to understand quality variations
191-
* **Review persona coherence** when using multiple context-specific personas for the same individual
192-
193-
---
194-
195-
## Person Sampling with Faker
196-
197-
If you do not have access to Data Designer's managed Nemotron-Personas datasets or you need a locale that is not covered by Nemotron-Personas, Data Designer provides a Faker-based person sampler (`sampler_type="person_from_faker"`) that uses the [Faker library](https://faker.readthedocs.io/en/stable/) to generate person data.
198-
199-
**Important:** This sampler generates random personal details that are **not grounded in real-world demographic data**. It's best suited for testing, prototyping, or when you need basic person attributes in locales not yet covered by Nemotron-Personas.
10+
### Features
11+
- Gives you access to person attributes that Faker exposes
12+
- Quick to set up with no additional downloads
13+
- Generates random names, emails, addresses, phone numbers, etc.
14+
- Supports [all Faker-supported locales](https://faker.readthedocs.io/en/master/locales.html)
15+
- **Not demographically grounded** - data patterns don't reflect real-world demographics
200 16

201 17
### Usage Example
202-
203 18
```python
204 19
from data_designer.essentials import (
205-
    DataDesignerConfigBuilder,
206-
    PersonFromFakerSamplerParams,
207 20
    SamplerColumnConfig,
21+
    SamplerType,
22+
    PersonFromFakerSamplerParams,
208 23
)
209 24

210-
config_builder = DataDesignerConfigBuilder()
211-
212-
# Use any locale supported by Faker
213 25
config_builder.add_column(
214 26
    SamplerColumnConfig(
215-
        name="french_customer",
216-
        sampler_type="person_from_faker",
27+
        name="customer",
28+
        sampler_type=SamplerType.PERSON_FROM_FAKER,
217 29
        params=PersonFromFakerSamplerParams(
218-
            locale="fr_FR",
219-
            sex="Male",
30+
            locale="en_US",
220 31
            age_range=[25, 65],
32+
            sex="Female",
221 33
        ),
222 34
    )
223 35
)
224 36
```
225-
226-
### Configuration
227-
228-
The Faker person sampler accepts these parameters:
229-
230-
* `locale`: Any locale supported by Faker (e.g., "en\_GB", "fr\_FR", "de\_DE", "es\_ES", "it\_IT", "pt\_BR", "zh\_CN"). See [Faker's locale list](https://faker.readthedocs.io/en/master/locales.html) for all options (default: "en\_US")
231-
* `sex`: Specify "Male" or "Female" (optional)
232-
* `city`: Filter on cities within the specified locale (optional)
233-
* `age_range`: Age range for filtering as `[min_age, max_age]` (default: ages above 18 only)
234-
235-
### Limitations
236-
237-
* **No synthetic personas**: Does not support `with_synthetic_personas` parameter
238-
* **No demographic accuracy**: Data is randomly generated without realistic demographic distributions or attribute relationships
239-
* **Locale-dependent fields**: Available address and contact fields vary by locale based on Faker's implementation
240-
* **Limited filtering**: Only basic filtering by sex, city, and age range
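
The `age_range` parameter (`[min_age, max_age]`) and the `birth_date` field ("calculated birth date based on age") that appear in this diff can be illustrated with a small standard-library sketch. This is not Data Designer's or Faker's implementation: the helper names `sample_age`, `birth_date_for_age`, and `age_on` are hypothetical, and uniform sampling is an assumption chosen to mirror the "not demographically grounded" caveat.

```python
import random
from datetime import date, timedelta


def _years_before(d, years):
    """Return the date `years` years before `d` (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        # `d` is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year - years, day=28)


def sample_age(age_range, rng):
    """Draw an integer age uniformly from the inclusive [min_age, max_age] range.

    Assumption: uniform sampling, i.e. no real-world age distribution is used.
    """
    min_age, max_age = age_range
    if min_age > max_age:
        raise ValueError("age_range must be [min_age, max_age] with min_age <= max_age")
    return rng.randint(min_age, max_age)


def birth_date_for_age(age, today, rng):
    """Pick a random birth date for someone who is `age` years old on `today`."""
    # Latest birthday: the person turned `age` no later than today.
    latest = _years_before(today, age)
    # Earliest birthday: one day later than the date that would make them age + 1.
    earliest = _years_before(today, age + 1) + timedelta(days=1)
    span_days = (latest - earliest).days
    return earliest + timedelta(days=rng.randint(0, span_days))


def age_on(birth, today):
    """Age in whole years on `today` for someone born on `birth`."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


# Hypothetical usage mirroring age_range=[25, 65] from the example in the diff.
rng = random.Random(0)  # seeded only so the sketch is repeatable
age = sample_age([25, 65], rng)
born = birth_date_for_age(age, date.today(), rng)
```

Deriving the birth date from the sampled age, rather than sampling both independently, keeps the two fields consistent with each other, which is the property the "calculated birth date based on age" description implies.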
