Commit b71c6c1

docs: fix links and tweak person sampling (#152)
* update person sampling
* update docstring
1 parent b635e41 commit b71c6c1

2 files changed: +43 -13 lines changed

docs/concepts/person_sampling.md

Lines changed: 24 additions & 13 deletions
@@ -7,7 +7,7 @@ Person sampling in Data Designer allows you to generate synthetic person data fo
 Data Designer provides two ways to generate synthetic people:
 
 1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
-2. **Nemotron Personas datasets** - Demographically accurate, rich persona data
+2. **Nemotron-Personas datasets** - Demographically accurate, rich persona data
 
 ---
 
@@ -44,18 +44,19 @@ config_builder.add_column(
 )
 ```
 
-See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonFromFakerSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonFromFakerSamplerParams).
 
 ---
 
-## Approach 2: Nemotron Personas Datasets
+## Approach 2: Nemotron-Personas Datasets
 
 ### What It Does
-Uses curated Nemotron Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
+Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
 
-The NGC datasets are extended versions of the [open-source Nemotron Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
+The NGC datasets are extended versions of the [open-source Nemotron-Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
 
 Supported locales:
+
 - `en_US`: United States
 - `ja_JP`: Japan
 - `en_IN`: India
@@ -75,19 +76,26 @@ Supported locales:
 
 ### Prerequisites
 
-You need to download the Nemotron Personas datasets that you want to use from NGC, they are available [here](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas)
+To use the extended Nemotron-Personas datasets with Data Designer, you need to download them [from NGC](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas) and move them to the Data Designer managed assets directory.
+
+See below for step-by-step instructions.
+
+### Nemotron-Personas Datasets Setup Instructions
+
+#### Step 0: Obtain an NGC API Key and install the NGC CLI
+
+To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI.
 
 1. **NGC API Key**: Obtain from [NVIDIA GPU Cloud](https://ngc.nvidia.com/)
 2. **NGC CLI**: [NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli)
 
-### Setup Instructions
 
 #### Step 1: Set Your NGC API Key
 ```bash
 export NGC_API_KEY="your-ngc-api-key-here"
 ```
 
-#### Step 2 (option 1): Download Nemotron Personas Datasets via the Data Designer CLI
+#### Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI
 
 Once you have the NGC CLI and your NGC API key set up, you can download the datasets via the Data Designer CLI.
 
@@ -101,19 +109,19 @@ Or you can use the interactive mode to select the locales you want to download:
 data-designer download personas
 ```
 
-#### Step 2 (option 2): Download Nemotron Personas Datasets Directly
+#### Step 2 (option 2): Download Nemotron-Personas Datasets Directly
 
 Use the NGC CLI to download the datasets:
 ```bash
-# For Nemotron Personas USA
+# For Nemotron-Personas USA
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"
 
-# For Nemotron Personas IN
+# For Nemotron-Personas IN
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"
 
-# For Nemotron Personas JP
+# For Nemotron-Personas JP
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"
 ```
 
@@ -145,7 +153,7 @@ config_builder.add_column(
 )
 ```
 
-See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonSamplerParams).
 
 ### Available Data Fields
 
@@ -176,9 +184,11 @@ See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documenta
 | `national_id` | string |
 
 **Japan-Specific Fields (`ja_JP`):**
+
 - `area`
 
 **India-Specific Fields (`en_IN`, `hi_IN`, `hi_Deva_IN`, `hi_Latn_IN`):**
+
 - `religion` - Census-reported religion
 - `education_degree` - Census-reported education degree
 - `first_language` - Native language
@@ -187,6 +197,7 @@ See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documenta
 - `zone` - Urban vs rural
 
 **With Synthetic Personas Enabled:**
+
 - Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
 - Cultural background narratives
 - Skills and competencies
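
Note on the direct-download option in this file: the five `ngc registry resource download-version` commands all share one resource prefix and differ only in a lowercased locale suffix. The sketch below builds those command strings from a locale code. It is only an illustration: `ngc_download_command` and `KNOWN_LOCALES` are hypothetical names, not part of Data Designer or the NGC CLI, and the "lowercase the locale" rule is an assumption inferred from the commands shown in the diff.

```python
# Hypothetical helper; not part of Data Designer or the NGC CLI.
NGC_RESOURCE_PREFIX = "nvidia/nemotron-personas/nemotron-personas-dataset-"

# Only the locales whose NGC resources appear in the documented commands.
KNOWN_LOCALES = ("en_US", "ja_JP", "en_IN", "hi_Deva_IN", "hi_Latn_IN")


def ngc_download_command(locale: str) -> str:
    """Return the NGC CLI command string that downloads the dataset for `locale`."""
    if locale not in KNOWN_LOCALES:
        raise ValueError(f"no NGC resource listed for locale {locale!r}")
    # Assumption: the resource suffix is the lowercased locale code,
    # as in the commands shown in the documentation diff.
    resource = NGC_RESOURCE_PREFIX + locale.lower()
    return f'ngc registry resource download-version "{resource}"'


if __name__ == "__main__":
    # Print one command per known locale; nothing is executed.
    for loc in KNOWN_LOCALES:
        print(ngc_download_command(loc))
```

The helper only formats strings, so it can be checked without an NGC API key. Locales with no resource in the documented commands (for example `hi_IN`) are rejected rather than guessed.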

src/data_designer/config/sampler_params.py

Lines changed: 19 additions & 0 deletions
@@ -522,6 +522,25 @@ def _validate_locale_with_managed_datasets(self) -> Self:
 
 
 class PersonFromFakerSamplerParams(ConfigBase):
+    """Parameters for sampling synthetic person data with demographic attributes from Faker.
+
+    Uses the Faker library to generate random personal information. The data is basic and not demographically
+    accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not
+    relevant for your use case. For demographically accurate person data, use the `PersonSamplerParams` sampler.
+
+    Attributes:
+        locale: Locale string determining the language and geographic region for synthetic people.
+            Can be any locale supported by Faker.
+        sex: If specified, filters to only sample people of the specified sex. Options: "Male" or
+            "Female". If None, samples both sexes.
+        city: If specified, filters to only sample people from the specified city or cities. Can be
+            a single city name (string) or a list of city names.
+        age_range: Two-element list [min_age, max_age] specifying the age range to sample from
+            (inclusive). Defaults to a standard age range. Both values must be between the minimum and
+            maximum allowed ages.
+        sampler_type: Discriminator for the sampler type. Must be `SamplerType.PERSON_FROM_FAKER`.
+    """
+
     locale: str = Field(
         default="en_US",
         description=(
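
Note on the new docstring: it describes constraints on `sex` and `age_range` without showing how they are enforced. The sketch below illustrates those documented constraints with a plain dataclass. It is not the library's implementation: `PersonParamsSketch`, the `MIN_AGE`/`MAX_AGE` bounds, and the default age range are placeholders, since the diff does not show the real values, and the `min_age <= max_age` check is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder bounds; the diff does not show the library's actual limits.
MIN_AGE = 0
MAX_AGE = 120


@dataclass
class PersonParamsSketch:
    """Hypothetical stand-in for the documented parameters (not the library class)."""

    locale: str = "en_US"  # default shown in the diff
    sex: Optional[str] = None  # None means both sexes are sampled
    age_range: List[int] = field(default_factory=lambda: [18, 65])  # placeholder default

    def __post_init__(self) -> None:
        # The docstring lists "Male" and "Female" as the only options besides None.
        if self.sex not in (None, "Male", "Female"):
            raise ValueError('sex must be "Male", "Female", or None')
        # age_range is documented as a two-element, inclusive [min_age, max_age] list.
        if len(self.age_range) != 2:
            raise ValueError("age_range must be [min_age, max_age]")
        low, high = self.age_range
        # Both values must lie within the allowed bounds; low <= high is an assumption.
        if not (MIN_AGE <= low <= high <= MAX_AGE):
            raise ValueError(
                f"age_range values must satisfy {MIN_AGE} <= min_age <= max_age <= {MAX_AGE}"
            )


if __name__ == "__main__":
    print(PersonParamsSketch(sex="Female", age_range=[30, 40]))
```

The `city` filter and the `sampler_type` discriminator are left out to keep the sketch short.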
