|
1 | 1 | # Person Sampling in Data Designer |
2 | 2 |
|
3 | | -Person sampling in Data Designer allows you to generate synthetic person data for your datasets using the Faker library. |
| 3 | +Person sampling in Data Designer allows you to generate synthetic person data for your datasets. There are two distinct approaches, each with different capabilities and use cases. |
4 | 4 |
|
5 | | -## Faker-Based Sampling |
| 5 | +## Overview |
| 6 | + |
| 7 | +Data Designer provides two ways to generate synthetic people: |
| 8 | + |
| 9 | +1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case |
| 10 | +2. **Nemotron Personas datasets** - Demographically accurate, rich persona data |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Approach 1: Faker-Based Sampling |
6 | 15 |
|
7 | 16 | ### What It Does |
8 | 17 | Uses the Faker library to generate random personal information. The data is basic and not demographically accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not relevant for your use case. |
@@ -34,3 +43,143 @@ config_builder.add_column( |
34 | 43 | ) |
35 | 44 | ) |
36 | 45 | ``` |
| 46 | + |
| 47 | +See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details. |
| 48 | + |
| 49 | +--- |
| 50 | + |
| 51 | +## Approach 2: Nemotron Personas Datasets |
| 52 | + |
| 53 | +### What It Does |
| 54 | +Uses curated Nemotron Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics. |
| 55 | + |
| 56 | +The NGC datasets are extended versions of the [open-source Nemotron Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality. |
| 57 | + |
| 58 | +### Features |
| 59 | +- **Demographically accurate personal details**: Names, ages, sex, marital status, education, occupation based on census data |
| 60 | +- **Rich persona details**: Comprehensive behavioral profiles including: |
| 61 | + - Big Five personality traits with scores |
| 62 | + - Cultural backgrounds and narratives |
| 63 | + - Skills and hobbies |
| 64 | + - Career goals and aspirations |
| 65 | + - Context-specific personas (professional, financial, healthcare, sports, arts, travel, culinary, etc.) |
| 66 | +- Consistent, referenceable attributes across your dataset |
| 67 | +- Grounded in real-world demographic distributions |
| 68 | + |
| 69 | +### Prerequisites |
| 70 | + |
| 71 | +Download the Nemotron Personas datasets you want to use from the [NGC catalog](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas). Doing so requires the following: |
| 72 | + |
| 73 | +1. **NGC API Key**: Obtain from [NVIDIA GPU Cloud](https://ngc.nvidia.com/) |
| 74 | +2. **NGC CLI**: Install from the [NGC CLI installer page](https://org.ngc.nvidia.com/setup/installers/cli) |
| 75 | + |
| 76 | +### Setup Instructions |
| 77 | + |
| 78 | +#### Step 1: Set Your NGC API Key |
| 79 | +```bash |
| 80 | +export NGC_API_KEY="your-ngc-api-key-here" |
| 81 | +``` |
| 82 | + |
| 83 | +#### Step 2: Download Nemotron Personas Datasets |
| 84 | +Use the NGC CLI to download the datasets: |
| 85 | +```bash |
| 86 | +# For Nemotron Personas USA |
| 87 | +ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us" |
| 88 | + |
| 89 | +# For Nemotron Personas IN |
| 90 | +ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in" |
| 91 | +ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in" |
| 92 | +ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in" |
| 93 | + |
| 94 | +# For Nemotron Personas JP |
| 95 | +ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp" |
| 96 | +``` |
| 97 | + |
| 98 | +Then move the downloaded Parquet files into the Data Designer managed assets directory: |
| 99 | +```bash |
| 100 | +mkdir -p ~/.data-designer/managed-assets/datasets/ |
| 101 | +mv nemotron-personas-dataset-*/*.parquet ~/.data-designer/managed-assets/datasets/ |
| 102 | +``` |
| 103 | + |
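Before moving on, it can help to confirm that the Parquet files actually landed in the managed assets directory. The following is a minimal sketch that only uses the directory path from the `mkdir`/`mv` commands above; the helper name `list_persona_parquet_files` is hypothetical and not part of Data Designer.

```python
from pathlib import Path


def list_persona_parquet_files(assets_dir: Path) -> list[str]:
    """Return the sorted names of .parquet files directly inside assets_dir.

    Hypothetical helper for illustration; returns an empty list when the
    directory does not exist.
    """
    if not assets_dir.is_dir():
        return []
    return sorted(p.name for p in assets_dir.glob("*.parquet"))


if __name__ == "__main__":
    # Same location as used in the mkdir/mv commands above.
    datasets_dir = Path.home() / ".data-designer" / "managed-assets" / "datasets"
    files = list_persona_parquet_files(datasets_dir)
    if files:
        print(f"Found {len(files)} Parquet file(s) in {datasets_dir}:")
        for name in files:
            print(f"  {name}")
    else:
        print(f"No Parquet files found in {datasets_dir}")
```

An empty result most likely means the download or the `mv` step did not complete as expected.
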
| 104 | +#### Step 3: Use PersonSampler in Your Code |
| 105 | +```python |
| 106 | +from data_designer.essentials import ( |
| 107 | + SamplerColumnConfig, |
| 108 | + SamplerType, |
| 109 | + PersonSamplerParams, |
| 110 | +) |
| 111 | + |
| 112 | +config_builder.add_column( |
| 113 | + SamplerColumnConfig( |
| 114 | + name="customer", |
| 115 | + sampler_type=SamplerType.PERSON, |
| 116 | + params=PersonSamplerParams( |
| 117 | + locale="en_US", |
| 118 | + sex="Female", |
| 119 | + age_range=[25, 45], |
| 120 | + with_synthetic_personas=True, |
| 121 | + ), |
| 122 | + ) |
| 123 | +) |
| 124 | +``` |
| 125 | + |
| 126 | +See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details. |
| 127 | + |
| 128 | +### Available Data Fields |
| 129 | + |
| 130 | +**Core Fields (all locales):** |
| 131 | + |
| 132 | +| Field | Type | Notes | |
| 133 | +|-------|------|-------| |
| 134 | +| `uuid` | UUID | Unique identifier | |
| 135 | +| `first_name` | string | | |
| 136 | +| `middle_name` | string | | |
| 137 | +| `last_name` | string | | |
| 138 | +| `sex` | enum | "Male" or "Female" | |
| 139 | +| `birth_date` | date | Derived: year, month, day | |
| 140 | +| `street_number` | int | | |
| 141 | +| `street_name` | string | | |
| 142 | +| `unit` | string | Address line 2 | |
| 143 | +| `city` | string | | |
| 144 | +| `region` | string | Alias: state | |
| 145 | +| `district` | string | Alias: county | |
| 146 | +| `postcode` | string | Alias: zipcode | |
| 147 | +| `country` | string | | |
| 148 | +| `phone_number` | PhoneNumber | Derived: area_code, country_code, prefix, line_number | |
| 149 | +| `marital_status` | string | Values: never_married, married_present, separated, widowed, divorced | |
| 150 | +| `education_level` | string or None | | |
| 151 | +| `bachelors_field` | string or None | | |
| 152 | +| `occupation` | string or None | | |
| 153 | +| `email_address` | string | | |
| 154 | +| `national_id` | string | | |
| 155 | + |
| 156 | +**Japan-Specific Fields (`ja_JP`):** |
| 157 | +- `area` |
| 158 | + |
| 159 | +**India-Specific Fields (`en_IN`, `hi_IN`):** |
| 160 | +- `religion` - Census-reported religion |
| 161 | +- `education_degree` - Census-reported education degree |
| 162 | +- `first_language` - Native language |
| 163 | +- `second_language` - Second language (if applicable) |
| 164 | +- `third_language` - Third language (if applicable) |
| 165 | +- `zone` - Urban vs rural |
| 166 | + |
| 167 | +**With Synthetic Personas Enabled:** |
| 168 | +- Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels |
| 169 | +- Cultural background narratives |
| 170 | +- Skills and competencies |
| 171 | +- Hobbies and interests |
| 172 | +- Career goals |
| 173 | +- Context-specific personas (professional, financial, healthcare, sports, arts & entertainment, travel, culinary, etc.) |
| 174 | + |
| 175 | +### Configuration Parameters |
| 176 | + |
| 177 | +| Parameter | Type | Description | |
| 178 | +|-----------|------|-------------| |
| 179 | +| `locale` | str | Language/region code - must be one of: "en_US", "ja_JP", "en_IN", "hi_IN" | |
| 180 | +| `sex` | str (optional) | Filter by "Male" or "Female" | |
| 181 | +| `city` | str or list[str] (optional) | Filter by specific city or cities within locale | |
| 182 | +| `age_range` | list[int] (optional) | Two-element list [min_age, max_age] (default: [18, 114]) | |
| 183 | +| `with_synthetic_personas` | bool (optional) | Include rich personality profiles (default: False) | |
| 184 | +| `select_field_values` | dict (optional) | Custom field-based filtering (e.g., {"state": ["NY", "CA"], "education_level": ["bachelors"]}) | |
| 185 | + |
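The constraints in the table above can be checked before a configuration is built. The sketch below is a standalone illustration that works on a plain dictionary of parameter values; `check_person_params` is a hypothetical helper, not a Data Designer API, and the library may apply its own (possibly stricter) validation. Treating `min_age == max_age` as valid is an assumption.

```python
# Allowed values are taken from the Configuration Parameters table above.
ALLOWED_LOCALES = {"en_US", "ja_JP", "en_IN", "hi_IN"}
ALLOWED_SEXES = {"Male", "Female"}
DEFAULT_AGE_RANGE = [18, 114]


def check_person_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper for illustration only.
    """
    problems = []

    locale = params.get("locale")
    if locale not in ALLOWED_LOCALES:
        problems.append(f"unsupported locale: {locale!r}")

    sex = params.get("sex")
    if sex is not None and sex not in ALLOWED_SEXES:
        problems.append(f"sex must be 'Male' or 'Female', got {sex!r}")

    age_range = params.get("age_range", DEFAULT_AGE_RANGE)
    # Assumption: a range where min_age equals max_age is acceptable.
    if len(age_range) != 2 or age_range[0] > age_range[1]:
        problems.append(f"age_range must be [min_age, max_age], got {age_range!r}")

    return problems


if __name__ == "__main__":
    candidate = {"locale": "en_US", "sex": "Female", "age_range": [25, 45]}
    print(check_person_params(candidate))
```

The `candidate` dictionary mirrors the values passed to `PersonSamplerParams` in Step 3.
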