docs: fix links and tweak person sampling (#152)

johnnygreco · web-flow · commit b71c6c11a87d · 2025-12-18T10:10:41.000-08:00
* update person sampling

* update docstring
diff --git a/docs/concepts/person_sampling.md b/docs/concepts/person_sampling.md
@@ -7,7 +7,7 @@ Person sampling in Data Designer allows you to generate synthetic person data fo
 Data Designer provides two ways to generate synthetic people:
 
 1. **Faker-based sampling** - Quick, basic PII generation for testing or when realistic demographic distributions are not relevant for your use case
-2. **Nemotron Personas datasets** - Demographically accurate, rich persona data
+2. **Nemotron-Personas datasets** - Demographically accurate, rich persona data
 
 ---
 
@@ -44,18 +44,19 @@ config_builder.add_column(
 )
 ```
 
-See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonFromFakerSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonFromFakerSamplerParams).
 
 ---
 
-## Approach 2: Nemotron Personas Datasets
+## Approach 2: Nemotron-Personas Datasets
 
 ### What It Does
-Uses curated Nemotron Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
+Uses curated Nemotron-Personas datasets from NVIDIA GPU Cloud (NGC) to generate demographically accurate person data with rich personality profiles and behavioral characteristics.
 
-The NGC datasets are extended versions of the [open-source Nemotron Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
+The NGC datasets are extended versions of the [open-source Nemotron-Personas datasets on HuggingFace](https://huggingface.co/collections/nvidia/nemotron-personas), with additional fields and enhanced data quality.
 
 Supported locales:
+
 - `en_US`: United States
 - `ja_JP`: Japan
 - `en_IN`: India
@@ -75,19 +76,26 @@ Supported locales:
 
 ### Prerequisites
 
-You need to download the Nemotron Personas datasets that you want to use from NGC, they are available [here](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas)
+To use the extended Nemotron-Personas datasets with Data Designer, you need to download them [from NGC](https://catalog.ngc.nvidia.com/search?orderBy=scoreDESC&query=nemotron+personas) and move them to the Data Designer managed assets directory.
+
+See below for step-by-step instructions.
+
+### Nemotron-Personas Datasets Setup Instructions
+
+#### Step 0: Obtain an NGC API Key and Install the NGC CLI
+
+To download the Nemotron-Personas datasets from NGC, you will need to obtain an NGC API key and install the NGC CLI.
 
 1. **NGC API Key**: Obtain from [NVIDIA GPU Cloud](https://ngc.nvidia.com/)
 2. **NGC CLI**: [NGC CLI](https://org.ngc.nvidia.com/setup/installers/cli)
 
-### Setup Instructions
 
 #### Step 1: Set Your NGC API Key
 ```bash
 export NGC_API_KEY="your-ngc-api-key-here"
 ```
 
-#### Step 2 (option 1): Download Nemotron Personas Datasets via the Data Designer CLI
+#### Step 2 (option 1): Download Nemotron-Personas Datasets via the Data Designer CLI
 
 Once you have the NGC CLI and your NGC API key set up, you can download the datasets via the Data Designer CLI.
 
@@ -101,19 +109,19 @@ Or you can use the interactive mode to select the locales you want to download:
 data-designer download personas
 ```
 
-#### Step 2 (option 2): Download Nemotron Personas Datasets Directly
+#### Step 2 (option 2): Download Nemotron-Personas Datasets Directly
 
 Use the NGC CLI to download the datasets:
 ```bash
-# For Nemotron Personas USA
+# For Nemotron-Personas USA
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_us"
 
-# For Nemotron Personas IN
+# For Nemotron-Personas IN
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_deva_in"
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-hi_latn_in"
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-en_in"
 
-# For Nemotron Personas JP
+# For Nemotron-Personas JP
 ngc registry resource download-version "nvidia/nemotron-personas/nemotron-personas-dataset-ja_jp"
 ```
 
@@ -145,7 +153,7 @@ config_builder.add_column(
 )
 ```
 
-See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documentation for more details.
+For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonSamplerParams).
 
 ### Available Data Fields
 
@@ -176,9 +184,11 @@ See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documenta
 | `national_id` | string |
 
 **Japan-Specific Fields (`ja_JP`):**
+
 - `area`
 
 **India-Specific Fields (`en_IN`, `hi_IN`, `hi_Deva_IN`, `hi_Latn_IN`):**
+
 - `religion` - Census-reported religion
 - `education_degree` - Census-reported education degree
 - `first_language` - Native language
@@ -187,6 +197,7 @@ See the [`SamplerColumnConfig`](../api/columns.md#samplercolumnconfig) documenta
 - `zone` - Urban vs rural
 
 **With Synthetic Personas Enabled:**
+
 - Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) with t-scores and labels
 - Cultural background narratives
 - Skills and competencies
diff --git a/src/data_designer/config/sampler_params.py b/src/data_designer/config/sampler_params.py
@@ -522,6 +522,25 @@ def _validate_locale_with_managed_datasets(self) -> Self:
 
 
 class PersonFromFakerSamplerParams(ConfigBase):
+    """Parameters for sampling synthetic person data with demographic attributes from Faker.
+
+    Uses the Faker library to generate random personal information. The data is basic and not demographically
+    accurate, but is useful for quick testing, prototyping, or when realistic demographic distributions are not
+    relevant for your use case. For demographically accurate person data, use the `PersonSamplerParams` sampler.
+
+    Attributes:
+        locale: Locale string determining the language and geographic region for synthetic people.
+            Can be any locale supported by Faker.
+        sex: If specified, filters to only sample people of the specified sex. Options: "Male" or
+            "Female". If None, samples both sexes.
+        city: If specified, filters to only sample people from the specified city or cities. Can be
+            a single city name (string) or a list of city names.
+        age_range: Two-element list [min_age, max_age] specifying the age range to sample from
+            (inclusive). Defaults to a standard age range. Both values must be between the minimum and
+            maximum allowed ages.
+        sampler_type: Discriminator for the sampler type. Must be `SamplerType.PERSON_FROM_FAKER`.
+    """
+
     locale: str = Field(
         default="en_US",
         description=(