
Commit d4f3245

docs: welcome and concepts/columns (#43)
* add mike
* meth -> method; mod -> module in TOC
* messing with dark/light mode default
* staging stuff
* remove code examples from docstrings
* writing
* add columns with style
1 parent f8a9b60 commit d4f3245

File tree

15 files changed: +316 -239 lines

CONTRIBUTING.md

Lines changed: 3 additions & 3 deletions
@@ -23,11 +23,11 @@ This guide will help you get started with the contribution process.
 
 Whether you're new to the project or ready to dive in, the resources below will help you get oriented and productive quickly:
 
-1. **[README.md](README.md)** – best place to start to learn the basics of the project
+1. **[README.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/README.md)** – best place to start to learn the basics of the project
 
-2. **[AGENTS.md](AGENTS.md)** – context and instructions to help AI coding agents work on Data Designer (it's also useful for human developers!)
+2. **[AGENTS.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/AGENTS.md)** – context and instructions to help AI coding agents work on Data Designer (it's also useful for human developers!)
 
-3. **[Documentation](docs/)** – detailed documentation on Data Designer's capabilities and usage
+3. **[Documentation](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/docs/)** – detailed documentation on Data Designer's capabilities and usage
 
 ## Ways to Contribute

Makefile

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ test:
 serve-docs-locally:
 	@echo "📝 Building and serving docs..."
 	uv sync --group docs
-	uv run mkdocs serve
+	uv run mkdocs serve --livereload
 
 check-license-headers:
 	@echo "🔍 Checking license headers in all files..."

docs/CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../CONTRIBUTING.md

docs/concepts/columns.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Columns

Columns are the fundamental building blocks in Data Designer. Each column represents a field in your dataset and defines how to generate it—whether that's sampling from a distribution, calling an LLM, or applying a transformation.

!!! note "The Declarative Approach"

    Columns are **declarative specifications**. You describe *what* you want, and the framework handles *how* to generate it—managing execution order, batching, parallelization, and resources automatically.

## Column Types

Data Designer provides eight built-in column types, each optimized for different generation scenarios.
### 🎲 Sampler Columns
Sampler columns generate data using numerical sampling—fast, deterministic, and ideal for numerical and categorical dataset fields. They're significantly faster than LLMs and can produce data following specific distributions (Poisson for event counts, Gaussian for measurements, etc.).

Available sampler types:

- **UUID**: Unique identifiers
- **Category**: Categorical values with optional probability weights
- **Subcategory**: Hierarchical categorical data (states within countries, models within brands)
- **Uniform**: Evenly distributed numbers (integers or floats)
- **Gaussian**: Normally distributed values with configurable mean and standard deviation
- **Bernoulli**: Binary outcomes with specified success probability
- **Bernoulli Mixture**: Binary outcomes from multiple probability components
- **Binomial**: Count of successes in repeated trials
- **Poisson**: Count data and event frequencies
- **SciPy**: Access to the full `scipy.stats` distribution library
- **Person**: Realistic synthetic individuals with names, demographics, and attributes
- **Datetime**: Timestamps within specified ranges
- **Timedelta**: Time duration values

!!! tip "Conditional Sampling"

    Samplers support **conditional parameters** that change behavior based on other columns. Want age distributions that vary by country? Income ranges that depend on occupation? Just define conditions on existing column values.
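To make the sampling behavior concrete, here is a runnable sketch of a weighted category draw feeding a conditional Gaussian, written with plain `numpy`/`scipy.stats` rather than Data Designer's own sampler classes:

```python
# Conceptual illustration only: what sampler columns do under the hood,
# sketched with plain numpy/scipy rather than Data Designer's sampler classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Category sampler: weighted categorical values.
countries = rng.choice(["US", "JP"], size=5, p=[0.7, 0.3])

# Conditional Gaussian sampler: the age distribution depends on the country.
ages = [
    stats.norm(loc=48, scale=10).rvs(random_state=rng)
    if country == "JP"
    else stats.norm(loc=38, scale=12).rvs(random_state=rng)
    for country in countries
]
print(list(zip(countries.tolist(), [round(age) for age in ages])))
```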
### 📝 LLM-Text Columns
LLM-Text columns generate natural language text: product descriptions, customer reviews, narrative summaries, email threads, or anything requiring semantic understanding and creativity.

Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.
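For example, a prompt template that references two other columns might look like the following; rendering is shown here with plain `jinja2` to illustrate what Data Designer injects at generation time:

```python
# Conceptual illustration: how a Jinja2 prompt template pulls in values
# from other columns. Rendered here with plain jinja2; in Data Designer
# you would pass the template string as the column's prompt.
from jinja2 import Template

prompt = Template(
    "Write a three-sentence product review of {{ product_name }} "
    "from the perspective of a {{ customer_persona }}."
)
row = {"product_name": "NoiseAway headphones", "customer_persona": "frequent flyer"}
print(prompt.render(**row))
```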
!!! note "Reasoning Traces"

    Models that support extended thinking (chain-of-thought reasoning) can capture their reasoning process in a separate `{column_name}__reasoning_trace` column—useful for understanding *why* the model generated specific content. This column is automatically added to the dataset if the model and service provider parse and return reasoning content.
### 💻 LLM-Code Columns
LLM-Code columns generate code in specific programming languages. They handle the prompting and parsing necessary to extract clean code from the LLM's response—automatically detecting and extracting code from markdown blocks. You provide the prompt and choose the model; the column handles the extraction.

Supported languages: **Python, JavaScript, TypeScript, Java, Kotlin, Go, Rust, Ruby, Scala, Swift**, plus **SQL** dialects (SQLite, PostgreSQL, MySQL, T-SQL, BigQuery, ANSI SQL).
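The extraction step is conceptually simple; here is a minimal sketch of pulling code out of a fenced markdown block (Data Designer does this internally; this shows the idea, not its actual parser):

```python
# Conceptual sketch of extracting code from a fenced markdown block in a
# raw LLM response. This regex is illustrative, not Data Designer's parser.
import re

FENCE = "`" * 3  # avoid literal triple backticks inside this example
raw_response = (
    "Here is the function you asked for:\n\n"
    f"{FENCE}python\n"
    "def add(a: int, b: int) -> int:\n"
    "    return a + b\n"
    f"{FENCE}\n"
)

match = re.search(rf"{FENCE}(?:\w+)?\n(.*?){FENCE}", raw_response, flags=re.DOTALL)
code = match.group(1) if match else raw_response
print(code)
```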
### 🗂️ LLM-Structured Columns
LLM-Structured columns generate JSON with a *guaranteed schema*. Define your structure using a Pydantic model or JSON schema, and Data Designer ensures the LLM output conforms—no parsing errors, no schema drift.

Use for complex nested structures: API responses, configuration files, database records with multiple related fields, or any structured data where type safety matters. Schemas can be arbitrarily complex with nested objects, arrays, enums, and validation constraints, but success depends on the model's capabilities.
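For instance, you might define an order schema like this; how the model is attached to the column config is covered in the code reference, so only the Pydantic side is shown:

```python
# A sketch of the kind of Pydantic model you might supply as the schema
# for an LLM-Structured column; the generated JSON must conform to it.
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    sku: str
    quantity: int = Field(ge=1)      # validation constraints are enforced
    unit_price: float = Field(gt=0)

class Order(BaseModel):
    order_id: str
    customer_name: str
    items: list[LineItem]            # nested objects and arrays are supported
    total: float
```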
!!! tip "Schema Complexity and Model Choice"

    Flat schemas with simple fields are easier and more robustly produced across models. Deeply nested schemas with complex validation constraints are more sensitive to model choice—stronger models handle complexity better. If you're experiencing schema conformance issues, try simplifying the schema or switching to a more capable model.
### ⚖️ LLM-Judge Columns
LLM-Judge columns score generated content across multiple quality dimensions using LLMs as evaluators.

Define scoring rubrics (relevance, accuracy, fluency, helpfulness) and the judge model evaluates each record. Rubrics specify criteria and scoring options (1–5 scales, categorical grades, etc.), producing quantified quality metrics for every data point.

Use judge columns for data quality filtering (e.g., keep only 4+ rated responses), A/B testing generation strategies, and quality monitoring over time.
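As an illustration, a rubric might specify a name, the question being judged, and the meaning of each score; this plain-data sketch shows the shape only, and Data Designer's actual rubric classes may differ (see the code reference):

```python
# Illustrative only: the shape of a judge rubric, sketched as plain data.
# Data Designer's actual rubric classes may differ; see the code reference.
helpfulness_rubric = {
    "name": "helpfulness",
    "criteria": "How well does the response address the user's request?",
    "scoring": {
        "5": "Fully addresses the request with accurate, actionable detail",
        "3": "Partially addresses the request or omits key details",
        "1": "Off-topic, inaccurate, or unhelpful",
    },
}

# Downstream, a quality filter can keep only highly rated records,
# e.g. rows where the judge's helpfulness score is 4 or above.
```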
### 🧩 Expression Columns
Expression columns handle simple transformations using **Jinja2 templates**—concatenate first and last names, calculate numerical totals, format date strings. No LLM overhead needed.

Template capabilities (each shown in the sketch below):

- **Variable substitution**: Pull values from any existing column
- **String filters**: Uppercase, lowercase, strip whitespace, replace patterns
- **Conditional logic**: if/elif/else support
- **Arithmetic**: Add, subtract, multiply, divide
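A runnable sketch of those capabilities, rendered with plain `jinja2` (in an expression column you supply only the template string):

```python
# Runnable Jinja2 snippets illustrating expression-column templates.
# In Data Designer you would supply only the template string; plain
# jinja2 is used here to show what each template evaluates to.
from jinja2 import Template

row = {"first_name": "ada", "last_name": "lovelace", "price": 19.99, "quantity": 3}

templates = {
    "full_name": "{{ first_name | title }} {{ last_name | title }}",       # substitution + filters
    "total": "{{ '%.2f' | format(price * quantity) }}",                    # arithmetic
    "bulk_flag": "{% if quantity >= 3 %}bulk{% else %}single{% endif %}",  # conditional logic
}

for name, tpl in templates.items():
    print(name, "=", Template(tpl).render(**row))
```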
### 🔍 Validation Columns
Validation columns check generated content against rules and return structured pass/fail results.

Built-in validation types:

**Code validation** runs generated Python or SQL through a linter and reports any issues found.

**Local callable validation** accepts a Python function directly when using Data Designer as a library.

**Remote validation** sends data to HTTP endpoints for validation-as-a-service. Useful for linters, security scanners, or proprietary systems.
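A sketch of what a local callable validator could look like; the exact function signature and return shape Data Designer expects are not shown in this doc, so treat both as assumptions and check the code reference:

```python
# Sketch of a local callable validator: a plain Python function that takes
# a generated value and returns a pass/fail result. The exact signature and
# return shape expected by Data Designer are assumptions here.
def validate_review_length(value: str) -> dict:
    """Pass reviews between 20 and 500 characters."""
    ok = 20 <= len(value) <= 500
    return {
        "is_valid": ok,
        "reason": None if ok else f"length {len(value)} out of range",
    }
```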
### 🌱 Seed Dataset Columns
Seed dataset columns bootstrap generation from existing data. Provide a real dataset, and those columns become available as context for generating new synthetic data.

Typical pattern: use seed data for one part of your schema (real product names and categories), then generate synthetic fields around it (customer reviews, purchase histories, ratings). The seed data provides realism and constraints; generated columns add volume and variation.
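The pattern, sketched with plain `pandas` and a stand-in for the LLM step (the actual wiring happens declaratively through column configs):

```python
# Conceptual sketch of the seed pattern: real columns anchor each record,
# synthetic columns are generated around them. The "generation" below is
# faked with a placeholder function; Data Designer wires this up declaratively.
import pandas as pd

seed = pd.DataFrame({
    "product_name": ["NoiseAway headphones", "HydraFlask 1L"],
    "category": ["audio", "outdoor"],
})

def fake_llm_review(product: str) -> str:  # stand-in for an LLM-Text column
    return f"I've used the {product} daily for a month and it holds up well."

seed["customer_review"] = seed["product_name"].map(fake_llm_review)
print(seed)
```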
## Shared Column Properties
Every column configuration inherits from `SingleColumnConfig` with these standard properties:
### `name`
The column's identifier: it must be unique within your configuration, is used in Jinja2 references, and becomes the column name in the output DataFrame. Choose descriptive names: `user_review` > `col_17`.
### `drop`
Boolean flag (default: `False`) controlling whether the column appears in the final output. Setting `drop=True` generates the column (available as a dependency) but excludes it from the final output.
**When to drop columns:**

- Intermediate calculations that feed expressions but aren't meaningful standalone
- Context columns used only for LLM prompt templates
- Validation results that are useful during development but unwanted in production

Dropped columns participate fully in generation and the dependency graph—they're simply filtered out at the end.
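A sketch of the pattern using `LLMTextColumnConfig` from above; the `prompt` field name and the import (omitted here) are assumptions for this sketch; see the code reference for the exact API:

```python
# Sketch of the drop pattern: generate a helper column for prompting, then
# exclude it from the final dataset. The `name`/`drop` fields are the shared
# properties described above; the `prompt` field name is an assumption.
persona = LLMTextColumnConfig(
    name="persona",
    prompt="Invent a one-line customer persona for an electronics store.",
    drop=True,  # usable in templates below, filtered from the final output
)

review = LLMTextColumnConfig(
    name="review",
    prompt="Write a product review in the voice of: {{ persona }}",
)
```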
### `column_type`
Literal string identifying the column type: `"sampler"`, `"llm-text"`, `"expression"`, etc. It's set automatically by each configuration class and serves as Pydantic's discriminator for deserialization.

You rarely set this manually—instantiating `LLMTextColumnConfig` automatically sets `column_type="llm-text"`. Serialization is reversible: save to YAML, load later, and Pydantic reconstructs the exact objects.
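The mechanics are standard Pydantic discriminated unions; the stand-in classes below illustrate the round trip without claiming to be Data Designer's own:

```python
# Conceptual illustration of the discriminated-union mechanics behind the
# round trip, using stand-in classes (not Data Designer's own configs).
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class SamplerCfg(BaseModel):
    column_type: Literal["sampler"] = "sampler"
    name: str

class LLMTextCfg(BaseModel):
    column_type: Literal["llm-text"] = "llm-text"
    name: str

AnyColumn = Annotated[Union[SamplerCfg, LLMTextCfg], Field(discriminator="column_type")]

data = {"column_type": "llm-text", "name": "review"}  # e.g. loaded from YAML
cfg = TypeAdapter(AnyColumn).validate_python(data)
assert isinstance(cfg, LLMTextCfg)  # the discriminator picked the right class
```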
### `required_columns`
Computed property listing the columns that must be generated before this one. The framework derives this automatically:

- For LLM/Expression columns: extracted from Jinja2 template `{{ variables }}`
- For Validation columns: explicitly listed target columns
- For Sampler columns with conditional parameters: columns referenced in conditions

You read this property for introspection but never set it—it's always computed from the configuration details.
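For template-based columns, dependency extraction can be illustrated with Jinja2's own introspection utilities (this shows the concept, not Data Designer's internal implementation):

```python
# Conceptual illustration: extracting undeclared variables from a Jinja2
# template, which is how template-based columns can derive dependencies.
# Uses plain jinja2, not Data Designer's internal code.
from jinja2 import Environment, meta

env = Environment()
template_src = "Summarize {{ product_name }} for a {{ customer_persona }}."
ast = env.parse(template_src)
print(meta.find_undeclared_variables(ast))  # {'product_name', 'customer_persona'}
```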
### `side_effect_columns`
Computed property listing columns created implicitly alongside the primary column. Currently, only LLM columns produce side effects (reasoning trace columns like `{name}__reasoning_trace` when models use extended thinking).

For detailed information on each column type, refer to the [column configuration code reference](../code_reference/column_configs.md).

docs/concepts/plugins.md

Whitespace-only changes.

docs/css/mkdocstrings.css

Lines changed: 8 additions & 0 deletions
@@ -70,3 +70,11 @@ div.doc-contents:not(.first) {
 	translate: calc(var(--tree-thickness) * -1) calc(var(--tree-thickness) * -1);
 }
 }
+
+.doc-symbol-toc.doc-symbol-module::after {
+  content: "module";
+}
+
+.doc-symbol-toc.doc-symbol-method::after {
+  content: "method";
+}

docs/css/style.css

Lines changed: 14 additions & 0 deletions
@@ -96,6 +96,20 @@ body.show-toc .md-sidebar.md-sidebar--secondary {
 display: block !important;
 }
 
+/* Add color to TOC links on Concepts pages */
+body.show-toc .md-sidebar--secondary .md-nav__link {
+  transition: color 0.2s ease;
+}
+
+body.show-toc .md-sidebar--secondary .md-nav__link:hover {
+  color: #76B900 !important;
+}
+
+body.show-toc .md-sidebar--secondary .md-nav__link--active {
+  color: #76B900 !important;
+  font-weight: 500;
+}
+
 /* Hide footer */
 .md-footer {
 	display: none !important;

docs/index.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# 🎨 NeMo Data Designer Library

[![GitHub](https://img.shields.io/badge/github-repo-952fc6?logo=github)](https://github.com/NVIDIA-NeMo/DataDesigner) [![License](https://img.shields.io/badge/License-Apache_2.0-0074df.svg)](https://opensource.org/licenses/Apache-2.0) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html)

👋 Welcome to the Data Designer community! We're excited to have you here.

Data Designer is a **general framework** for generating **high-quality** synthetic data **from scratch** or using your own **seed data** as a starting point for domain-grounded data generation.

## Why Data Designer?

Generating high-quality synthetic data requires much more than iteratively calling an LLM.

Data Designer is **purpose-built** to support large-scale, high-quality data generation, including:

* **Diversity** – statistical distributions and variety that reflect real-world data patterns, not repetitive LLM outputs
* **Correlations** – meaningful relationships between fields that LLMs cannot maintain across independent calls
* **Steerability** – flexible control over data characteristics throughout the generation process
* **Validation** – automated quality checks and verification that data meets specifications
* **Reproducibility** – shareable and reproducible generation workflows

## How does it work?

Data Designer helps you create datasets through an intuitive, **iterative** process:

1. **⚙️ Configure** your model settings
    - Bring your own OpenAI-compatible model providers and models
    - Or use the default model providers and models to get started quickly
    - Learn more by reading the [model configuration docs](does-not-exist.md)
2. **🏗️ Design** your dataset
    - Iteratively design your dataset, column by column
    - Leverage tools like statistical samplers and LLMs to generate a variety of data types
    - Learn more by reading the [column docs](concepts/columns.md) and checking out the [tutorial notebooks](notebooks/1-the-basics.ipynb)
3. **🔁 Preview** your results and iterate
    - Generate a preview dataset stored in memory for fast iteration
    - Inspect sample records and analysis results to refine your configuration
    - Try it for yourself by running the [tutorial notebooks](notebooks/1-the-basics.ipynb)
4. **🖼️ Create** your dataset
    - Generate your full dataset and save results to disk
    - Access the generated dataset and associated artifacts for downstream use
    - Give it a try by running the [tutorial notebooks](notebooks/2-create-your-dataset.ipynb)!

## Library and Microservice

Data Designer is available as both an open-source library and a NeMo microservice.

* **Open-source Library**: Purpose-built for flexibility and customization, prioritizing UX excellence, modularity, and extensibility.
* **NeMo Microservice**: An enterprise-grade solution that offers a seamless transition from the library, allowing you to leverage other NeMo microservices and generate datasets at scale. See the [microservice docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html) for more details.

docs/installation.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
Installing Data Designer is as simple as:

=== "pip"

    ```bash
    pip install data-designer
    ```

=== "uv"

    ```bash
    uv add data-designer
    ```

## Development Installation

To install the latest development version from the GitHub repository:

=== "pip"

    ```bash
    pip install 'git+https://github.com/NVIDIA-NeMo/DataDesigner@main'
    ```

=== "uv"

    ```bash
    uv add 'git+https://github.com/NVIDIA-NeMo/DataDesigner@main'
    ```

docs/js/toc-toggle.js

Lines changed: 7 additions & 4 deletions
@@ -4,14 +4,17 @@ if (typeof document$ !== "undefined") {
   // Check if this is a Code Reference page (contains mkdocstrings content)
   const isCodeReferencePage = document.querySelector(".doc.doc-contents");
 
-  if (isCodeReferencePage) {
-    // Show TOC for Code Reference pages by adding class to body
+  // Check if this is a Concepts page (URL contains /concepts/)
+  const isConceptsPage = window.location.pathname.includes("/concepts/");
+
+  if (isCodeReferencePage || isConceptsPage) {
+    // Show TOC for Code Reference and Concepts pages by adding class to body
     document.body.classList.add("show-toc");
-    console.log("Code Reference page detected - showing TOC");
+    console.log("Code Reference or Concepts page detected - showing TOC");
   } else {
     // Hide TOC for all other pages by removing class from body
     document.body.classList.remove("show-toc");
-    console.log("Non-Code Reference page - hiding TOC");
+    console.log("Non-Code Reference/Concepts page - hiding TOC");
   }
 });
