
Commit d4f3245

docs: welcome and concepts/columns (#43)
* add mike
* meth -> method; mod -> module in TOC
* messing with dark/light mode default
* staging stuff
* remove code examples from docstrings
* writing
* add columns with style
1 parent f8a9b60 commit d4f3245

File tree

15 files changed: +316 -239 lines

CONTRIBUTING.md

Lines changed: 3 additions & 3 deletions
@@ -23,11 +23,11 @@ This guide will help you get started with the contribution process.
 
 Whether you're new to the project or ready to dive in, the resources below will help you get oriented and productive quickly:
 
-1. **[README.md](README.md)** – best place to start to learn the basics of the project
+1. **[README.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/README.md)** – best place to start to learn the basics of the project
 
-2. **[AGENTS.md](AGENTS.md)** – context and instructions to help AI coding agents work on Data Designer (it's also useful for human developers!)
+2. **[AGENTS.md](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/AGENTS.md)** – context and instructions to help AI coding agents work on Data Designer (it's also useful for human developers!)
 
-3. **[Documentation](docs/)** – detailed documentation on Data Designer's capabilities and usage
+3. **[Documentation](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/docs/)** – detailed documentation on Data Designer's capabilities and usage
 
 ## Ways to Contribute

Makefile

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ test:
 serve-docs-locally:
 	@echo "📝 Building and serving docs..."
 	uv sync --group docs
-	uv run mkdocs serve
+	uv run mkdocs serve --livereload
 
 check-license-headers:
 	@echo "🔍 Checking license headers in all files..."

docs/CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+../CONTRIBUTING.md

docs/concepts/columns.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# Columns

Columns are the fundamental building blocks in Data Designer. Each column represents a field in your dataset and defines how to generate it—whether that's sampling from a distribution, calling an LLM, or applying a transformation.

!!! note "The Declarative Approach"

    Columns are **declarative specifications**. You describe *what* you want, and the framework handles *how* to generate it—managing execution order, batching, parallelization, and resources automatically.

## Column Types

Data Designer provides eight built-in column types, each optimized for different generation scenarios.
### 🎲 Sampler Columns
Sampler columns generate data using numerical sampling—fast, deterministic, and ideal for numerical and categorical dataset fields. They're significantly faster than LLMs and can produce data following specific distributions (Poisson for event counts, Gaussian for measurements, etc.).

Available sampler types:

- **UUID**: Unique identifiers
- **Category**: Categorical values with optional probability weights
- **Subcategory**: Hierarchical categorical data (states within countries, models within brands)
- **Uniform**: Evenly distributed numbers (integers or floats)
- **Gaussian**: Normally distributed values with configurable mean and standard deviation
- **Bernoulli**: Binary outcomes with specified success probability
- **Bernoulli Mixture**: Binary outcomes from multiple probability components
- **Binomial**: Count of successes in repeated trials
- **Poisson**: Count data and event frequencies
- **SciPy**: Access to the full `scipy.stats` distribution library
- **Person**: Realistic synthetic individuals with names, demographics, and attributes
- **Datetime**: Timestamps within specified ranges
- **Timedelta**: Time duration values

!!! tip "Conditional Sampling"

    Samplers support **conditional parameters** that change behavior based on other columns. Want age distributions that vary by country? Income ranges that depend on occupation? Just define conditions on existing column values.
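To make the sampling behavior concrete, here is a runnable sketch of a weighted category draw feeding a conditional Gaussian, written with plain `numpy`/`scipy.stats` rather than Data Designer's own sampler classes:

```python
# Conceptual illustration only: what sampler columns do under the hood,
# sketched with plain numpy/scipy rather than Data Designer's sampler classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Category sampler: weighted categorical values.
countries = rng.choice(["US", "JP"], size=5, p=[0.7, 0.3])

# Conditional Gaussian sampler: the age distribution depends on the country.
ages = [
    stats.norm(loc=48, scale=10).rvs(random_state=rng)
    if country == "JP"
    else stats.norm(loc=38, scale=12).rvs(random_state=rng)
    for country in countries
]
print(list(zip(countries.tolist(), [round(age) for age in ages])))
```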
### 📝 LLM-Text Columns
LLM-Text columns generate natural language text: product descriptions, customer reviews, narrative summaries, email threads, or anything requiring semantic understanding and creativity.

Use **Jinja2 templating** in prompts to reference other columns. Data Designer automatically manages dependencies and injects the referenced column values into the prompt.
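For example, a prompt template that references two other columns might look like the following; rendering is shown here with plain `jinja2` to illustrate what Data Designer injects at generation time:

```python
# Conceptual illustration: how a Jinja2 prompt template pulls in values
# from other columns. Rendered here with plain jinja2; in Data Designer
# you would pass the template string as the column's prompt.
from jinja2 import Template

prompt = Template(
    "Write a three-sentence product review of {{ product_name }} "
    "from the perspective of a {{ customer_persona }}."
)
row = {"product_name": "NoiseAway headphones", "customer_persona": "frequent flyer"}
print(prompt.render(**row))
```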
!!! note "Reasoning Traces"

    Models that support extended thinking (chain-of-thought reasoning) can capture their reasoning process in a separate `{column_name}__reasoning_trace` column—useful for understanding *why* the model generated specific content. This column is automatically added to the dataset if the model and service provider parse and return reasoning content.
### 💻 LLM-Code Columns
LLM-Code columns generate code in specific programming languages. They handle the prompting and parsing necessary to extract clean code from the LLM's response—automatically detecting and extracting code from markdown blocks. You provide the prompt and choose the model; the column handles the extraction.

Supported languages: **Python, JavaScript, TypeScript, Java, Kotlin, Go, Rust, Ruby, Scala, Swift**, plus **SQL** dialects (SQLite, PostgreSQL, MySQL, T-SQL, BigQuery, ANSI SQL).
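The extraction step is conceptually simple; here is a minimal sketch of pulling code out of a fenced markdown block (Data Designer does this internally; this shows the idea, not its actual parser):

```python
# Conceptual sketch of extracting code from a fenced markdown block in a
# raw LLM response. This regex is illustrative, not Data Designer's parser.
import re

FENCE = "`" * 3  # avoid literal triple backticks inside this example
raw_response = (
    "Here is the function you asked for:\n\n"
    f"{FENCE}python\n"
    "def add(a: int, b: int) -> int:\n"
    "    return a + b\n"
    f"{FENCE}\n"
)

match = re.search(rf"{FENCE}(?:\w+)?\n(.*?){FENCE}", raw_response, flags=re.DOTALL)
code = match.group(1) if match else raw_response
print(code)
```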
### 🗂️ LLM-Structured Columns
LLM-Structured columns generate JSON with a *guaranteed schema*. Define your structure using a Pydantic model or JSON schema, and Data Designer ensures the LLM output conforms—no parsing errors, no schema drift.

Use for complex nested structures: API responses, configuration files, database records with multiple related fields, or any structured data where type safety matters. Schemas can be arbitrarily complex with nested objects, arrays, enums, and validation constraints, but success depends on the model's capabilities.
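For instance, you might define an order schema like this; how the model is attached to the column config is covered in the code reference, so only the Pydantic side is shown:

```python
# A sketch of the kind of Pydantic model you might supply as the schema
# for an LLM-Structured column; the generated JSON must conform to it.
from pydantic import BaseModel, Field

class LineItem(BaseModel):
    sku: str
    quantity: int = Field(ge=1)      # validation constraints are enforced
    unit_price: float = Field(gt=0)

class Order(BaseModel):
    order_id: str
    customer_name: str
    items: list[LineItem]            # nested objects and arrays are supported
    total: float
```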
!!! tip "Schema Complexity and Model Choice"

    Flat schemas with simple fields are easier and more robustly produced across models. Deeply nested schemas with complex validation constraints are more sensitive to model choice—stronger models handle complexity better. If you're experiencing schema conformance issues, try simplifying the schema or switching to a more capable model.
### ⚖️ LLM-Judge Columns
LLM-Judge columns score generated content across multiple quality dimensions using LLMs as evaluators.

Define scoring rubrics (relevance, accuracy, fluency, helpfulness) and the judge model evaluates each record. Rubrics specify criteria and scoring options (1–5 scales, categorical grades, etc.), producing quantified quality metrics for every data point.

Use judge columns for data quality filtering (e.g., keep only 4+ rated responses), A/B testing generation strategies, and quality monitoring over time.
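As an illustration, a rubric might specify a name, the question being judged, and the meaning of each score; this plain-data sketch shows the shape only, and Data Designer's actual rubric classes may differ (see the code reference):

```python
# Illustrative only: the shape of a judge rubric, sketched as plain data.
# Data Designer's actual rubric classes may differ; see the code reference.
helpfulness_rubric = {
    "name": "helpfulness",
    "criteria": "How well does the response address the user's request?",
    "scoring": {
        "5": "Fully addresses the request with accurate, actionable detail",
        "3": "Partially addresses the request or omits key details",
        "1": "Off-topic, inaccurate, or unhelpful",
    },
}

# Downstream, a quality filter can keep only highly rated records,
# e.g. rows where the judge's helpfulness score is 4 or above.
```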
### 🧩 Expression Columns
Expression columns handle simple transformations using **Jinja2 templates**—concatenate first and last names, calculate numerical totals, format date strings. No LLM overhead needed.

Template capabilities (each shown in the sketch below):

- **Variable substitution**: Pull values from any existing column
- **String filters**: Uppercase, lowercase, strip whitespace, replace patterns
- **Conditional logic**: if/elif/else support
- **Arithmetic**: Add, subtract, multiply, divide
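A runnable sketch of those capabilities, rendered with plain `jinja2` (in an expression column you supply only the template string):

```python
# Runnable Jinja2 snippets illustrating expression-column templates.
# In Data Designer you would supply only the template string; plain
# jinja2 is used here to show what each template evaluates to.
from jinja2 import Template

row = {"first_name": "ada", "last_name": "lovelace", "price": 19.99, "quantity": 3}

templates = {
    "full_name": "{{ first_name | title }} {{ last_name | title }}",       # substitution + filters
    "total": "{{ '%.2f' | format(price * quantity) }}",                    # arithmetic
    "bulk_flag": "{% if quantity >= 3 %}bulk{% else %}single{% endif %}",  # conditional logic
}

for name, tpl in templates.items():
    print(name, "=", Template(tpl).render(**row))
```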
### 🔍 Validation Columns
Validation columns check generated content against rules and return structured pass/fail results.

Built-in validation types:

**Code validation** runs generated Python or SQL through a linter and reports any issues found.

**Local callable validation** accepts a Python function directly when using Data Designer as a library.

**Remote validation** sends data to HTTP endpoints for validation-as-a-service. Useful for linters, security scanners, or proprietary systems.
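A sketch of what a local callable validator could look like; the exact function signature and return shape Data Designer expects are not shown in this doc, so treat both as assumptions and check the code reference:

```python
# Sketch of a local callable validator: a plain Python function that takes
# a generated value and returns a pass/fail result. The exact signature and
# return shape expected by Data Designer are assumptions here.
def validate_review_length(value: str) -> dict:
    """Pass reviews between 20 and 500 characters."""
    ok = 20 <= len(value) <= 500
    return {
        "is_valid": ok,
        "reason": None if ok else f"length {len(value)} out of range",
    }
```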
### 🌱 Seed Dataset Columns
Seed dataset columns bootstrap generation from existing data. Provide a real dataset, and those columns become available as context for generating new synthetic data.

Typical pattern: use seed data for one part of your schema (real product names and categories), then generate synthetic fields around it (customer reviews, purchase histories, ratings). The seed data provides realism and constraints; generated columns add volume and variation.
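The pattern, sketched with plain `pandas` and a stand-in for the LLM step (the actual wiring happens declaratively through column configs):

```python
# Conceptual sketch of the seed pattern: real columns anchor each record,
# synthetic columns are generated around them. The "generation" below is
# faked with a placeholder function; Data Designer wires this up declaratively.
import pandas as pd

seed = pd.DataFrame({
    "product_name": ["NoiseAway headphones", "HydraFlask 1L"],
    "category": ["audio", "outdoor"],
})

def fake_llm_review(product: str) -> str:  # stand-in for an LLM-Text column
    return f"I've used the {product} daily for a month and it holds up well."

seed["customer_review"] = seed["product_name"].map(fake_llm_review)
print(seed)
```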
## Shared Column Properties
Every column configuration inherits from `SingleColumnConfig` with these standard properties:
### `name`
The column's identifier: it must be unique within your configuration, is used in Jinja2 references, and becomes the column name in the output DataFrame. Choose descriptive names: `user_review` > `col_17`.
### `drop`
Boolean flag (default: `False`) controlling whether the column appears in the final output. Setting `drop=True` generates the column (available as a dependency) but excludes it from the final output.
**When to drop columns:**

- Intermediate calculations that feed expressions but aren't meaningful standalone
- Context columns used only for LLM prompt templates
- Validation results that are useful during development but unwanted in production

Dropped columns participate fully in generation and the dependency graph—they're simply filtered out at the end.
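A sketch of the pattern using `LLMTextColumnConfig` from above; the `prompt` field name and the import (omitted here) are assumptions for this sketch; see the code reference for the exact API:

```python
# Sketch of the drop pattern: generate a helper column for prompting, then
# exclude it from the final dataset. The `name`/`drop` fields are the shared
# properties described above; the `prompt` field name is an assumption.
persona = LLMTextColumnConfig(
    name="persona",
    prompt="Invent a one-line customer persona for an electronics store.",
    drop=True,  # usable in templates below, filtered from the final output
)

review = LLMTextColumnConfig(
    name="review",
    prompt="Write a product review in the voice of: {{ persona }}",
)
```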
### `column_type`
Literal string identifying the column type: `"sampler"`, `"llm-text"`, `"expression"`, etc. It's set automatically by each configuration class and serves as Pydantic's discriminator for deserialization.

You rarely set this manually—instantiating `LLMTextColumnConfig` automatically sets `column_type="llm-text"`. Serialization is reversible: save to YAML, load later, and Pydantic reconstructs the exact objects.
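The mechanics are standard Pydantic discriminated unions; the stand-in classes below illustrate the round trip without claiming to be Data Designer's own:

```python
# Conceptual illustration of the discriminated-union mechanics behind the
# round trip, using stand-in classes (not Data Designer's own configs).
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class SamplerCfg(BaseModel):
    column_type: Literal["sampler"] = "sampler"
    name: str

class LLMTextCfg(BaseModel):
    column_type: Literal["llm-text"] = "llm-text"
    name: str

AnyColumn = Annotated[Union[SamplerCfg, LLMTextCfg], Field(discriminator="column_type")]

data = {"column_type": "llm-text", "name": "review"}  # e.g. loaded from YAML
cfg = TypeAdapter(AnyColumn).validate_python(data)
assert isinstance(cfg, LLMTextCfg)  # the discriminator picked the right class
```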
### `required_columns`
Computed property listing the columns that must be generated before this one. The framework derives this automatically:

- For LLM/Expression columns: extracted from Jinja2 template `{{ variables }}`
- For Validation columns: explicitly listed target columns
- For Sampler columns with conditional parameters: columns referenced in conditions

You read this property for introspection but never set it—it's always computed from the configuration details.
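For template-based columns, dependency extraction can be illustrated with Jinja2's own introspection utilities (this shows the concept, not Data Designer's internal implementation):

```python
# Conceptual illustration: extracting undeclared variables from a Jinja2
# template, which is how template-based columns can derive dependencies.
# Uses plain jinja2, not Data Designer's internal code.
from jinja2 import Environment, meta

env = Environment()
template_src = "Summarize {{ product_name }} for a {{ customer_persona }}."
ast = env.parse(template_src)
print(meta.find_undeclared_variables(ast))  # {'product_name', 'customer_persona'}
```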
### `side_effect_columns`
Computed property listing columns created implicitly alongside the primary column. Currently, only LLM columns produce side effects (reasoning trace columns like `{name}__reasoning_trace` when models use extended thinking).

For detailed information on each column type, refer to the [column configuration code reference](../code_reference/column_configs.md).

docs/concepts/plugins.md

Whitespace-only changes.

docs/css/mkdocstrings.css

Lines changed: 8 additions & 0 deletions
@@ -70,3 +70,11 @@ div.doc-contents:not(.first) {
 	translate: calc(var(--tree-thickness) * -1) calc(var(--tree-thickness) * -1);
 }
 }
+
+.doc-symbol-toc.doc-symbol-module::after {
+  content: "module";
+}
+
+.doc-symbol-toc.doc-symbol-method::after {
+  content: "method";
+}

docs/css/style.css

Lines changed: 14 additions & 0 deletions
@@ -96,6 +96,20 @@ body.show-toc .md-sidebar.md-sidebar--secondary {
 display: block !important;
 }
 
+/* Add color to TOC links on Concepts pages */
+body.show-toc .md-sidebar--secondary .md-nav__link {
+  transition: color 0.2s ease;
+}
+
+body.show-toc .md-sidebar--secondary .md-nav__link:hover {
+  color: #76B900 !important;
+}
+
+body.show-toc .md-sidebar--secondary .md-nav__link--active {
+  color: #76B900 !important;
+  font-weight: 500;
+}
+
 /* Hide footer */
 .md-footer {
 	display: none !important;

docs/index.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# 🎨 NeMo Data Designer Library

[![GitHub](https://img.shields.io/badge/github-repo-952fc6?logo=github)](https://github.com/NVIDIA-NeMo/DataDesigner) [![License](https://img.shields.io/badge/License-Apache_2.0-0074df.svg)](https://opensource.org/licenses/Apache-2.0) [![NeMo Microservices](https://img.shields.io/badge/NeMo-Microservices-76b900)](https://docs.nvidia.com/nemo/microservices/latest/index.html)

👋 Welcome to the Data Designer community! We're excited to have you here.

Data Designer is a **general framework** for generating **high-quality** synthetic data **from scratch** or using your own **seed data** as a starting point for domain-grounded data generation.

## Why Data Designer?

Generating high-quality synthetic data requires much more than iteratively calling an LLM.

Data Designer is **purpose-built** to support large-scale, high-quality data generation, including:

* **Diversity** – statistical distributions and variety that reflect real-world data patterns, not repetitive LLM outputs
* **Correlations** – meaningful relationships between fields that LLMs cannot maintain across independent calls
* **Steerability** – flexible control over data characteristics throughout the generation process
* **Validation** – automated quality checks and verification that data meets specifications
* **Reproducibility** – shareable and reproducible generation workflows

## How does it work?

Data Designer helps you create datasets through an intuitive, **iterative** process:

1. **⚙️ Configure** your model settings
    - Bring your own OpenAI-compatible model providers and models
    - Or use the default model providers and models to get started quickly
    - Learn more by reading the [model configuration docs](does-not-exist.md)
2. **🏗️ Design** your dataset
    - Iteratively design your dataset, column by column
    - Leverage tools like statistical samplers and LLMs to generate a variety of data types
    - Learn more by reading the [column docs](concepts/columns.md) and checking out the [tutorial notebooks](notebooks/1-the-basics.ipynb)
3. **🔁 Preview** your results and iterate
    - Generate a preview dataset stored in memory for fast iteration
    - Inspect sample records and analysis results to refine your configuration
    - Try it for yourself by running the [tutorial notebooks](notebooks/1-the-basics.ipynb)
4. **🖼️ Create** your dataset
    - Generate your full dataset and save results to disk
    - Access the generated dataset and associated artifacts for downstream use
    - Give it a try by running the [tutorial notebooks](notebooks/2-create-your-dataset.ipynb)!

## Library and Microservice

Data Designer is available as both an open-source library and a NeMo microservice.

* **Open-source Library**: Purpose-built for flexibility and customization, prioritizing UX excellence, modularity, and extensibility.
* **NeMo Microservice**: An enterprise-grade solution that offers a seamless transition from the library, allowing you to leverage other NeMo microservices and generate datasets at scale. See the [microservice docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html) for more details.

docs/installation.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
Installing Data Designer is as simple as:

=== "pip"

    ```bash
    pip install data-designer
    ```

=== "uv"

    ```bash
    uv add data-designer
    ```

## Development Installation

To install the latest development version from the GitHub repository:

=== "pip"

    ```bash
    pip install 'git+https://github.com/NVIDIA-NeMo/DataDesigner@main'
    ```

=== "uv"

    ```bash
    uv add 'git+https://github.com/NVIDIA-NeMo/DataDesigner@main'
    ```

docs/js/toc-toggle.js

Lines changed: 7 additions & 4 deletions
@@ -4,14 +4,17 @@ if (typeof document$ !== "undefined") {
   // Check if this is a Code Reference page (contains mkdocstrings content)
   const isCodeReferencePage = document.querySelector(".doc.doc-contents");
 
-  if (isCodeReferencePage) {
-    // Show TOC for Code Reference pages by adding class to body
+  // Check if this is a Concepts page (URL contains /concepts/)
+  const isConceptsPage = window.location.pathname.includes("/concepts/");
+
+  if (isCodeReferencePage || isConceptsPage) {
+    // Show TOC for Code Reference and Concepts pages by adding class to body
     document.body.classList.add("show-toc");
-    console.log("Code Reference page detected - showing TOC");
+    console.log("Code Reference or Concepts page detected - showing TOC");
   } else {
     // Hide TOC for all other pages by removing class from body
     document.body.classList.remove("show-toc");
-    console.log("Non-Code Reference page - hiding TOC");
+    console.log("Non-Code Reference/Concepts page - hiding TOC");
   }
 });
