Commit 48fdc8c

docs: add initial plugin documentation (#107)
* add docstrings
* add analysis modules
* include toc for plugins section
* add plugin docs
* remove scope creep
* Update docs/plugins/example.md

  Co-authored-by: Nabin Mulepati <[email protected]>
* address feedback

---------

Co-authored-by: Nabin Mulepati <[email protected]>
1 parent a02f7e0 commit 48fdc8c

12 files changed (+631 -7 lines changed)

docs/code_reference/analysis.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# Analysis

The `analysis` modules provide tools for profiling and analyzing generated datasets. They include statistics tracking, column profiling, and reporting capabilities.

## Column Statistics

Column statistics are automatically computed for every column after generation. They provide basic metrics specific to the column type. For example, LLM columns track token usage statistics, sampler columns track distribution information, and validation columns track validation success rates.

The classes below are result objects that store the computed statistics for each column type and provide methods for formatting these results for display in reports.

::: data_designer.config.analysis.column_statistics

## Column Profilers

Column profilers are optional analysis tools that provide deeper insights into specific column types. Currently, the only column profiler available is the Judge Score Profiler.

The classes below are result objects that store the computed profiler results and provide methods for formatting these results for display in reports.

::: data_designer.config.analysis.column_profilers

## Dataset Profiler

The [DatasetProfilerResults](#data_designer.config.analysis.dataset_profiler.DatasetProfilerResults) class contains complete profiling results for a generated dataset. It aggregates column-level statistics, metadata, and profiler results, and provides methods to:

- Compute dataset-level metrics (completion percentage, column type summary)
- Filter statistics by column type
- Generate formatted analysis reports via the `to_report()` method

Reports can be displayed in the console or exported to HTML/SVG formats.

::: data_designer.config.analysis.dataset_profiler

docs/concepts/plugins.md

Whitespace-only changes.

docs/js/toc-toggle.js

Lines changed: 7 additions & 4 deletions
@@ -7,14 +7,17 @@ if (typeof document$ !== "undefined") {
     // Check if this is a Concepts page (URL contains /concepts/)
     const isConceptsPage = window.location.pathname.includes("/concepts/");
 
-    if (isCodeReferencePage || isConceptsPage) {
-      // Show TOC for Code Reference and Concepts pages by adding class to body
+    // Check if this is a Plugins page (URL contains /plugins/)
+    const isPluginsPage = window.location.pathname.includes("/plugins/");
+
+    if (isCodeReferencePage || isConceptsPage || isPluginsPage) {
+      // Show TOC for Code Reference, Concepts, and Plugins pages by adding class to body
       document.body.classList.add("show-toc");
-      console.log("Code Reference or Concepts page detected - showing TOC");
+      console.log("Code Reference, Concepts, or Plugins page detected - showing TOC");
     } else {
       // Hide TOC for all other pages by removing class from body
       document.body.classList.remove("show-toc");
-      console.log("Non-Code Reference/Concepts page - hiding TOC");
+      console.log("Non-Code Reference/Concepts/Plugins page - hiding TOC");
     }
   });
 } else {

docs/plugins/available.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# 🚧 Coming Soon

This page will list available Data Designer plugins. Stay tuned!

docs/plugins/example.md

Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
!!! warning "Experimental Feature"
    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions).


# Example Plugin: Index Multiplier

In this guide, we will build a simple plugin that generates values by multiplying the row index by a user-specified multiplier. Admittedly, not the most useful plugin, but it demonstrates the required steps 😜.

A Data Designer plugin is implemented as a Python package with three main components:

1. **Configuration Class**: Defines the parameters users can configure
2. **Task Class**: Contains the core implementation of the plugin
3. **Plugin Object**: Connects the config and task classes to make the plugin discoverable

Let's build the `data-designer-index-multiplier` plugin step by step.

## Step 1: Create a Python package

Data Designer plugins are implemented as Python packages. We recommend using a standard structure for your plugin package.

For example, here is the structure of a `data-designer-index-multiplier` plugin:

```
data-designer-index-multiplier/
├── pyproject.toml
└── src/
    └── data_designer_index_multiplier/
        ├── __init__.py
        └── plugin.py
```

## Step 2: Create the config class

The configuration class defines what parameters users can set when using your plugin. For column generator plugins, it must inherit from [SingleColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SingleColumnConfig) and include a [discriminator field](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions).

```python
from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig


class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"
```

**Key points:**

- The `column_type` field must be a `Literal` type with a string default
- This value uniquely identifies your plugin (use kebab-case)
- Add any custom parameters your plugin needs (here: `multiplier`)
- `SingleColumnConfig` is a Pydantic model, so you can leverage all of Pydantic's validation features

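To make the role of the discriminator concrete, here is a minimal sketch of how a unique `column_type` string lets a loader pick the right config class. It deliberately uses plain dataclasses instead of Pydantic or Data Designer, and every name in it (`SamplerConfig`, `IndexMultiplierConfig`, `CONFIG_BY_TYPE`, `parse_column`) is hypothetical. Pydantic's discriminated unions do this dispatch for you; this is only an illustration of the idea, not how Data Designer is implemented.

```python
from dataclasses import dataclass


@dataclass
class SamplerConfig:
    """Hypothetical stand-in for a built-in column config."""

    name: str
    column_type: str = "sampler"


@dataclass
class IndexMultiplierConfig:
    """Hypothetical stand-in for the plugin's column config."""

    name: str
    multiplier: int = 2
    column_type: str = "index-multiplier"


# Registry keyed by the discriminator value; a duplicate value would
# silently overwrite an entry, which is why the value must be unique.
CONFIG_BY_TYPE = {cls.column_type: cls for cls in (SamplerConfig, IndexMultiplierConfig)}


def parse_column(raw: dict):
    """Build the config object selected by raw["column_type"]."""
    try:
        cls = CONFIG_BY_TYPE[raw["column_type"]]
    except KeyError:
        raise ValueError(f"unknown column_type: {raw.get('column_type')!r}") from None
    return cls(**raw)


parsed = parse_column(
    {"name": "multiplied-index", "column_type": "index-multiplier", "multiplier": 5}
)
print(type(parsed).__name__, parsed.multiplier)
```

The sketch suggests why the value should be unique across plugins: two configs sharing one `column_type` could not be told apart when a serialized configuration is loaded.
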
## Step 3: Create the task class

The task class implements the actual business logic of the plugin. For column generator plugins, it inherits from [ColumnGenerator](../code_reference/column_generators.md#data_designer.engine.column_generators.generators.base.ColumnGenerator) and must implement a `metadata` static method and a `generate` method:

```python
import logging

import pandas as pd

from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)

# Data Designer uses the standard Python logging module for logging
logger = logging.getLogger(__name__)


class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
            required_resources=None,
        )

    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        """Generate the column data.

        Args:
            data: The current DataFrame being built

        Returns:
            The DataFrame with the new column added
        """
        logger.info(
            f"Generating column {self.config.name} "
            f"with multiplier {self.config.multiplier}"
        )

        # Access config via self.config
        data[self.config.name] = data.index * self.config.multiplier

        return data
```

**Key points:**

- Generic type `ColumnGenerator[IndexMultiplierColumnConfig]` connects the task to its config
- `metadata()` describes your generator and its requirements
- `generation_strategy` can be `FULL_COLUMN` or `CELL_BY_CELL`
- You have access to the configuration parameters via `self.config`
- `required_resources` lists any required resources (models, artifact storages, etc.). This parameter will evolve in the near future, so keeping it as `None` is safe for now. That said, if your task will use the model registry, adding `data_designer.engine.resources.ResourceType.MODEL_REGISTRY` will enable automatic model health checking for your column generation task.

!!! info "Understanding generation_strategy"
    The `generation_strategy` specifies how the column generator will generate data.

    - **`FULL_COLUMN`**: Generates the full column (at the batch level) in a single call to `generate`
        - `generate` must take as input a `pd.DataFrame` with all previous columns and return a `pd.DataFrame` with the generated column appended

    - **`CELL_BY_CELL`**: Generates one cell at a time
        - `generate` must take as input a `dict` with key/value pairs for all previous columns and return a `dict` with an additional key/value for the generated cell
        - Supports concurrent workers via a `max_parallel_requests` parameter on the configuration

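The `CELL_BY_CELL` contract described above (one `dict` in, the same `dict` plus one new key out) can be sketched without Data Designer at all. The snippet below is only an illustration under stated assumptions: `generate_cell`, `run_cell_by_cell`, and the `row_id` column are hypothetical names, and the thread pool is a stand-in for whatever scheduling the engine actually performs when `max_parallel_requests` is set.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_cell(data: dict, multiplier: int = 2) -> dict:
    """CELL_BY_CELL-shaped function: one row in, the same row plus one new cell out."""
    # Assumption: an earlier column named "row_id" exists in every row.
    return {**data, "multiplied": data["row_id"] * multiplier}


def run_cell_by_cell(rows: list[dict], max_parallel_requests: int = 4) -> list[dict]:
    """Apply generate_cell to each row using a bounded pool of workers."""
    with ThreadPoolExecutor(max_workers=max_parallel_requests) as pool:
        # Executor.map returns results in input order, even if workers
        # finish out of order.
        return list(pool.map(generate_cell, rows))


rows = [{"row_id": i, "category": c} for i, c in enumerate("ABC")]
results = run_cell_by_cell(rows)
print(results)
```

Because each call sees only one row, this shape suits work that is independent per record (for example, one model request per row), whereas `FULL_COLUMN` suits vectorized operations such as the index multiplication in this guide.
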
## Step 4: Create the plugin object

Create a `Plugin` object that makes the plugin discoverable and connects the task and config classes.

```python
from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    task_cls=IndexMultiplierColumnGenerator,
    config_cls=IndexMultiplierColumnConfig,
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
```

### Complete plugin code

Pulling it all together, here is the complete plugin code for `src/data_designer_index_multiplier/plugin.py`:

```python
import logging
from typing import Literal

import pandas as pd

from data_designer.config.column_configs import SingleColumnConfig
from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)
from data_designer.plugins import Plugin, PluginType

# Data Designer uses the standard Python logging module for logging
logger = logging.getLogger(__name__)


class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"


class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
            required_resources=None,
        )

    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        """Generate the column data.

        Args:
            data: The current DataFrame being built

        Returns:
            The DataFrame with the new column added
        """
        logger.info(
            f"Generating column {self.config.name} "
            f"with multiplier {self.config.multiplier}"
        )

        # Access config via self.config
        data[self.config.name] = data.index * self.config.multiplier

        return data


# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    task_cls=IndexMultiplierColumnGenerator,
    config_cls=IndexMultiplierColumnConfig,
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
```

## Step 5: Package your plugin

Create a `pyproject.toml` file to define your package and register the entry point:

```toml
[project]
name = "data-designer-index-multiplier"
version = "1.0.0"
description = "Data Designer index multiplier plugin"
requires-python = ">=3.10"
dependencies = [
    "data-designer",
]

# Register this plugin via entry points
[project.entry-points."data_designer.plugins"]
index-multiplier = "data_designer_index_multiplier.plugin:plugin"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/data_designer_index_multiplier"]
```

!!! info "Entry Point Registration"
    Plugins are discovered automatically using [Python entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). It is important to register your plugin as an entry point under the `data_designer.plugins` group.

    The entry point format is:
    ```toml
    [project.entry-points."data_designer.plugins"]
    <entry-point-name> = "<module.path>:<plugin-instance-name>"
    ```

## Step 6: Use your plugin

Install your plugin in editable mode for testing:

```bash
# From the plugin directory
uv pip install -e .
```

Once installed, your plugin works just like built-in column types:

```python
from data_designer_index_multiplier.plugin import IndexMultiplierColumnConfig

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    SamplerColumnConfig,
)

data_designer = DataDesigner()
builder = DataDesignerConfigBuilder()

# Add a regular column
builder.add_column(
    SamplerColumnConfig(
        name="category",
        sampler_type="category",
        params=CategorySamplerParams(values=["A", "B", "C"]),
    )
)

# Add your custom plugin column
# The column name matches the header shown in the output below
builder.add_column(
    IndexMultiplierColumnConfig(
        name="multiplied-index",
        multiplier=5,
    )
)

# Generate data
results = data_designer.create(builder, num_records=10)
print(results.load_dataset())
```

Output:
```
  category  multiplied-index
0        B                 0
1        A                 5
2        C                10
3        A                15
4        B                20
...
```

That's it! You have now created and used your first Data Designer plugin. The last step is to package your plugin and share it with the community 🚀
