
Commit 6caa6e3

add plugin docs
1 parent 85158f7 commit 6caa6e3

6 files changed

+361
-0
lines changed

docs/concepts/plugins.md

Whitespace-only changes.

docs/plugins/available.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# 🚧 Coming Soon

This page will list available Data Designer plugins. Stay tuned!

docs/plugins/example.md

Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
!!! warning "Experimental Feature"
    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please [open an issue on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose).


# Example Plugin: Index Multiplier

In this guide, we will build a simple plugin that generates values by multiplying the row index by a user-specified multiplier. Admittedly, not the most useful plugin, but it demonstrates the required steps 😜.

A Data Designer plugin is implemented as a Python package with three main components:

1. **Configuration Class**: Defines the parameters users can configure
2. **Task Class**: Contains the core implementation of the plugin
3. **Plugin Object**: Connects the config and task classes to make the plugin discoverable

Let's build the `data-designer-index-multiplier` plugin step by step.

## Step 1: Create a Python package

Data Designer plugins are implemented as Python packages. We recommend using a standard structure for your plugin package.

For example, here is the structure of a `data-designer-index-multiplier` plugin:

```
data-designer-index-multiplier/
├── pyproject.toml
└── src/
    └── data_designer_index_multiplier/
        ├── __init__.py
        └── plugin.py
```

## Step 2: Create the config class

The configuration class defines what parameters users can set when using your plugin. For column generator plugins, it must inherit from [SingleColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SingleColumnConfig) and include a [discriminator field](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions).

```python
from typing import Literal

from data_designer.config.column_configs import SingleColumnConfig


class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"
```

**Key points:**

- The `column_type` field must be a `Literal` type with a string default
- This value uniquely identifies your plugin (use kebab-case)
- Add any custom parameters your plugin needs (here: `multiplier`)
- `SingleColumnConfig` is a Pydantic model, so you can leverage all of Pydantic's validation features

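To see why the discriminator value matters, here is a minimal stand-alone sketch of how a `column_type` string can select the class that parses a raw config. It uses plain dataclasses instead of Pydantic or Data Designer, and `SamplerConfig`, `IndexMultiplierConfig`, `CONFIG_REGISTRY`, and `parse_column_config` are hypothetical names for illustration only. In the real library this dispatch is presumably handled by Pydantic's discriminated unions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SamplerConfig:
    """Hypothetical stand-in for a built-in column config."""

    name: str
    column_type: str = "sampler"


@dataclass
class IndexMultiplierConfig:
    """Hypothetical stand-in for the plugin's config class."""

    name: str
    multiplier: int = 2
    column_type: str = "index-multiplier"


# Map each discriminator value to the class that should parse it.
CONFIG_REGISTRY = {
    "sampler": SamplerConfig,
    "index-multiplier": IndexMultiplierConfig,
}


def parse_column_config(raw: dict[str, Any]):
    """Pick the config class from the raw dict's column_type value."""
    column_type = raw.get("column_type")
    if column_type not in CONFIG_REGISTRY:
        raise ValueError(f"Unknown column_type: {column_type!r}")
    return CONFIG_REGISTRY[column_type](**raw)


config = parse_column_config(
    {"name": "multiplied-index", "column_type": "index-multiplier", "multiplier": 5}
)
print(type(config).__name__, config.multiplier)
```

Because every config class carries a distinct `column_type`, two plugins that reuse the same value would be ambiguous, which is why the value needs to be unique.
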
## Step 3: Create the task class

The task class implements the actual business logic of the plugin. For column generator plugins, it inherits from [ColumnGenerator](../code_reference/column_generators.md#data_designer.engine.column_generators.generators.base.ColumnGenerator) and must implement a `metadata` static method and a `generate` method:


```python
import logging

import pandas as pd

from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)

# Data Designer uses the standard Python logging module for logging
logger = logging.getLogger(__name__)


class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
            required_resources=None,
        )

    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        """Generate the column data.

        Args:
            data: The current DataFrame being built

        Returns:
            The DataFrame with the new column added
        """
        logger.info(
            f"Generating column {self.config.name} "
            f"with multiplier {self.config.multiplier}"
        )

        # Access config via self.config
        data[self.config.name] = data.index * self.config.multiplier

        return data
```

**Key points:**

- Generic type `ColumnGenerator[IndexMultiplierColumnConfig]` connects the task to its config
- `metadata()` describes your generator and its requirements
- `generation_strategy` can be `FULL_COLUMN` or `CELL_BY_CELL`
- `required_resources` lists any required resources (models, artifact storage, etc.). This parameter will change in the future, so keeping it as `None` is safe for now.
- Access configuration parameters via `self.config`

!!! info "Understanding generation_strategy"
    The `generation_strategy` specifies how the column generator will generate data.

    - **`FULL_COLUMN`**: Generates the entire column at once
        - `generate` must take a `pd.DataFrame` as input and return a `pd.DataFrame`

    - **`CELL_BY_CELL`**: Generates one cell at a time
        - `generate` must take a `dict` as input and return a `dict`
        - Supports concurrent workers via a `max_parallel_requests` parameter on the configuration

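As a rough illustration of the `CELL_BY_CELL` contract, the sketch below applies a dict-in/dict-out function to each row through a thread pool. It is stdlib-only and does not use Data Designer's classes: `generate_cell`, `fan_out`, `MULTIPLIER`, and the `max_workers` value are assumptions made for illustration, not the engine's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

MULTIPLIER = 5  # hypothetical config value


def generate_cell(row: dict) -> dict:
    """Take one row as a dict and return it with the new cell added."""
    return {**row, "multiplied": row["index"] * MULTIPLIER}


def fan_out(rows: list[dict], max_workers: int) -> list[dict]:
    """Run generate_cell on every row concurrently, preserving row order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_cell, rows))


rows = [{"index": i} for i in range(4)]
results = fan_out(rows, max_workers=2)
print(results)
```

Since each call only sees a single row, a cell-by-cell generator cannot rely on values from other rows. Logic that needs the whole column, such as the index multiplier above, fits `FULL_COLUMN` better.
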
## Step 4: Create the plugin object

Create a `Plugin` object that makes the plugin discoverable and connects the task and config classes.

```python
from data_designer.plugins import Plugin, PluginType

# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    task_cls=IndexMultiplierColumnGenerator,
    config_cls=IndexMultiplierColumnConfig,
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
```

### Complete plugin code

Pulling it all together, here is the complete plugin code for `src/data_designer_index_multiplier/plugin.py`:

```python
import logging
from typing import Literal

import pandas as pd

from data_designer.config.column_configs import SingleColumnConfig
from data_designer.engine.column_generators.generators.base import (
    ColumnGenerator,
    GenerationStrategy,
    GeneratorMetadata,
)
from data_designer.plugins import Plugin, PluginType

# Data Designer uses the standard Python logging module for logging
logger = logging.getLogger(__name__)


class IndexMultiplierColumnConfig(SingleColumnConfig):
    """Configuration for the index multiplier column generator."""

    # Configurable parameter for this plugin
    multiplier: int = 2

    # Required: discriminator field with a unique Literal type
    # This value identifies your plugin and becomes its column_type
    column_type: Literal["index-multiplier"] = "index-multiplier"


class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
    @staticmethod
    def metadata() -> GeneratorMetadata:
        """Define metadata about this generator."""
        return GeneratorMetadata(
            name="index-multiplier",
            description="Generates values by multiplying the row index by a user-specified multiplier",
            generation_strategy=GenerationStrategy.FULL_COLUMN,
            required_resources=None,
        )

    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
        """Generate the column data.

        Args:
            data: The current DataFrame being built

        Returns:
            The DataFrame with the new column added
        """
        logger.info(
            f"Generating column {self.config.name} "
            f"with multiplier {self.config.multiplier}"
        )

        # Access config via self.config
        data[self.config.name] = data.index * self.config.multiplier

        return data


# Plugin instance - this is what gets loaded via entry point
plugin = Plugin(
    task_cls=IndexMultiplierColumnGenerator,
    config_cls=IndexMultiplierColumnConfig,
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
```

## Step 5: Package your plugin

Create a `pyproject.toml` file to define your package and register the entry point:

```toml
[project]
name = "data-designer-index-multiplier"
version = "1.0.0"
description = "Data Designer index multiplier plugin"
requires-python = ">=3.10"
dependencies = [
    "data-designer",
]

# Register this plugin via entry points
[project.entry-points."data_designer.plugins"]
index-multiplier = "data_designer_index_multiplier.plugin:plugin"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/data_designer_index_multiplier"]
```

!!! info "Entry Point Registration"
    Plugins are discovered automatically using [Python entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). It is important to register your plugin as an entry point under the `data_designer.plugins` group.

    The entry point format is:
    ```toml
    [project.entry-points."data_designer.plugins"]
    <entry-point-name> = "<module.path>:<plugin-instance-name>"
    ```

## Step 6: Use your plugin

Install your plugin in editable mode for testing:

```bash
# From the plugin directory
uv pip install -e .
```

Once installed, your plugin works just like built-in column types:

```python
from data_designer_index_multiplier.plugin import IndexMultiplierColumnConfig

from data_designer.essentials import (
    CategorySamplerParams,
    DataDesigner,
    DataDesignerConfigBuilder,
    SamplerColumnConfig,
)

data_designer = DataDesigner()
builder = DataDesignerConfigBuilder()

# Add a regular column
builder.add_column(
    SamplerColumnConfig(
        name="category",
        sampler_type="category",
        params=CategorySamplerParams(values=["A", "B", "C"]),
    )
)

# Add your custom plugin column
builder.add_column(
    IndexMultiplierColumnConfig(
        name="multiplied-index",
        multiplier=5,
    )
)

# Generate data
results = data_designer.create(builder, num_records=10)
print(results.load_dataset())
```

Output:
```
  category  multiplied-index
0        B                 0
1        A                 5
2        C                10
3        A                15
4        B                20
...
```

That's it! You have now created and used your first Data Designer plugin. The remaining step is to publish your package and share it with the community 🚀

docs/plugins/overview.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# Data Designer Plugins

!!! warning "Experimental Feature"
    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please [open an issue on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose).

## What are plugins?

Plugins are Python packages that extend Data Designer's capabilities without modifying the core library. Similar to [VS Code extensions](https://marketplace.visualstudio.com/vscode) and [Pytest plugins](https://docs.pytest.org/en/stable/reference/plugin_list.html), the plugin system empowers you to build specialized extensions for your specific use cases and share them with the community.

**Current capabilities**: Data Designer currently supports plugins for column generators (the column types you pass to the config builder's [add_column](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method).

**Coming soon**: Plugin support for processors, validators, and more!

## How do you use plugins?

A Data Designer plugin is just a Python package configured with an [entry point](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata) that points to a Data Designer `Plugin` object. Using a plugin is as simple as installing the package:

```bash
pip install data-designer-{plugin-name}
```

Once installed, plugins are automatically discovered and ready to use. See the [example plugin](example.md) for a complete walkthrough.

## How do you create plugins?

Creating a plugin involves three main steps:

### 1. Implement the Plugin Components

- Create a task class inheriting from `ColumnGenerator`
- Create a config class inheriting from `SingleColumnConfig`
- Instantiate a `Plugin` object connecting them

### 2. Package Your Plugin

- Set up a Python package with `pyproject.toml`
- Register your plugin using entry points
- Define dependencies (including `data-designer`)

### 3. Share Your Plugin

- Publish to PyPI or another package index
- Share with the community!

**Ready to get started?** See the [Example Plugin](example.md) for a complete walkthrough!

mkdocs.yml

Lines changed: 5 additions & 0 deletions
@@ -23,13 +23,18 @@ nav:
     - Structured Outputs and Jinja Expressions: notebooks/2-structured-outputs-and-jinja-expressions.ipynb
     - Seeding with an External Dataset: notebooks/3-seeding-with-a-dataset.ipynb
     - Providing Images as Context: notebooks/4-providing-images-as-context.ipynb
+  - Plugins:
+    - Overview: plugins/overview.md
+    - Example Plugin: plugins/example.md
+    - Available Plugin List: plugins/available.md
   - Code Reference:
     - models: code_reference/models.md
     - column_configs: code_reference/column_configs.md
     - config_builder: code_reference/config_builder.md
     - data_designer_config: code_reference/data_designer_config.md
     - sampler_params: code_reference/sampler_params.md
     - validator_params: code_reference/validator_params.md
+    - analysis: code_reference/analysis.md
 
 theme:
   name: material

src/data_designer/engine/dataset_builders/column_wise_builder.py

Lines changed: 2 additions & 0 deletions
@@ -171,6 +171,8 @@ def _run_cell_by_cell_generator(self, generator: ColumnGenerator) -> None:
         max_workers = MAX_CONCURRENCY_PER_NON_LLM_GENERATOR
         if isinstance(generator, WithLLMGeneration):
             max_workers = generator.inference_parameters.max_parallel_requests
+        elif hasattr(generator.config, "max_parallel_requests"):
+            max_workers = generator.config.max_parallel_requests
         self._fan_out_with_threads(generator, max_workers=max_workers)
 
     def _run_full_column_generator(self, generator: ColumnGenerator) -> None:
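
The fallback order this hunk introduces can be sketched in isolation. The classes below are hypothetical stand-ins rather than the engine's real types, `resolve_max_workers` is an illustrative helper name, and the default of 4 is an assumed placeholder because the actual value of `MAX_CONCURRENCY_PER_NON_LLM_GENERATOR` is not shown in this diff.

```python
from types import SimpleNamespace

# Assumed placeholder; the real constant's value is not shown in this diff.
MAX_CONCURRENCY_PER_NON_LLM_GENERATOR = 4


class WithLLMGeneration:
    """Hypothetical stand-in for the LLM generator mixin."""

    def __init__(self, max_parallel_requests: int) -> None:
        self.inference_parameters = SimpleNamespace(
            max_parallel_requests=max_parallel_requests
        )
        self.config = SimpleNamespace()


class PluginGenerator:
    """Hypothetical stand-in for a non-LLM generator with a config object."""

    def __init__(self, **config_fields: int) -> None:
        self.config = SimpleNamespace(**config_fields)


def resolve_max_workers(generator) -> int:
    """Mirror the branch order in _run_cell_by_cell_generator."""
    max_workers = MAX_CONCURRENCY_PER_NON_LLM_GENERATOR
    if isinstance(generator, WithLLMGeneration):
        max_workers = generator.inference_parameters.max_parallel_requests
    elif hasattr(generator.config, "max_parallel_requests"):
        max_workers = generator.config.max_parallel_requests
    return max_workers


default_workers = resolve_max_workers(PluginGenerator())
config_workers = resolve_max_workers(PluginGenerator(max_parallel_requests=8))
llm_workers = resolve_max_workers(WithLLMGeneration(max_parallel_requests=16))
print(default_workers, config_workers, llm_workers)
```

The `hasattr` check means a plugin opts in simply by declaring a `max_parallel_requests` field on its config class. Configs without that field keep the module-level default.
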
