add plugin docs

johnnygreco · johnnygreco · commit 6caa6e30a3da · 2025-12-09T13:24:50.000-05:00
diff --git a/docs/concepts/plugins.md b/docs/concepts/plugins.md
diff --git a/docs/plugins/available.md b/docs/plugins/available.md
@@ -0,0 +1,3 @@
+# 🚧 Coming Soon
+
+This page will list available Data Designer plugins. Stay tuned!
diff --git a/docs/plugins/example.md b/docs/plugins/example.md
@@ -0,0 +1,306 @@
+!!! warning "Experimental Feature"
+    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please [open an issue on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose).
+
+
+# Example Plugin: Index Multiplier
+
+In this guide, we will build a simple plugin that generates values by multiplying the row index by a user-specified multiplier. Admittedly, not the most useful plugin, but it demonstrates the required steps 😜.
+
+A Data Designer plugin is implemented as a Python package with three main components:
+
+1. **Configuration Class**: Defines the parameters users can configure
+2. **Task Class**: Contains the core implementation of the plugin
+3. **Plugin Object**: Connects the config and task classes to make the plugin discoverable
+
+Let's build the `data-designer-index-multiplier` plugin step by step.
+
+## Step 1: Create a Python package
+
+Data Designer plugins are implemented as Python packages. We recommend using a standard structure for your plugin package.
+
+For example, here is the structure of a `data-designer-index-multiplier` plugin:
+
+```
+data-designer-index-multiplier/
+├── pyproject.toml
+└── src/
+    └── data_designer_index_multiplier/
+        ├── __init__.py
+        └── plugin.py
+```
+
+## Step 2: Create the config class
+
+The configuration class defines what parameters users can set when using your plugin. For column generator plugins, it must inherit from [SingleColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SingleColumnConfig) and include a [discriminator field](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions).
+
+```python
+from typing import Literal
+from data_designer.config.column_configs import SingleColumnConfig
+
+class IndexMultiplierColumnConfig(SingleColumnConfig):
+    """Configuration for the index multiplier column generator."""
+
+    # Configurable parameter for this plugin
+    multiplier: int = 2
+
+    # Required: discriminator field with a unique Literal type
+    # This value identifies your plugin and becomes its column_type
+    column_type: Literal["index-multiplier"] = "index-multiplier"
+```
+
+**Key points:**
+
+- The `column_type` field must be a `Literal` type with a string default
+- This value uniquely identifies your plugin (use kebab-case)
+- Add any custom parameters your plugin needs (here: `multiplier`)
+- `SingleColumnConfig` is a Pydantic model, so you can leverage all of Pydantic's validation features
+
+## Step 3: Create the task class
+
+The task class implements the actual business logic of the plugin. For column generator plugins, it inherits from [ColumnGenerator](../code_reference/column_generators.md#data_designer.engine.column_generators.generators.base.ColumnGenerator) and must implement a `metadata` static method and a `generate` method:
+
+
+```python
+import logging
+import pandas as pd
+
+from data_designer.engine.column_generators.generators.base import (
+    ColumnGenerator,
+    GenerationStrategy,
+    GeneratorMetadata,
+)
+
+# Data Designer uses the standard Python logging module for logging
+logger = logging.getLogger(__name__)
+
+class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
+    @staticmethod
+    def metadata() -> GeneratorMetadata:
+        """Define metadata about this generator."""
+        return GeneratorMetadata(
+            name="index-multiplier",
+            description="Generates values by multiplying the row index by a user-specified multiplier",
+            generation_strategy=GenerationStrategy.FULL_COLUMN,
+            required_resources=None,
+        )
+
+    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
+        """Generate the column data.
+
+        Args:
+            data: The current DataFrame being built
+
+        Returns:
+            The DataFrame with the new column added
+        """
+        logger.info(
+            f"Generating column {self.config.name} "
+            f"with multiplier {self.config.multiplier}"
+        )
+
+        # Access config via self.config
+        data[self.config.name] = data.index * self.config.multiplier
+
+        return data
+```
+
+**Key points:**
+
+- Generic type `ColumnGenerator[IndexMultiplierColumnConfig]` connects the task to its config
+- `metadata()` describes your generator and its requirements
+- `generation_strategy` can be `FULL_COLUMN` or `CELL_BY_CELL`
+- `required_resources` lists any required resources (models, artifact storage, etc.). This parameter will change in the future, so keeping it as `None` is safe for now.
+- Access configuration parameters via `self.config`
+
+!!! info "Understanding generation_strategy"
+    The `generation_strategy` specifies how the column generator will generate data.
+
+    - **`FULL_COLUMN`**: Generates the entire column at once
+        - `generate` must take a `pd.DataFrame` as input and return a `pd.DataFrame`
+
+    - **`CELL_BY_CELL`**: Generates one cell at a time
+        - `generate` must take a `dict` as input and return a `dict`
+        - Supports concurrent workers via a `max_parallel_requests` parameter on the configuration
+
+## Step 4: Create the plugin object
+
+Create a `Plugin` object that makes the plugin discoverable and connects the task and config classes.
+
+```python
+from data_designer.plugins import Plugin, PluginType
+
+# Plugin instance - this is what gets loaded via entry point
+plugin = Plugin(
+    task_cls=IndexMultiplierColumnGenerator,
+    config_cls=IndexMultiplierColumnConfig,
+    plugin_type=PluginType.COLUMN_GENERATOR,
+    emoji="🔌",
+)
+```
+
+### Complete plugin code
+
+Pulling it all together, here is the complete plugin code for `src/data_designer_index_multiplier/plugin.py`:
+
+```python
+import logging
+from typing import Literal
+
+import pandas as pd
+
+from data_designer.config.column_configs import SingleColumnConfig
+from data_designer.engine.column_generators.generators.base import (
+    ColumnGenerator,
+    GenerationStrategy,
+    GeneratorMetadata,
+)
+from data_designer.plugins import Plugin, PluginType
+
+# Data Designer uses the standard Python logging module for logging
+logger = logging.getLogger(__name__)
+
+
+class IndexMultiplierColumnConfig(SingleColumnConfig):
+    """Configuration for the index multiplier column generator."""
+
+    # Configurable parameter for this plugin
+    multiplier: int = 2
+
+    # Required: discriminator field with a unique Literal type
+    # This value identifies your plugin and becomes its column_type
+    column_type: Literal["index-multiplier"] = "index-multiplier"
+
+
+class IndexMultiplierColumnGenerator(ColumnGenerator[IndexMultiplierColumnConfig]):
+    @staticmethod
+    def metadata() -> GeneratorMetadata:
+        """Define metadata about this generator."""
+        return GeneratorMetadata(
+            name="index-multiplier",
+            description="Generates values by multiplying the row index by a user-specified multiplier",
+            generation_strategy=GenerationStrategy.FULL_COLUMN,
+            required_resources=None,
+        )
+
+    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
+        """Generate the column data.
+
+        Args:
+            data: The current DataFrame being built
+
+        Returns:
+            The DataFrame with the new column added
+        """
+        logger.info(
+            f"Generating column {self.config.name} "
+            f"with multiplier {self.config.multiplier}"
+        )
+
+        # Access config via self.config
+        data[self.config.name] = data.index * self.config.multiplier
+
+        return data
+
+
+# Plugin instance - this is what gets loaded via entry point
+plugin = Plugin(
+    task_cls=IndexMultiplierColumnGenerator,
+    config_cls=IndexMultiplierColumnConfig,
+    plugin_type=PluginType.COLUMN_GENERATOR,
+    emoji="🔌",
+)
+```
+
+## Step 5: Package your plugin
+
+Create a `pyproject.toml` file to define your package and register the entry point:
+
+```toml
+[project]
+name = "data-designer-index-multiplier"
+version = "1.0.0"
+description = "Data Designer index multiplier plugin"
+requires-python = ">=3.10"
+dependencies = [
+    "data-designer",
+]
+
+# Register this plugin via entry points
+[project.entry-points."data_designer.plugins"]
+index-multiplier = "data_designer_index_multiplier.plugin:plugin"
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/data_designer_index_multiplier"]
+```
+
+!!! info "Entry Point Registration"
+    Plugins are discovered automatically using [Python entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). Your plugin must be registered as an entry point under the `data_designer.plugins` group; otherwise Data Designer will not find it.
+
+    The entry point format is:
+    ```toml
+    [project.entry-points."data_designer.plugins"]
+    <entry-point-name> = "<module.path>:<plugin-instance-name>"
+    ```
+
+## Step 6: Use your plugin
+
+Install your plugin in editable mode for testing:
+
+```bash
+# From the plugin directory
+uv pip install -e .
+```
+
+Once installed, your plugin works just like built-in column types:
+
+```python
+from data_designer_index_multiplier.plugin import IndexMultiplierColumnConfig
+
+from data_designer.essentials import (
+    CategorySamplerParams,
+    DataDesigner,
+    DataDesignerConfigBuilder,
+    SamplerColumnConfig,
+)
+
+data_designer = DataDesigner()
+builder = DataDesignerConfigBuilder()
+
+# Add a regular column
+builder.add_column(
+    SamplerColumnConfig(
+        name="category",
+        sampler_type="category",
+        params=CategorySamplerParams(values=["A", "B", "C"]),
+    )
+)
+
+# Add your custom plugin column
+builder.add_column(
+    IndexMultiplierColumnConfig(
+        name="multiplied-index",
+        multiplier=5,
+    )
+)
+
+# Generate data
+results = data_designer.create(builder, num_records=10)
+print(results.load_dataset())
+```
+
+Output:
+```
+  category  multiplied-index
+0        B                 0
+1        A                 5
+2        C                10
+3        A                15
+4        B                20
+...
+```
+
+That's it! You have now created and used your first Data Designer plugin. The last step is to publish your package and share it with the community 🚀
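
Note on `docs/plugins/example.md`: the `generation_strategy` admonition says a `CELL_BY_CELL` generator's `generate` takes a `dict` and returns a `dict`, but the walkthrough only shows the `FULL_COLUMN` case. The following is a minimal sketch of that dict-in, dict-out shape, written as a plain function rather than a real `ColumnGenerator` subclass. The names `generate_cell`, `column_name`, `source_column`, and `row_id` are hypothetical. Because a single record carries no DataFrame index, the sketch multiplies an assumed `row_id` field instead.

```python
from typing import Any


def generate_cell(
    record: dict[str, Any],
    *,
    column_name: str,
    source_column: str,
    multiplier: int,
) -> dict[str, Any]:
    """Hypothetical cell-by-cell logic: one record in, one record out."""
    # Copy the record so the caller's dict is not mutated.
    updated = dict(record)
    updated[column_name] = record[source_column] * multiplier
    return updated


# Assumed input: each record already has a "row_id" field.
records = [{"row_id": i} for i in range(3)]
results = [
    generate_cell(r, column_name="multiplied", source_column="row_id", multiplier=5)
    for r in records
]
print(results)
```

In an actual plugin this logic would presumably live in the `generate` method of a generator whose metadata declares `GenerationStrategy.CELL_BY_CELL`, with the parameters read from `self.config`.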
diff --git a/docs/plugins/overview.md b/docs/plugins/overview.md
@@ -0,0 +1,45 @@
+# Data Designer Plugins
+
+!!! warning "Experimental Feature"
+    The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please [open an issue on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/issues/new/choose).
+
+## What are plugins?
+
+Plugins are Python packages that extend Data Designer's capabilities without modifying the core library. Similar to [VS Code extensions](https://marketplace.visualstudio.com/vscode) and [Pytest plugins](https://docs.pytest.org/en/stable/reference/plugin_list.html), the plugin system empowers you to build specialized extensions for your specific use cases and share them with the community.
+
+**Current capabilities**: Data Designer currently supports plugins for column generators (the column types you pass to the config builder's [add_column](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method).
+
+**Coming soon**: Plugin support for processors, validators, and more!
+
+## How do you use plugins?
+
+A Data Designer plugin is just a Python package configured with an [entry point](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata) that points to a Data Designer `Plugin` object. Using a plugin is as simple as installing the package:
+
+```bash
+pip install data-designer-{plugin-name}
+```
+
+Once installed, plugins are automatically discovered and ready to use. See the [example plugin](example.md) for a complete walkthrough.
+
+## How do you create plugins?
+
+Creating a plugin involves three main steps:
+
+### 1. Implement the Plugin Components
+
+- Create a task class inheriting from `ColumnGenerator`
+- Create a config class inheriting from `SingleColumnConfig`
+- Instantiate a `Plugin` object connecting them
+
+### 2. Package Your Plugin
+
+- Set up a Python package with `pyproject.toml`
+- Register your plugin using entry points
+- Define dependencies (including `data-designer`)
+
+### 3. Share Your Plugin
+
+- Publish to PyPI or another package index
+- Share with the community!
+
+**Ready to get started?** See the [Example Plugin](example.md) for a complete walkthrough!
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -23,13 +23,18 @@ nav:
       - Structured Outputs and Jinja Expressions: notebooks/2-structured-outputs-and-jinja-expressions.ipynb
       - Seeding with an External Dataset: notebooks/3-seeding-with-a-dataset.ipynb
       - Providing Images as Context: notebooks/4-providing-images-as-context.ipynb
+  - Plugins:
+      - Overview: plugins/overview.md
+      - Example Plugin: plugins/example.md
+      - Available Plugin List: plugins/available.md
   - Code Reference:
       - models: code_reference/models.md
       - column_configs: code_reference/column_configs.md
       - config_builder: code_reference/config_builder.md
       - data_designer_config: code_reference/data_designer_config.md
       - sampler_params: code_reference/sampler_params.md
       - validator_params: code_reference/validator_params.md
+      - analysis: code_reference/analysis.md
 
 theme:
   name: material
diff --git a/src/data_designer/engine/dataset_builders/column_wise_builder.py b/src/data_designer/engine/dataset_builders/column_wise_builder.py
@@ -171,6 +171,8 @@ def _run_cell_by_cell_generator(self, generator: ColumnGenerator) -> None:
         max_workers = MAX_CONCURRENCY_PER_NON_LLM_GENERATOR
         if isinstance(generator, WithLLMGeneration):
             max_workers = generator.inference_parameters.max_parallel_requests
+        elif hasattr(generator.config, "max_parallel_requests"):
+            max_workers = generator.config.max_parallel_requests
         self._fan_out_with_threads(generator, max_workers=max_workers)
 
     def _run_full_column_generator(self, generator: ColumnGenerator) -> None:
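
Note on the `column_wise_builder.py` change: the new `elif` lets a non-LLM generator's config set the worker count through a `max_parallel_requests` attribute. The sketch below isolates that fallback logic with stand-in classes so it can be run on its own. `PlainConfig`, `ParallelConfig`, `resolve_max_workers`, and the value `4` for the constant are all hypothetical, and the `WithLLMGeneration` branch is omitted.

```python
from dataclasses import dataclass

# Hypothetical stand-in value; the real constant is defined in Data Designer.
MAX_CONCURRENCY_PER_NON_LLM_GENERATOR = 4


@dataclass
class PlainConfig:
    """Stand-in config without a concurrency setting."""
    name: str


@dataclass
class ParallelConfig:
    """Stand-in config that exposes max_parallel_requests."""
    name: str
    max_parallel_requests: int


def resolve_max_workers(
    config: object,
    default: int = MAX_CONCURRENCY_PER_NON_LLM_GENERATOR,
) -> int:
    # Mirrors the added branch: prefer the config's value when the attribute exists.
    if hasattr(config, "max_parallel_requests"):
        return config.max_parallel_requests
    return default


print(resolve_max_workers(PlainConfig(name="a")))
print(resolve_max_workers(ParallelConfig(name="b", max_parallel_requests=16)))
```

One thing to keep in mind: the branch passes through whatever the attribute holds, so a config that allows `None` or a non-positive value would hand it to `_fan_out_with_threads` unchanged.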
