# Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

## Overview

Each processor:

- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to output)

Currently, processors run only at the `POST_BATCH` stage, i.e., after column generation completes for each batch.
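
This hand-off can be pictured as a simple fold over the batch. The sketch below is illustrative only: `run_processors` and the lambda processor are hypothetical stand-ins, not the library's API.

```python
import pandas as pd

# Hypothetical sketch of a POST_BATCH pipeline: each processor is a callable
# that takes the full batch DataFrame and returns the (possibly transformed)
# DataFrame, which is handed to the next processor.
def run_processors(batch: pd.DataFrame, processors) -> pd.DataFrame:
    for process in processors:
        batch = process(batch)
    return batch

batch = pd.DataFrame({"question": ["What is 2+2?"], "answer": ["4"], "debug": ["x"]})
drop_debug = lambda df: df.drop(columns=["debug"])  # stand-in processor
result = run_processors(batch, [drop_debug])
```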

## Processor Types

### 🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.

!!! tip "Prefer `drop=True` on the Column"
    Unlike other processors, the Drop Columns Processor usually doesn't need to be added explicitly: setting `drop=True` when configuring a column accomplishes the same thing.

**Configuration:**

```python
from data_designer.essentials import DropColumnsProcessorConfig

processor = DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```

**Behavior:**

- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added
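
The behavior above amounts to splitting the batch into kept and dropped columns. This is an illustrative approximation, not the library's internals, and `drop_columns` is a hypothetical helper:

```python
import pandas as pd

def drop_columns(df: pd.DataFrame, column_names: list[str]):
    """Split a batch into kept and dropped columns; the dropped part could
    then be written out separately (e.g., to a dropped-columns directory)."""
    present = [c for c in column_names if c in df.columns]
    missing = sorted(set(column_names) - set(present))
    if missing:
        # Missing columns warn but don't fail the build
        print(f"warning: columns not found, skipping: {missing}")
    dropped = df[present].copy()      # preserved for reference
    kept = df.drop(columns=present)
    return kept, dropped

df = pd.DataFrame({"question": ["Q"], "answer": ["A"], "debug_info": ["d"]})
kept, dropped = drop_columns(df, ["debug_info", "temp_calculation"])
```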

**Use Cases:**

- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset

### 🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

**Configuration:**

```python
from data_designer.essentials import SchemaTransformProcessorConfig

processor = SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```

**Behavior:**

- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-outputs/{name}/` directory
- The original dataset passes through unchanged

**Template Capabilities:**

- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: Arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`
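
One plausible way to picture the per-row rendering (a sketch, not the library's implementation; `render` is a hypothetical helper): every string leaf in the template is rendered with the row's columns as Jinja2 variables, while lists and dicts are walked recursively.

```python
from jinja2 import Template

def render(node, row: dict):
    """Recursively render string leaves of a template structure as Jinja2."""
    if isinstance(node, str):
        return Template(node).render(**row)
    if isinstance(node, list):
        return [render(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render(value, row) for key, value in node.items()}
    return node

template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}
row = {"question": "What is 2+2?", "answer": "4", "category": "math"}
record = render(template, row)
```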

**Use Cases:**

- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset

## Using Processors

Add processors to your configuration using the builder's `add_processor` method:

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    DropColumnsProcessorConfig,
    SchemaTransformProcessorConfig,
)

builder = DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```

### Execution Order

Processors execute in the order they're added. When one processor depends on another's output, add them in the order the data flow requires: for example, a schema transform that references a column must run before a drop processor that removes it.
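
An illustrative Jinja2 snippet (with hypothetical column names) shows what goes wrong when the order is reversed: once a column is dropped, the template variable is simply absent at render time.

```python
from jinja2 import Template

# Hypothetical rows before and after a drop processor removes "scratch_work".
row_before_drop = {"question": "Q", "scratch_work": "notes"}
row_after_drop = {"question": "Q"}

template = Template("{{ scratch_work }}")
before = template.render(**row_before_drop)  # renders the column's value
after = template.render(**row_after_drop)    # missing variable renders empty
```

With Jinja2's default `Undefined`, the missing variable silently becomes an empty string rather than raising an error, which makes ordering mistakes easy to miss.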

## Configuration Parameters

### Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | Identifier for the processor, used in output directory names |
| `build_stage` | BuildStage | When to run (default: `POST_BATCH`) |

### DropColumnsProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | list[str] | Columns to remove from output |

### SchemaTransformProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |