Commit d50a8ae

docs: add processors (#147)
* first draft
* adding to code reference as well
* docstrings
* addressing comments
* forgot opening line
* docstring too
1 parent 60c1aed commit d50a8ae

File tree

4 files changed: +219 additions, −2 deletions


docs/code_reference/processors.md

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# Processors

The `processors` module defines configuration objects for post-generation data transformations. Processors run after column generation and can modify the dataset schema or content before output.

::: data_designer.config.processors

docs/concepts/processors.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

## Overview

Each processor:

- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to the output)

Currently, processors run only at the `POST_BATCH` stage, i.e., after column generation completes for each batch.
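
The three steps above amount to a fold over the batch. As a minimal sketch (not the library's implementation; the real processors receive a pandas DataFrame, and `run_post_batch`, `drop_debug`, and `add_row_count` are hypothetical names), the chaining could look like:

```python
from typing import Callable

Batch = dict[str, list]  # simple stand-in for the batch DataFrame
Processor = Callable[[Batch], Batch]


def run_post_batch(batch: Batch, processors: list[Processor]) -> Batch:
    """Apply each processor in turn; the output of one feeds the next."""
    for process in processors:
        batch = process(batch)
    return batch


def drop_debug(batch: Batch) -> Batch:
    # A processor that removes a column from the batch
    return {name: values for name, values in batch.items() if name != "debug"}


def add_row_count(batch: Batch) -> Batch:
    # A processor that augments the batch with a derived column
    n = len(next(iter(batch.values()), []))
    return {**batch, "row_count": [n] * n}


result = run_post_batch({"text": ["a", "b"], "debug": ["x", "y"]}, [drop_debug, add_row_count])
print(result)  # {'text': ['a', 'b'], 'row_count': [2, 2]}
```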

## Processor Types

### 🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.

!!! tip "Dropping Columns Is More Easily Achieved via `drop=True`"
    The Drop Columns Processor differs from the others in that it does not need to be added explicitly: setting `drop=True` when configuring a column accomplishes the same thing.

**Configuration:**

```python
from data_designer.essentials import DropColumnsProcessorConfig

processor = DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```

**Behavior:**

- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added

**Use Cases:**

- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset

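To make the drop-columns behavior concrete, here is a hypothetical sketch of the semantics described above (plain dicts stand in for the DataFrame, and `drop_columns` is an illustrative helper, not the library's API):

```python
import warnings

def drop_columns(batch: dict[str, list], column_names: list[str]) -> tuple[dict, dict]:
    """Remove the named columns from the batch, returning (kept, dropped)."""
    missing = [name for name in column_names if name not in batch]
    if missing:
        # A missing column produces a warning but does not fail the build
        warnings.warn(f"Columns not found, skipping: {missing}")
    dropped = {name: batch[name] for name in column_names if name in batch}
    kept = {name: values for name, values in batch.items() if name not in column_names}
    return kept, dropped

batch = {"question": ["Q1"], "answer": ["A1"], "debug_info": ["trace"]}
kept, dropped = drop_columns(batch, ["debug_info", "not_there"])
print(kept)     # {'question': ['Q1'], 'answer': ['A1']}
print(dropped)  # {'debug_info': ['trace']}
```

The dropped values are returned rather than discarded, mirroring how the real processor preserves them in a separate parquet file.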
### 🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

**Configuration:**

```python
from data_designer.essentials import SchemaTransformProcessorConfig

processor = SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```

**Behavior:**

- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-outputs/{name}/` directory
- The original dataset passes through unchanged

**Template Capabilities:**

- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`

**Use Cases:**

- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset

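To see what rendering such a template against one row produces, here is a minimal sketch using the `jinja2` package directly (`render_template` is a hypothetical helper for illustration, not part of the library):

```python
import json

from jinja2 import Template


def render_template(node, row: dict):
    """Recursively render a template structure (str/list/dict) against one row."""
    if isinstance(node, str):
        return Template(node).render(**row)
    if isinstance(node, list):
        return [render_template(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render_template(value, row) for key, value in node.items()}
    return node  # non-string leaves pass through unchanged


row = {"question": "What is 2+2?", "answer": "4", "category": "math"}
template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}

record = render_template(template, row)
print(json.dumps(record, indent=2))
```

Each row of the batch yields one such record, with `metadata` rendered to `"MATH"` by the `upper` filter.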
## Using Processors

Add processors to your configuration using the builder's `add_processor` method:

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    DropColumnsProcessorConfig,
    SchemaTransformProcessorConfig,
)

builder = DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```

### Execution Order

Processors execute in the order they're added. Plan accordingly when one processor's output affects another.
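
As a hypothetical illustration of why ordering matters, consider a transform that reads a column which another processor drops (plain functions stand in for processors here):

```python
def uppercase_category(batch: dict) -> dict:
    # Derives a new column from "category"
    return {**batch, "category_upper": [c.upper() for c in batch["category"]]}


def drop_category(batch: dict) -> dict:
    # Removes the "category" column
    return {k: v for k, v in batch.items() if k != "category"}


batch = {"category": ["math"]}

# Transform first, then drop: works as intended
ok = drop_category(uppercase_category(batch))
print(ok)  # {'category_upper': ['MATH']}

# Drop first, then transform: the source column is gone, so the transform fails
try:
    uppercase_category(drop_category(batch))
except KeyError as err:
    print(f"order matters: missing column {err}")
```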

## Configuration Parameters

### Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | `str` | Identifier for the processor, used in output directory names |
| `build_stage` | `BuildStage` | When to run (default: `POST_BATCH`) |

### DropColumnsProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | `list[str]` | Columns to remove from the output |

### SchemaTransformProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | `dict[str, Any]` | Jinja2 template defining the output schema. Must be JSON-serializable. |

mkdocs.yml

Lines changed: 2 additions & 0 deletions
```diff
@@ -18,6 +18,7 @@ nav:
     - Inference Parameters: concepts/models/inference-parameters.md
     - Columns: concepts/columns.md
     - Validators: concepts/validators.md
+    - Processors: concepts/processors.md
     - Person Sampling: concepts/person_sampling.md
   - Tutorials:
     - Overview: notebooks/README.md
@@ -44,6 +45,7 @@ nav:
     - data_designer_config: code_reference/data_designer_config.md
     - sampler_params: code_reference/sampler_params.md
     - validator_params: code_reference/validator_params.md
+    - processors: code_reference/processors.md
     - analysis: code_reference/analysis.md

 theme:
```

src/data_designer/config/processors.py

Lines changed: 58 additions & 2 deletions
```diff
@@ -16,11 +16,30 @@


 class ProcessorType(str, Enum):
+    """Enumeration of available processor types.
+
+    Attributes:
+        DROP_COLUMNS: Processor that removes specified columns from the output dataset.
+        SCHEMA_TRANSFORM: Processor that creates a new dataset with a transformed schema using Jinja2 templates.
+    """
+
     DROP_COLUMNS = "drop_columns"
     SCHEMA_TRANSFORM = "schema_transform"


 class ProcessorConfig(ConfigBase, ABC):
+    """Abstract base class for all processor configuration types.
+
+    Processors are transformations that run before or after columns are generated.
+    They can modify, reshape, or augment the dataset before it's saved.
+
+    Attributes:
+        name: Unique name of the processor, used to identify the processor in results
+            and to name output artifacts on disk.
+        build_stage: The stage at which the processor runs. Currently only `POST_BATCH`
+            is supported, meaning processors run after each batch of columns is generated.
+    """
+
     name: str = Field(
         description="The name of the processor, used to identify the processor in the results and to write the artifacts to disk.",
     )
@@ -38,19 +57,56 @@ def validate_build_stage(cls, v: BuildStage) -> BuildStage:
         return v


-def get_processor_config_from_kwargs(processor_type: ProcessorType, **kwargs) -> ProcessorConfig:
+def get_processor_config_from_kwargs(processor_type: ProcessorType, **kwargs: Any) -> ProcessorConfig:
+    """Create a processor configuration from a processor type and keyword arguments.
+
+    Args:
+        processor_type: The type of processor to create.
+        **kwargs: Additional keyword arguments passed to the processor constructor.
+
+    Returns:
+        A processor configuration object of the specified type.
+    """
     if processor_type == ProcessorType.DROP_COLUMNS:
         return DropColumnsProcessorConfig(**kwargs)
     elif processor_type == ProcessorType.SCHEMA_TRANSFORM:
         return SchemaTransformProcessorConfig(**kwargs)


 class DropColumnsProcessorConfig(ProcessorConfig):
-    column_names: list[str]
+    """Configuration for dropping columns from the output dataset.
+
+    This processor removes specified columns from the generated dataset. The dropped
+    columns are saved separately in a `dropped-columns` directory for reference.
+    When this processor is added via the config builder, the corresponding column
+    configs are automatically marked with `drop = True`.
+
+    Alternatively, you can set `drop = True` when configuring a column.
+
+    Attributes:
+        column_names: List of column names to remove from the output dataset.
+        processor_type: Discriminator field, always `ProcessorType.DROP_COLUMNS` for this configuration type.
+    """
+
+    column_names: list[str] = Field(description="List of column names to drop from the output dataset.")
     processor_type: Literal[ProcessorType.DROP_COLUMNS] = ProcessorType.DROP_COLUMNS


 class SchemaTransformProcessorConfig(ProcessorConfig):
+    """Configuration for transforming the dataset schema using Jinja2 templates.
+
+    This processor creates a new dataset with a transformed schema. Each key in the
+    template becomes a column in the output, and values are Jinja2 templates that
+    can reference any column in the batch. The transformed dataset is written to
+    a `processors-outputs/{processor_name}/` directory alongside the main dataset.
+
+    Attributes:
+        template: Dictionary defining the output schema. Keys are new column names,
+            values are Jinja2 templates (strings, lists, or nested structures).
+            Must be JSON-serializable.
+        processor_type: Discriminator field, always `ProcessorType.SCHEMA_TRANSFORM` for this configuration type.
+    """
+
     template: dict[str, Any] = Field(
         ...,
         description="""
```
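
The factory in `get_processor_config_from_kwargs` is a plain type dispatch. As a standalone sketch with simplified stand-in classes (dataclasses here, not the library's Pydantic models; the `ValueError` fallback is an addition for illustration):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class ProcessorType(str, Enum):
    DROP_COLUMNS = "drop_columns"
    SCHEMA_TRANSFORM = "schema_transform"


@dataclass
class DropColumnsConfig:
    name: str
    column_names: list[str]


@dataclass
class SchemaTransformConfig:
    name: str
    template: dict[str, Any]


def get_config_from_kwargs(processor_type: ProcessorType, **kwargs: Any):
    """Dispatch on the processor type to build the matching config object."""
    if processor_type == ProcessorType.DROP_COLUMNS:
        return DropColumnsConfig(**kwargs)
    elif processor_type == ProcessorType.SCHEMA_TRANSFORM:
        return SchemaTransformConfig(**kwargs)
    raise ValueError(f"Unknown processor type: {processor_type}")


cfg = get_config_from_kwargs(ProcessorType.DROP_COLUMNS, name="cleanup", column_names=["tmp"])
print(cfg)
```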
