# Processors

Processors are transformations that modify your dataset before or after columns are generated. They run at different stages and can reshape, filter, or augment the data.

!!! tip "When to Use Processors"
    Processors handle transformations that don't fit the "column" model: restructuring the schema for a specific output format, dropping intermediate columns in bulk, or applying batch-wide operations.

## Overview

Each processor:

- Receives the complete batch DataFrame
- Applies its transformation
- Passes the result to the next processor (or to output)

Currently, processors run only at the `POST_BATCH` stage, i.e., after column generation completes for each batch.
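
This hand-off can be pictured as a simple fold over the batch. The sketch below is illustrative only: `run_processors` and the lambda processor are hypothetical stand-ins, not the library's API.

```python
import pandas as pd

# Hypothetical sketch of a POST_BATCH pipeline: each processor is a callable
# that takes the full batch DataFrame and returns the (possibly transformed)
# DataFrame, which is handed to the next processor.
def run_processors(batch: pd.DataFrame, processors) -> pd.DataFrame:
    for process in processors:
        batch = process(batch)
    return batch

batch = pd.DataFrame({"question": ["What is 2+2?"], "answer": ["4"], "debug": ["x"]})
drop_debug = lambda df: df.drop(columns=["debug"])  # stand-in processor
result = run_processors(batch, [drop_debug])
```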

## Processor Types

### 🗑️ Drop Columns Processor

Removes specified columns from the output dataset. Dropped columns are saved separately in the `dropped-columns` directory for reference.

!!! tip "Prefer `drop=True` on the Column"
    Unlike other processors, the Drop Columns Processor usually doesn't need to be added explicitly: setting `drop=True` when configuring a column accomplishes the same thing.

**Configuration:**

```python
from data_designer.essentials import DropColumnsProcessorConfig

processor = DropColumnsProcessorConfig(
    name="remove_intermediate",
    column_names=["temp_calculation", "raw_input", "debug_info"],
)
```

**Behavior:**

- Columns specified in `column_names` are removed from the output
- Original values are preserved in a separate parquet file
- Missing columns produce a warning but don't fail the build
- Column configs are automatically marked with `drop=True` when this processor is added
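
The behavior above amounts to splitting the batch into kept and dropped columns. This is an illustrative approximation, not the library's internals, and `drop_columns` is a hypothetical helper:

```python
import pandas as pd

def drop_columns(df: pd.DataFrame, column_names: list[str]):
    """Split a batch into kept and dropped columns; the dropped part could
    then be written out separately (e.g., to a dropped-columns directory)."""
    present = [c for c in column_names if c in df.columns]
    missing = sorted(set(column_names) - set(present))
    if missing:
        # Missing columns warn but don't fail the build
        print(f"warning: columns not found, skipping: {missing}")
    dropped = df[present].copy()      # preserved for reference
    kept = df.drop(columns=present)
    return kept, dropped

df = pd.DataFrame({"question": ["Q"], "answer": ["A"], "debug_info": ["d"]})
kept, dropped = drop_columns(df, ["debug_info", "temp_calculation"])
```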

**Use Cases:**

- Removing intermediate columns used only for LLM context
- Cleaning up debug or validation columns before final output
- Separating sensitive data from the main dataset

### 🔄 Schema Transform Processor

Creates an additional dataset with a transformed schema using Jinja2 templates. The output is written to a separate directory alongside the main dataset.

**Configuration:**

```python
from data_designer.essentials import SchemaTransformProcessorConfig

processor = SchemaTransformProcessorConfig(
    name="chat_format",
    template={
        "messages": [
            {"role": "user", "content": "{{ question }}"},
            {"role": "assistant", "content": "{{ answer }}"},
        ],
        "metadata": "{{ category | upper }}",
    },
)
```

**Behavior:**

- Each key in `template` becomes a column in the transformed dataset
- Values are Jinja2 templates with access to all columns in the batch
- Complex structures (lists, nested dicts) are supported
- Output is saved to the `processors-outputs/{name}/` directory
- The original dataset passes through unchanged

**Template Capabilities:**

- **Variable substitution**: `{{ column_name }}`
- **Filters**: `{{ text | upper }}`, `{{ text | lower }}`, `{{ text | trim }}`
- **Nested structures**: Arbitrarily deep JSON structures
- **Lists**: `["{{ col1 }}", "{{ col2 }}"]`
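
One plausible way to picture the per-row rendering (a sketch, not the library's implementation; `render` is a hypothetical helper): every string leaf in the template is rendered with the row's columns as Jinja2 variables, while lists and dicts are walked recursively.

```python
from jinja2 import Template

def render(node, row: dict):
    """Recursively render string leaves of a template structure as Jinja2."""
    if isinstance(node, str):
        return Template(node).render(**row)
    if isinstance(node, list):
        return [render(item, row) for item in node]
    if isinstance(node, dict):
        return {key: render(value, row) for key, value in node.items()}
    return node

template = {
    "messages": [
        {"role": "user", "content": "{{ question }}"},
        {"role": "assistant", "content": "{{ answer }}"},
    ],
    "metadata": "{{ category | upper }}",
}
row = {"question": "What is 2+2?", "answer": "4", "category": "math"}
record = render(template, row)
```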

**Use Cases:**

- Converting flat columns to chat message format
- Restructuring data for specific model training formats
- Creating derived views without modifying the source dataset

## Using Processors

Add processors to your configuration using the builder's `add_processor` method:

```python
from data_designer.essentials import (
    DataDesignerConfigBuilder,
    DropColumnsProcessorConfig,
    SchemaTransformProcessorConfig,
)

builder = DataDesignerConfigBuilder()

# ... add columns ...

# Drop intermediate columns
builder.add_processor(
    DropColumnsProcessorConfig(
        name="cleanup",
        column_names=["scratch_work", "raw_context"],
    )
)

# Transform to chat format
builder.add_processor(
    SchemaTransformProcessorConfig(
        name="chat_format",
        template={
            "messages": [
                {"role": "user", "content": "{{ question }}"},
                {"role": "assistant", "content": "{{ answer }}"},
            ],
        },
    )
)
```

### Execution Order

Processors execute in the order they're added. When one processor depends on another's output, add them in the order the data flow requires: for example, a schema transform that references a column must run before a drop processor that removes it.
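
An illustrative Jinja2 snippet (with hypothetical column names) shows what goes wrong when the order is reversed: once a column is dropped, the template variable is simply absent at render time.

```python
from jinja2 import Template

# Hypothetical rows before and after a drop processor removes "scratch_work".
row_before_drop = {"question": "Q", "scratch_work": "notes"}
row_after_drop = {"question": "Q"}

template = Template("{{ scratch_work }}")
before = template.render(**row_before_drop)  # renders the column's value
after = template.render(**row_after_drop)    # missing variable renders empty
```

With Jinja2's default `Undefined`, the missing variable silently becomes an empty string rather than raising an error, which makes ordering mistakes easy to miss.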

## Configuration Parameters

### Common Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `name` | str | Identifier for the processor, used in output directory names |
| `build_stage` | BuildStage | When to run (default: `POST_BATCH`) |

### DropColumnsProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `column_names` | list[str] | Columns to remove from output |

### SchemaTransformProcessorConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `template` | dict[str, Any] | Jinja2 template defining the output schema. Must be JSON-serializable. |