| 1 | +# Validators |
| 2 | + |
| 3 | +Validators are quality assurance mechanisms in Data Designer that check generated content against rules and return structured pass/fail results. They enable automated verification of data for correctness, code quality, and adherence to specifications. |
| 4 | + |
| 5 | +!!! note "Quality Gates for Generated Data" |
| 6 | + Validators act as **quality gates** in your generation pipeline. Use them to filter invalid records, score code quality, verify format compliance, or integrate with external validation services. |
| 7 | + |
| 8 | +## Overview |
| 9 | + |
| 10 | +Validation columns execute validation logic against target columns and produce structured results indicating: |
| 11 | + |
| 12 | +- **`is_valid`**: Boolean pass/fail status |
| 13 | +- **Additional metadata**: Error messages, scores, severity levels, and custom fields |
| 14 | + |
| 15 | +Validators currently support three execution strategies: |
| 16 | + |
| 17 | +1. **Code validation**: Lint and check Python or SQL code using industry-standard tools |
| 18 | +2. **Local callable validation**: Execute custom Python functions for flexible validation logic |
| 19 | +3. **Remote validation**: Send data to HTTP endpoints for external validation services |
| 20 | + |
| 21 | +## Validator Types |
| 22 | + |
| 23 | +### 🐍 Python Code Validator |
| 24 | + |
| 25 | +The Python code validator runs generated Python code through [Ruff](https://github.com/astral-sh/ruff), a fast Python linter that checks for syntax errors, undefined variables, and code quality issues. |
| 26 | + |
| 27 | +**Configuration:** |
| 28 | + |
| 29 | +```python |
| 30 | +from data_designer.essentials import CodeLang, CodeValidatorParams |
| 31 | + |
| 32 | +validator_params = CodeValidatorParams(code_lang=CodeLang.PYTHON) |
| 33 | +``` |
| 34 | + |
| 35 | +**Validation Output:** |
| 36 | + |
| 37 | +Each validated record returns: |
| 38 | + |
| 39 | +- **`is_valid`**: `True` if no fatal or error-level issues found |
| 40 | +- **`python_linter_score`**: Quality score from 0-10 (based on pylint formula) |
| 41 | +- **`python_linter_severity`**: Highest severity level found (`"none"`, `"convention"`, `"refactor"`, `"warning"`, `"error"`, `"fatal"`) |
| 42 | +- **`python_linter_messages`**: List of linter messages with line numbers, columns, and descriptions |
| 43 | + |
| 44 | +**Severity Levels:** |
| 45 | + |
| 46 | +- **Fatal**: Syntax errors preventing code execution |
| 47 | +- **Error**: Undefined names, invalid syntax |
| 48 | +- **Warning**: Code smells and potential issues |
| 49 | +- **Refactor**: Simplification opportunities |
| 50 | +- **Convention**: Style guide violations |
| 51 | + |
| 52 | +A record is marked valid if it has no messages or only messages at warning/convention/refactor levels. |
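The validity rule above can be sketched in a few lines of plain Python. This is an illustration only, not Data Designer's implementation: the `summarize` helper and the sample messages are hypothetical, and the severity ordering is assumed to follow the order in which the levels are listed for `python_linter_severity`.

```python
# Severity names in ascending order (assumed from the list in this section).
SEVERITY_ORDER = ["none", "convention", "refactor", "warning", "error", "fatal"]
# Levels that make a record invalid, per the rule stated above.
BLOCKING = {"error", "fatal"}


def summarize(messages):
    """Return (highest_severity, is_valid) for a list of linter message dicts."""
    highest = "none"
    for msg in messages:
        if SEVERITY_ORDER.index(msg["type"]) > SEVERITY_ORDER.index(highest):
            highest = msg["type"]
    return highest, highest not in BLOCKING


# Made-up messages shaped like the entries in `python_linter_messages`.
only_style = [{"type": "convention", "line": 3, "column": 1, "message": "style issue"}]
with_error = only_style + [
    {"type": "error", "line": 1, "column": 7, "message": "Undefined name `it`"}
]

print(summarize([]))          # ('none', True)
print(summarize(only_style))  # ('convention', True)
print(summarize(with_error))  # ('error', False)
```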
| 53 | + |
| 54 | +**Example Validation Result:** |
| 55 | + |
| 56 | +```python |
| 57 | +{ |
| 58 | + "is_valid": False, |
| 59 | + "python_linter_score": 0, |
| 60 | + "python_linter_severity": "error", |
| 61 | + "python_linter_messages": [ |
| 62 | + { |
| 63 | + "type": "error", |
| 64 | + "symbol": "F821", |
| 65 | + "line": 1, |
| 66 | + "column": 7, |
| 67 | + "message": "Undefined name `it`" |
| 68 | + } |
| 69 | + ] |
| 70 | +} |
| 71 | +``` |
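For intuition about the score of 0 in the result above, here is a sketch of pylint's default scoring expression, which `python_linter_score` is described as being based on. How Data Designer counts statements and maps Ruff rules onto these categories is not covered in this section, so the function name and its inputs are assumptions.

```python
def pylint_style_score(fatal, error, warning, refactor, convention, statements):
    """Score in [0, 10] following pylint's default evaluation expression.

    All arguments are message counts except `statements`, the number of
    statements in the analyzed code (assumed to be at least 1).
    """
    if statements < 1:
        raise ValueError("statements must be at least 1")
    if fatal:
        return 0.0
    penalty = (5 * error + warning + refactor + convention) / statements
    return max(0.0, 10.0 - penalty * 10)


# One error in a one-statement snippet drives the score to the floor.
print(pylint_style_score(0, 1, 0, 0, 0, statements=1))   # 0.0
# Two warnings spread over twenty statements cost one point.
print(pylint_style_score(0, 0, 2, 0, 0, statements=20))  # 9.0
```

Errors weigh five times as much as other message types, and the penalty is normalized by code length, so short snippets with a single error score very low.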
| 72 | + |
| 73 | +### 🗄️ SQL Code Validator |
| 74 | + |
| 75 | +The SQL code validator uses [SQLFluff](https://github.com/sqlfluff/sqlfluff), a dialect-aware SQL linter that checks query syntax and structure. |
| 76 | + |
| 77 | +**Configuration:** |
| 78 | + |
| 79 | +```python |
| 80 | +from data_designer.essentials import CodeLang, CodeValidatorParams |
| 81 | + |
| 82 | +validator_params = CodeValidatorParams(code_lang=CodeLang.SQL_POSTGRES) |
| 83 | +``` |
| 84 | + |
| 85 | +!!! tip "Multiple Dialects" |
| 86 | + The SQL code validator supports multiple dialects: `SQL_POSTGRES`, `SQL_ANSI`, `SQL_MYSQL`, `SQL_SQLITE`, `SQL_TSQL` and `SQL_BIGQUERY`. |
| 87 | + |
| 88 | +**Validation Output:** |
| 89 | + |
| 90 | +Each validated record returns: |
| 91 | + |
| 92 | +- **`is_valid`**: `True` if no parsing errors found |
| 93 | +- **`error_messages`**: Concatenated error descriptions (empty string if valid) |
| 94 | + |
| 95 | +The validator focuses on parsing errors (PRS codes) that indicate malformed SQL. It also checks for common pitfalls like `DECIMAL` definitions without scale parameters. |
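To make the `DECIMAL` pitfall concrete, the sketch below flags `DECIMAL` columns declared without a scale. It is a hypothetical illustration using a regular expression; the validator's actual check is not shown in this section and may work differently (for example, this sketch does not skip comments or string literals).

```python
import re

# Matches DECIMAL unless it is followed by both precision and scale,
# e.g. it matches `DECIMAL` and `DECIMAL(10)` but not `DECIMAL(10, 2)`.
DECIMAL_WITHOUT_SCALE = re.compile(
    r"\bDECIMAL\b(?!\s*\(\s*\d+\s*,\s*\d+\s*\))", re.IGNORECASE
)


def has_decimal_without_scale(sql):
    """Return True if any DECIMAL in the statement lacks a scale parameter."""
    return DECIMAL_WITHOUT_SCALE.search(sql) is not None


print(has_decimal_without_scale("CREATE TABLE t (price DECIMAL(10, 2))"))  # False
print(has_decimal_without_scale("CREATE TABLE t (price DECIMAL(10))"))     # True
print(has_decimal_without_scale("CREATE TABLE t (price DECIMAL)"))         # True
```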
| 96 | + |
| 97 | +**Example Validation Result:** |
| 98 | + |
| 99 | +```python |
| 100 | +# Valid SQL |
| 101 | +{ |
| 102 | + "is_valid": True, |
| 103 | + "error_messages": "" |
| 104 | +} |
| 105 | + |
| 106 | +# Invalid SQL |
| 107 | +{ |
| 108 | + "is_valid": False, |
| 109 | + "error_messages": "PRS: Line 1, Position 1: Found unparsable section: 'NOT SQL'" |
| 110 | +} |
| 111 | +``` |
| 112 | + |
| 113 | +### 🔧 Local Callable Validator |
| 114 | + |
| 115 | +The local callable validator executes custom Python functions for flexible validation logic. |
| 116 | + |
| 117 | +**Configuration:** |
| 118 | + |
| 119 | +```python |
| 120 | +import pandas as pd |
| 121 | + |
| 122 | +from data_designer.essentials import LocalCallableValidatorParams |
| 123 | + |
| 124 | +def my_validation_function(df: pd.DataFrame) -> pd.DataFrame: |
| 125 | + """Validate that values are positive. |
| 126 | + |
| 127 | + Args: |
| 128 | + df: DataFrame with target columns |
| 129 | + |
| 130 | + Returns: |
| 131 | + DataFrame with is_valid column and optional metadata |
| 132 | + """ |
| 133 | + result = pd.DataFrame() |
| 134 | + result["is_valid"] = df["price"] > 0 |
| 135 | + result["error_message"] = result["is_valid"].apply( |
| 136 | + lambda valid: "" if valid else "Price must be positive" |
| 137 | + ) |
| 138 | + return result |
| 139 | + |
| 140 | +validator_params = LocalCallableValidatorParams( |
| 141 | + validation_function=my_validation_function, |
| 142 | + output_schema={ # Optional: enforce output schema |
| 143 | + "type": "object", |
| 144 | + "properties": { |
| 145 | + "data": { |
| 146 | + "type": "array", |
| 147 | + "items": { |
| 148 | + "type": "object", |
| 149 | + "properties": { |
| 150 | + "is_valid": {"type": ["boolean", "null"]}, |
| 151 | + "error_message": {"type": "string"} |
| 152 | + }, |
| 153 | + "required": ["is_valid"] |
| 154 | + } |
| 155 | + } |
| 156 | + } |
| 157 | + } |
| 158 | +) |
| 159 | +``` |
| 160 | + |
| 161 | +**Function Requirements:** |
| 162 | + |
| 163 | +- **Input**: DataFrame with target columns |
| 164 | +- **Output**: DataFrame with `is_valid` column (boolean or null) |
| 165 | +- **Extra fields**: Any additional columns become validation metadata |
| 166 | + |
| 167 | +The `output_schema` parameter is optional but recommended: it validates the function's output against a JSON schema and catches unexpected return formats. |
| 168 | + |
| 169 | +### 🌐 Remote Validator |
| 170 | + |
| 171 | +The remote validator sends data to HTTP endpoints for validation-as-a-service. This is useful when your validation software must run on external compute and can be exposed as a service. Examples include: |
| 172 | + |
| 173 | +- External linting services |
| 174 | +- Security scanners |
| 175 | +- Domain-specific validators |
| 176 | +- Proprietary validation systems |
| 177 | + |
| 178 | +!!! note "Authentication" |
| 179 | + Currently, the remote validator can only make unauthenticated API calls. If you implement your own service, you can rely on network isolation for security. To reach a service that requires authentication, route requests through a local proxy. |
| 180 | + |
| 181 | +**Configuration:** |
| 182 | + |
| 183 | +```python |
| 184 | +from data_designer.essentials import RemoteValidatorParams |
| 185 | + |
| 186 | +validator_params = RemoteValidatorParams( |
| 187 | + endpoint_url="https://api.example.com/validate", |
| 188 | + timeout=30.0, # Request timeout in seconds |
| 189 | + max_retries=3, # Retry attempts on failure |
| 190 | + retry_backoff=2.0, # Exponential backoff factor |
| 191 | + max_parallel_requests=4, # Concurrent request limit |
| 192 | + output_schema={ # Optional: enforce response schema |
| 193 | + "type": "object", |
| 194 | + "properties": { |
| 195 | + "data": { |
| 196 | + "type": "array", |
| 197 | + "items": { |
| 198 | + "type": "object", |
| 199 | + "properties": { |
| 200 | + "is_valid": {"type": ["boolean", "null"]}, |
| 201 | + "confidence": {"type": "string"} |
| 202 | + } |
| 203 | + } |
| 204 | + } |
| 205 | + } |
| 206 | + } |
| 207 | +) |
| 208 | +``` |
| 209 | + |
| 210 | +**Request Format:** |
| 211 | + |
| 212 | +The validator sends POST requests with this structure: |
| 213 | + |
| 214 | +```json |
| 215 | +{ |
| 216 | + "data": [ |
| 217 | + {"column1": "value1", "column2": "value2"}, |
| 218 | + {"column1": "value3", "column2": "value4"} |
| 219 | + ] |
| 220 | +} |
| 221 | +``` |
| 222 | + |
| 223 | +**Expected Response Format:** |
| 224 | + |
| 225 | +The endpoint must return: |
| 226 | + |
| 227 | +```json |
| 228 | +{ |
| 229 | + "data": [ |
| 230 | + { |
| 231 | + "is_valid": true, |
| 232 | + "custom_field": "any additional metadata" |
| 233 | + }, |
| 234 | + { |
| 235 | + "is_valid": false, |
| 236 | + "custom_field": "more metadata" |
| 237 | + } |
| 238 | + ] |
| 239 | +} |
| 240 | +``` |
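A service that speaks this request and response format can be quite small. The sketch below uses only the Python standard library; the `text` field, the `reason` metadata key, the emptiness rule, and the port are all hypothetical choices for illustration. It also assumes that results should come back in the same order as the input records, which the examples above suggest but do not state.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def validate_batch(payload):
    """Build one result per input record, preserving input order."""
    results = []
    for record in payload["data"]:
        # Hypothetical rule: the record's "text" field must be non-empty.
        ok = bool(str(record.get("text", "")).strip())
        results.append({"is_valid": ok, "reason": "" if ok else "text is empty"})
    return {"data": results}


class ValidateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, validate it, and reply with JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(validate_batch(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Run the service until interrupted (blocks the calling thread)."""
    HTTPServer((host, port), ValidateHandler).serve_forever()


# Exercise the validation logic directly, without starting the server.
print(validate_batch({"data": [{"text": "hello"}, {"text": "   "}]}))
```

Keeping the validation logic in a plain function, separate from the HTTP handler, makes it easy to unit test before pointing `endpoint_url` at the running service.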
| 241 | + |
| 242 | +**Retry Behavior:** |
| 243 | + |
| 244 | +The validator automatically retries on: |
| 245 | + |
| 246 | +- Network errors |
| 247 | +- HTTP status codes: 429 (rate limit), 500, 502, 503, 504 |
| 248 | + |
| 249 | +Failed requests use exponential backoff: `delay = retry_backoff^attempt`. |
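As a quick worked example of the backoff formula, the helper below lists the delay before each retry. The `backoff_delays` name is hypothetical, and whether the first retry counts as attempt 0 or attempt 1 is not specified here, so the starting index is left as a parameter.

```python
def backoff_delays(retry_backoff, max_retries, first_attempt=1):
    """Delay in seconds before each retry: retry_backoff ** attempt."""
    attempts = range(first_attempt, first_attempt + max_retries)
    return [retry_backoff ** attempt for attempt in attempts]


# With retry_backoff=2.0 and max_retries=3, as in the configuration above:
print(backoff_delays(2.0, 3))                   # [2.0, 4.0, 8.0]
print(backoff_delays(2.0, 3, first_attempt=0))  # [1.0, 2.0, 4.0]
```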
| 250 | + |
| 251 | +**Parallelization:** |
| 252 | + |
| 253 | +Set `max_parallel_requests` to control concurrency. Higher values improve throughput but increase server load. The validator batches requests according to the `batch_size` parameter in the validation column configuration. |
| 254 | + |
| 255 | +## Using Validators in Columns |
| 256 | + |
| 257 | +Add validation columns to your configuration using the builder's `add_column` method: |
| 258 | + |
| 259 | +```python |
| 260 | +from data_designer.essentials import ( |
| 261 | + CodeValidatorParams, |
| 262 | + CodeLang, |
| 263 | + DataDesignerConfigBuilder, |
| 264 | + LLMCodeColumnConfig, |
| 265 | + ValidationColumnConfig, |
| 266 | +) |
| 267 | + |
| 268 | +builder = DataDesignerConfigBuilder() |
| 269 | + |
| 270 | +# Generate Python code |
| 271 | +builder.add_column( |
| 272 | + LLMCodeColumnConfig( |
| 273 | + name="sorting_algorithm", |
| 274 | + prompt="Write a Python function to sort a list using bubble sort.", |
| 275 | + code_lang="python", |
| 276 | + model_alias="my-model" |
| 277 | + ) |
| 278 | +) |
| 279 | + |
| 280 | +# Validate the generated code |
| 281 | +builder.add_column( |
| 282 | + ValidationColumnConfig( |
| 283 | + name="code_validation", |
| 284 | + target_columns=["sorting_algorithm"], |
| 285 | + validator_type="code", |
| 286 | + validator_params=CodeValidatorParams(code_lang=CodeLang.PYTHON), |
| 287 | + batch_size=10, |
| 288 | + drop=False, |
| 289 | + ) |
| 290 | +) |
| 291 | +``` |
| 292 | + |
| 293 | +The `target_columns` parameter specifies which columns to validate. All target columns are passed to the validator together (except for code validators, which process each column separately). |
| 294 | + |
| 295 | +### Configuration Parameters |
| 296 | + |
| 297 | +For the full list of parameters accepted by `ValidationColumnConfig`, see the [code reference](../../code_reference/column_configs/#data_designer.config.column_configs.ValidationColumnConfig). |
| 298 | + |
| 299 | +### Batch Size Considerations |
| 300 | + |
| 301 | +Larger batch sizes improve efficiency but consume more memory. Suggested ranges by validator type: |
| 302 | + |
| 303 | +- **Code validators**: 5-20 records (file I/O overhead) |
| 304 | +- **Local callable**: 10-50 records (depends on function complexity) |
| 305 | +- **Remote validators**: 1-10 records (network latency, server capacity) |
| 306 | + |
| 307 | +Adjust based on: |
| 308 | + |
| 309 | +- Validator computational cost |
| 310 | +- Available memory |
| 311 | +- Network bandwidth (for remote validators) |
| 312 | +- Server rate limits |
| 313 | + |
| 314 | +Validation runs one batch at a time. If the validation logic compares a record against other records, it sees only the records in the same batch. |
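The effect of batch boundaries on cross-record checks can be seen with a small duplicate detector. This is a plain-Python sketch with hypothetical helper names (`batches`, `flag_duplicates`); it does not use Data Designer itself.

```python
def batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def flag_duplicates(batch):
    """Mark a record invalid if the same value appeared earlier in the batch."""
    seen = set()
    results = []
    for value in batch:
        results.append(value not in seen)
        seen.add(value)
    return results


records = ["a", "b", "a", "c", "a", "d"]

# Validating in batches of 3: the third "a" lands in a new batch and passes.
per_batch = [ok for batch in batches(records, 3) for ok in flag_duplicates(batch)]
print(per_batch)  # [True, True, False, True, True, True]

# Validating everything at once catches both repeats.
print(flag_duplicates(records))  # [True, True, False, True, False, True]
```

If a check needs to see the whole dataset, it has to run outside the batched validation step or with a batch size large enough to hold every record.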
| 315 | + |
| 316 | +### Multiple Column Validation |
| 317 | + |
| 318 | +Validate multiple columns simultaneously: |
| 319 | + |
| 320 | +```python |
| 321 | +from data_designer.essentials import RemoteValidatorParams, ValidationColumnConfig |
| 322 | + |
| 323 | +builder.add_column( |
| 324 | + ValidationColumnConfig( |
| 325 | + name="multi_column_validation", |
| 326 | + target_columns=["column_a", "column_b", "column_c"], |
| 327 | + validator_type="remote", |
| 328 | + validator_params=RemoteValidatorParams( |
| 329 | + endpoint_url="https://api.example.com/validate" |
| 330 | + ) |
| 331 | + ) |
| 332 | +) |
| 333 | +``` |
| 334 | + |
| 335 | +**Note**: Code validators always process each target column separately, even when multiple columns are specified. Local callable and remote validators receive all target columns together. |
| 336 | + |
| 337 | +## See Also |
| 338 | + |
| 339 | +- [Validator Parameters Reference](../code_reference/validator_params.md): Configuration object schemas |
| 340 | + |