
Commit 01fbf4d

andreatgretel, nabinchha, and johnnygreco authored
docs: validators etc. (#45)
* got a little help from Claude, will still double check everything
* fixing, adding docstrings
* forgotten file + overview to tutorial
* minor
* applying suggestions

  Co-authored-by: Nabin Mulepati <[email protected]>
  Co-authored-by: Johnny Greco <[email protected]>
* addressing comments pt1
* addressing comments pt2
* trying something out
* fix
* typo
* trying again
* rollback workflow, add download links
* minor
* adapting notebooks to use fakersampler

---------

Co-authored-by: Nabin Mulepati <[email protected]>
Co-authored-by: Johnny Greco <[email protected]>
1 parent 0be3c10 commit 01fbf4d

8 files changed: +1514 -1139 lines changed

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,6 @@
1+
# Validator Parameters
2+
3+
When creating a `ValidationColumnConfig`, two parameters define the validator: `validator_type` and `validator_params`.
The `validator_type` parameter can be set to `code`, `local_callable`, or `remote`. The `validator_params` class that accompanies each type is documented below:
5+
6+
::: data_designer.config.validator_params

docs/concepts/validators.md

Lines changed: 340 additions & 0 deletions
@@ -0,0 +1,340 @@
1+
# Validators
2+
3+
Validators are quality assurance mechanisms in Data Designer that check generated content against rules and return structured pass/fail results. They enable automated verification of data for correctness, code quality, and adherence to specifications.
4+
5+
!!! note "Quality Gates for Generated Data"
    Validators act as **quality gates** in your generation pipeline. Use them to filter invalid records, score code quality, verify format compliance, or integrate with external validation services.
7+
8+
## Overview
9+
10+
Validation columns execute validation logic against target columns and produce structured results indicating:
11+
12+
- **`is_valid`**: Boolean pass/fail status
13+
- **Additional metadata**: Error messages, scores, severity levels, and custom fields
14+
15+
Validators currently support three execution strategies:
16+
17+
1. **Code validation**: Lint and check Python or SQL code using industry-standard tools
18+
2. **Local callable validation**: Execute custom Python functions for flexible validation logic
19+
3. **Remote validation**: Send data to HTTP endpoints for external validation services
20+
21+
## Validator Types
22+
23+
### 🐍 Python Code Validator
24+
25+
The Python code validator runs generated Python code through [Ruff](https://github.com/astral-sh/ruff), a fast Python linter that checks for syntax errors, undefined variables, and code quality issues.
26+
27+
**Configuration:**
28+
29+
```python
from data_designer.essentials import CodeLang, CodeValidatorParams

validator_params = CodeValidatorParams(code_lang=CodeLang.PYTHON)
```
34+
35+
**Validation Output:**
36+
37+
Each validated record returns:
38+
39+
- **`is_valid`**: `True` if no fatal or error-level issues found
40+
- **`python_linter_score`**: Quality score from 0 to 10 (based on the pylint formula)
41+
- **`python_linter_severity`**: Highest severity level found (`"none"`, `"convention"`, `"refactor"`, `"warning"`, `"error"`, `"fatal"`)
42+
- **`python_linter_messages`**: List of linter messages with line numbers, columns, and descriptions
43+
44+
**Severity Levels:**
45+
46+
- **Fatal**: Syntax errors preventing code execution
47+
- **Error**: Undefined names, invalid syntax
48+
- **Warning**: Code smells and potential issues
49+
- **Refactor**: Simplification opportunities
50+
- **Convention**: Style guide violations
51+
52+
A record is marked valid if it has no messages or only messages at warning/convention/refactor levels.
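
The validity rule above can be sketched in a few lines. This is an illustration only, not Data Designer's implementation: `SEVERITY_ORDER` and `summarize` are hypothetical names, and the sketch assumes the severity values are ordered from least to most severe in the order they are listed in this section.

```python
# Hypothetical helper that mirrors the rule described above.
# It is not the library's code; the severity ordering is an assumption.
SEVERITY_ORDER = ["none", "convention", "refactor", "warning", "error", "fatal"]


def summarize(messages: list[dict]) -> dict:
    """Return the highest severity found and the resulting pass/fail flag."""
    highest = "none"
    for msg in messages:
        if SEVERITY_ORDER.index(msg["type"]) > SEVERITY_ORDER.index(highest):
            highest = msg["type"]
    # Only error- and fatal-level messages invalidate a record.
    is_valid = highest not in ("error", "fatal")
    return {"is_valid": is_valid, "python_linter_severity": highest}


print(summarize([{"type": "warning"}, {"type": "convention"}]))
print(summarize([{"type": "error", "symbol": "F821"}]))
```

A record with only warning- or convention-level messages still passes, while a single error-level message fails it.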
53+
54+
**Example Validation Result:**
55+
56+
```python
{
    "is_valid": False,
    "python_linter_score": 0,
    "python_linter_severity": "error",
    "python_linter_messages": [
        {
            "type": "error",
            "symbol": "F821",
            "line": 1,
            "column": 7,
            "message": "Undefined name `it`"
        }
    ]
}
```
72+
73+
### 🗄️ SQL Code Validator
74+
75+
The SQL code validator uses [SQLFluff](https://github.com/sqlfluff/sqlfluff), a dialect-aware SQL linter that checks query syntax and structure.
76+
77+
**Configuration:**
78+
79+
```python
from data_designer.essentials import CodeLang, CodeValidatorParams

validator_params = CodeValidatorParams(code_lang=CodeLang.SQL_POSTGRES)
```
84+
85+
!!! tip "Multiple Dialects"
    The SQL code validator supports multiple dialects: `SQL_POSTGRES`, `SQL_ANSI`, `SQL_MYSQL`, `SQL_SQLITE`, `SQL_TSQL`, and `SQL_BIGQUERY`.
87+
88+
**Validation Output:**
89+
90+
Each validated record returns:
91+
92+
- **`is_valid`**: `True` if no parsing errors found
93+
- **`error_messages`**: Concatenated error descriptions (empty string if valid)
94+
95+
The validator focuses on parsing errors (PRS codes) that indicate malformed SQL. It also checks for common pitfalls like `DECIMAL` definitions without scale parameters.
96+
97+
**Example Validation Result:**
98+
99+
```python
# Valid SQL
{
    "is_valid": True,
    "error_messages": ""
}

# Invalid SQL
{
    "is_valid": False,
    "error_messages": "PRS: Line 1, Position 1: Found unparsable section: 'NOT SQL'"
}
```
112+
113+
### 🔧 Local Callable Validator
114+
115+
The local callable validator executes custom Python functions for flexible validation logic.
116+
117+
**Configuration:**
118+
119+
```python
import pandas as pd

from data_designer.essentials import LocalCallableValidatorParams

def my_validation_function(df: pd.DataFrame) -> pd.DataFrame:
    """Validate that values are positive.

    Args:
        df: DataFrame with target columns

    Returns:
        DataFrame with is_valid column and optional metadata
    """
    result = pd.DataFrame()
    result["is_valid"] = df["price"] > 0
    result["error_message"] = result["is_valid"].apply(
        lambda valid: "" if valid else "Price must be positive"
    )
    return result

validator_params = LocalCallableValidatorParams(
    validation_function=my_validation_function,
    output_schema={  # Optional: enforce output schema
        "type": "object",
        "properties": {
            "data": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "is_valid": {"type": ["boolean", "null"]},
                        "error_message": {"type": "string"}
                    },
                    "required": ["is_valid"]
                }
            }
        }
    }
)
```
160+
161+
**Function Requirements:**
162+
163+
- **Input**: DataFrame with target columns
164+
- **Output**: DataFrame with `is_valid` column (boolean or null)
165+
- **Extra fields**: Any additional columns become validation metadata
166+
167+
The `output_schema` parameter is optional but recommended—it validates the function's output against a JSON schema, catching unexpected return formats.
168+
169+
### 🌐 Remote Validator
170+
171+
The remote validator sends data to HTTP endpoints for validation-as-a-service. This is useful when your validation software must run on external compute and can be exposed as a service. Some examples are:
172+
173+
- External linting services
174+
- Security scanners
175+
- Domain-specific validators
176+
- Proprietary validation systems
177+
178+
!!! note "Authentication"
    Currently, the remote validator can only make unauthenticated API calls. When implementing your own service, you can rely on network isolation for security. If you need to reach a service that requires authentication, implement a local proxy.
180+
181+
**Configuration:**
182+
183+
```python
from data_designer.essentials import RemoteValidatorParams

validator_params = RemoteValidatorParams(
    endpoint_url="https://api.example.com/validate",
    timeout=30.0,  # Request timeout in seconds
    max_retries=3,  # Retry attempts on failure
    retry_backoff=2.0,  # Exponential backoff factor
    max_parallel_requests=4,  # Concurrent request limit
    output_schema={  # Optional: enforce response schema
        "type": "object",
        "properties": {
            "data": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "is_valid": {"type": ["boolean", "null"]},
                        "confidence": {"type": "string"}
                    }
                }
            }
        }
    }
)
```
209+
210+
**Request Format:**
211+
212+
The validator sends POST requests with this structure:
213+
214+
```json
{
  "data": [
    {"column1": "value1", "column2": "value2"},
    {"column1": "value3", "column2": "value4"}
  ]
}
```
222+
223+
**Expected Response Format:**
224+
225+
The endpoint must return:
226+
227+
```json
{
  "data": [
    {
      "is_valid": true,
      "custom_field": "any additional metadata"
    },
    {
      "is_valid": false,
      "custom_field": "more metadata"
    }
  ]
}
```
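
To make the request and response shapes concrete, here is a minimal sketch of a compatible endpoint built only on Python's standard library. Everything in it is hypothetical: the `validate_records` rule (every value must be a non-empty string), the `ValidateHandler` and `serve` names, and the port. It also assumes the service returns one result per input record, in the same order, as the examples above suggest.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def validate_records(payload: dict) -> dict:
    """Build a response with one result per input record, in input order."""
    results = []
    for record in payload["data"]:
        # Hypothetical rule: every value must be a non-empty string.
        ok = all(isinstance(value, str) and value != "" for value in record.values())
        results.append(
            {"is_valid": ok, "custom_field": "ok" if ok else "empty or non-string value"}
        )
    return {"data": results}


class ValidateHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the JSON body, validate it, and reply with a JSON body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(validate_records(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the sketch server on localhost (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), ValidateHandler).serve_forever()
```

Calling `serve()` starts the server locally; a `RemoteValidatorParams` whose `endpoint_url` points at it could then be used for experimentation.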
241+
242+
**Retry Behavior:**
243+
244+
The validator automatically retries on:
245+
246+
- Network errors
247+
- HTTP status codes: 429 (rate limit), 500, 502, 503, 504
248+
249+
Failed requests use exponential backoff: `delay = retry_backoff^attempt`.
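
As a worked example of the formula, the sketch below lists the delay before each retry. Two assumptions are not stated in this document: that attempts are numbered starting from 1, and that the delay is expressed in seconds. `backoff_delays` is a hypothetical helper, not part of Data Designer.

```python
def backoff_delays(retry_backoff: float, max_retries: int) -> list[float]:
    """Delay before each retry, using delay = retry_backoff ** attempt."""
    # Assumption: attempts are numbered 1..max_retries.
    return [retry_backoff ** attempt for attempt in range(1, max_retries + 1)]


print(backoff_delays(2.0, 3))  # [2.0, 4.0, 8.0]
```

With `retry_backoff=2.0` and `max_retries=3`, the worst case under these assumptions adds 14 units of waiting on top of the request timeouts.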
250+
251+
**Parallelization:**
252+
253+
Set `max_parallel_requests` to control concurrency. Higher values improve throughput but increase server load. The validator batches requests according to the `batch_size` parameter in the validation column configuration.
254+
255+
## Using Validators in Columns
256+
257+
Add validation columns to your configuration using the builder's `add_column` method:
258+
259+
```python
from data_designer.essentials import (
    CodeValidatorParams,
    CodeLang,
    DataDesignerConfigBuilder,
    LLMCodeColumnConfig,
    ValidationColumnConfig,
)

builder = DataDesignerConfigBuilder()

# Generate Python code
builder.add_column(
    LLMCodeColumnConfig(
        name="sorting_algorithm",
        prompt="Write a Python function to sort a list using bubble sort.",
        code_lang="python",
        model_alias="my-model"
    )
)

# Validate the generated code
builder.add_column(
    ValidationColumnConfig(
        name="code_validation",
        target_columns=["sorting_algorithm"],
        validator_type="code",
        validator_params=CodeValidatorParams(code_lang=CodeLang.PYTHON),
        batch_size=10,
        drop=False,
    )
)
```
292+
293+
The `target_columns` parameter specifies which columns to validate. All target columns are passed to the validator together (except for code validators, which process each column separately).
294+
295+
### Configuration Parameters
296+
297+
For the full list of parameters accepted by `ValidationColumnConfig`, see the [code reference](../../code_reference/column_configs/#data_designer.config.column_configs.ValidationColumnConfig).
298+
299+
### Batch Size Considerations
300+
301+
Larger batch sizes improve efficiency but consume more memory. Typical ranges by validator type:
302+
303+
- **Code validators**: 5-20 records (file I/O overhead)
304+
- **Local callable**: 10-50 records (depends on function complexity)
305+
- **Remote validators**: 1-10 records (network latency, server capacity)
306+
307+
Adjust based on:
308+
309+
- Validator computational cost
310+
- Available memory
311+
- Network bandwidth (for remote validators)
312+
- Server rate limits
313+
314+
Validation runs one batch at a time. If the validation logic uses information from other samples (for example, a duplicate check), only the samples in the same batch are considered.
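
The effect of batch scoping can be shown with a small, self-contained sketch. The `batches` and `flag_duplicates` helpers are hypothetical and use plain lists rather than Data Designer's own batching; they only illustrate that a cross-sample check sees one batch at a time.

```python
def batches(records: list, batch_size: int):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def flag_duplicates(batch: list) -> list[dict]:
    """Mark a value invalid if it already appeared earlier in the same batch."""
    seen = set()
    results = []
    for value in batch:
        results.append({"is_valid": value not in seen})
        seen.add(value)
    return results


records = ["a", "b", "a", "c", "a", "d"]
for size in (2, 6):
    flags = [r["is_valid"] for b in batches(records, size) for r in flag_duplicates(b)]
    print(size, flags)
# 2 [True, True, True, True, True, True]
# 6 [True, True, False, True, False, True]
```

With a batch size of 2, the repeated `"a"` values land in different batches and are never flagged; with a batch size of 6, both repeats are caught.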
315+
316+
### Multiple Column Validation
317+
318+
Validate multiple columns simultaneously:
319+
320+
```python
from data_designer.essentials import RemoteValidatorParams, ValidationColumnConfig

builder.add_column(
    ValidationColumnConfig(
        name="multi_column_validation",
        target_columns=["column_a", "column_b", "column_c"],
        validator_type="remote",
        validator_params=RemoteValidatorParams(
            endpoint_url="https://api.example.com/validate"
        )
    )
)
```
334+
335+
**Note**: Code validators always process each target column separately, even when multiple columns are specified. Local callable and remote validators receive all target columns together.
336+
337+
## See Also
338+
339+
- [Validator Parameters Reference](../code_reference/validator_params.md): Configuration object schemas
340+

docs/notebooks/1-the-basics.ipynb

Lines changed: 3 additions & 1 deletion
@@ -4,7 +4,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## 🎨 Data Designer 101: The Basics\n",
+"# 🎨 Data Designer 101: The Basics\n",
+"\n",
+"[Click here](https://raw.githubusercontent.com/NVIDIA-NeMo/DataDesigner/refs/heads/main/docs/notebooks/1-the-basics.ipynb) to download this notebook to your computer.",
 "\n",
 "#### 📚 What you'll learn\n",
 "\n",
