Commit 0416b02

Merge pull request #312 from posit-dev/feat-yaml-translations-for-new-validations
feat: YAML support for recently added validations
2 parents 353fb3f + e934b13 commit 0416b02

File tree

4 files changed: +824 -0 lines changed

docs/user-guide/yaml-reference.qmd

Lines changed: 324 additions & 0 deletions
@@ -355,6 +355,57 @@ Template variables available for action strings:
      brief: "Values match pattern"         # OPTIONAL: Step description
```

`col_vals_within_spec`: do column data conform to a specification (email, URL, postal codes, etc.)?

```yaml
- col_vals_within_spec:
    columns: [column_name]                  # REQUIRED: Column(s) to validate
    spec: "email"                           # REQUIRED: Specification type
    na_pass: false                          # OPTIONAL: Pass NULL values
    pre: |                                  # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                             # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values match spec"              # OPTIONAL: Step description
```

Available specification types:

- `"email"` - Email addresses
- `"url"` - Internet URLs
- `"phone"` - Phone numbers
- `"ipv4"` - IPv4 addresses
- `"ipv6"` - IPv6 addresses
- `"mac"` - MAC addresses
- `"isbn"` - International Standard Book Numbers (10 or 13 digit)
- `"vin"` - Vehicle Identification Numbers
- `"credit_card"` - Credit card numbers (uses the Luhn algorithm)
- `"swift"` - Business Identifier Codes (SWIFT-BIC)
- `"postal_code[<country_code>]"` - Postal codes for specific countries (e.g., `"postal_code[US]"`, `"postal_code[CA]"`)
- `"zip"` - Alias for US ZIP codes (`"postal_code[US]"`)
- `"iban[<country_code>]"` - International Bank Account Numbers (e.g., `"iban[DE]"`, `"iban[FR]"`)
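The Luhn checksum behind the `credit_card` spec can be sketched in plain Python (an illustration of the algorithm only; `luhn_valid` is a hypothetical helper name, not the library's implementation):

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    # Double every second digit from the right (excluding the check digit),
    # subtracting 9 whenever the doubled value exceeds 9
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    # Valid numbers sum to a multiple of 10
    return sum(digits) % 10 == 0
```

Separators such as spaces or dashes are ignored, so formatted card numbers validate the same as raw digit strings.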
Examples:

```yaml
# Email validation
- col_vals_within_spec:
    columns: user_email
    spec: "email"

# US postal codes
- col_vals_within_spec:
    columns: zip_code
    spec: "postal_code[US]"

# German IBAN
- col_vals_within_spec:
    columns: account_number
    spec: "iban[DE]"
```

#### Custom Expression Methods

`col_vals_expr`: do column data agree with a predicate expression?
@@ -375,6 +426,104 @@ Template variables available for action strings:
      brief: "Custom validation rule"       # OPTIONAL: Step description
```

#### Trend Validation Methods

`col_vals_increasing`: are column data increasing row-by-row?

```yaml
- col_vals_increasing:
    columns: [column_name]                  # REQUIRED: Column(s) to validate
    allow_stationary: false                 # OPTIONAL: Allow consecutive equal values (default: false)
    decreasing_tol: 0.5                     # OPTIONAL: Tolerance for negative movement (default: null)
    na_pass: false                          # OPTIONAL: Pass NULL values
    pre: |                                  # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                             # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must increase"           # OPTIONAL: Step description
```

This validation checks whether values in a column increase as you move down the rows. Useful for validating time-series data, sequence numbers, or any monotonically increasing values.

Parameters:

- `allow_stationary`: If `true`, allows consecutive values to be equal (stationary phases). For example, `[1, 2, 2, 3]` would pass when `true` but fail at the third value when `false`.
- `decreasing_tol`: Absolute tolerance for negative movement. Setting this to `0.5` means values can decrease by up to 0.5 units and still pass. Setting any value also sets `allow_stationary` to `true`.
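The parameter semantics above can be sketched as a plain-Python pairwise check (an illustrative sketch of the documented rules, not the library's implementation; `check_increasing` is a hypothetical name):

```python
def check_increasing(values, allow_stationary=False, decreasing_tol=None):
    """Return one pass/fail flag per consecutive pair of values."""
    if decreasing_tol is not None:
        allow_stationary = True  # a tolerance implies ties are allowed
    results = []
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if diff > 0:
            ok = True                       # strictly increasing step
        elif diff == 0:
            ok = allow_stationary           # tie: only OK when allowed
        else:
            # small decrease: OK only within the stated tolerance
            ok = decreasing_tol is not None and -diff <= decreasing_tol
        results.append(ok)
    return results
```

For `col_vals_decreasing` the logic is mirrored: positive movement is the failure direction and `increasing_tol` bounds how much upward movement is tolerated.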
Examples:

```yaml
# Strict increasing validation
- col_vals_increasing:
    columns: timestamp_seconds
    brief: "Timestamps must strictly increase"

# Allow stationary values
- col_vals_increasing:
    columns: version_number
    allow_stationary: true
    brief: "Version numbers should increase (ties allowed)"

# With tolerance for small decreases
- col_vals_increasing:
    columns: temperature
    decreasing_tol: 0.1
    brief: "Temperature trend (small drops allowed)"
```

`col_vals_decreasing`: are column data decreasing row-by-row?

```yaml
- col_vals_decreasing:
    columns: [column_name]                  # REQUIRED: Column(s) to validate
    allow_stationary: false                 # OPTIONAL: Allow consecutive equal values (default: false)
    increasing_tol: 0.5                     # OPTIONAL: Tolerance for positive movement (default: null)
    na_pass: false                          # OPTIONAL: Pass NULL values
    pre: |                                  # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                             # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must decrease"           # OPTIONAL: Step description
```

This validation checks whether values in a column decrease as you move down the rows. Useful for countdown timers, inventory depletion, or any monotonically decreasing values.

Parameters:

- `allow_stationary`: If `true`, allows consecutive values to be equal (stationary phases). For example, `[10, 8, 8, 5]` would pass when `true` but fail at the third value when `false`.
- `increasing_tol`: Absolute tolerance for positive movement. Setting this to `0.5` means values can increase by up to 0.5 units and still pass. Setting any value also sets `allow_stationary` to `true`.
Examples:

```yaml
# Strict decreasing validation
- col_vals_decreasing:
    columns: countdown_timer
    brief: "Timer must strictly decrease"

# Allow stationary values
- col_vals_decreasing:
    columns: priority_score
    allow_stationary: true
    brief: "Priority scores should decrease (ties allowed)"

# With tolerance for small increases
- col_vals_decreasing:
    columns: stock_level
    increasing_tol: 5
    brief: "Stock levels decrease (small restocks allowed)"
```
### Row-based Validations

`rows_distinct`: are row data distinct?
@@ -468,6 +617,66 @@ Template variables available for action strings:
      brief: "Expected column count"        # OPTIONAL: Step description
```

`tbl_match`: does the table match a comparison table?

```yaml
- tbl_match:
    tbl_compare:                            # REQUIRED: Comparison table
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |                                  # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                             # OPTIONAL: Step-level thresholds
      warning: 0.0
    actions:                                # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Table structure matches"        # OPTIONAL: Step description
```

This validation performs a comprehensive comparison between the target table and a comparison table, using progressively stricter checks:

1. **Column count match**: both tables have the same number of columns
2. **Row count match**: both tables have the same number of rows
3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order)
4. **Schema match (order)**: columns are in the correct order (case-insensitive names)
5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order)
6. **Data match**: values in corresponding cells are identical

The validation fails at the first check that doesn't pass, making it easy to diagnose mismatches. It operates over a single test unit (pass/fail for a complete table match).
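The progressive checks can be sketched over simple `{column_name: values}` dicts (a simplified illustration that relies on Python's insertion-ordered dicts and omits the dtype comparison in the loose check; `tbl_match_sketch` is a hypothetical name, not the library's code):

```python
def tbl_match_sketch(target: dict, compare: dict) -> str:
    """Apply progressively stricter checks; return the first failure (or 'match')."""
    # 1. Column count
    if len(target) != len(compare):
        return "column count mismatch"
    # 2. Row count (length of the first column; empty tables have 0 rows)
    if len(next(iter(target.values()), [])) != len(next(iter(compare.values()), [])):
        return "row count mismatch"
    # 3. Loose schema: same names, case-insensitive, any order
    if {c.lower() for c in target} != {c.lower() for c in compare}:
        return "schema mismatch (loose)"
    # 4. Column order, case-insensitive names
    if [c.lower() for c in target] != [c.lower() for c in compare]:
        return "schema mismatch (order)"
    # 5. Exact, case-sensitive names in order
    if list(target) != list(compare):
        return "schema mismatch (exact)"
    # 6. Cell-by-cell data comparison
    if list(target.values()) != list(compare.values()):
        return "data mismatch"
    return "match"
```

Because each check subsumes the previous one, the returned label points directly at the weakest property that failed.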
**Cross-backend validation**: `tbl_match()` supports automatic backend coercion when comparing tables from different backends (e.g., Polars vs. Pandas, DuckDB vs. SQLite). The comparison table is automatically converted to match the target table's backend.

Examples:

```yaml
# Compare against reference dataset
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("expected_output", tbl_type="polars")
    brief: "Output matches expected results"

# Compare against CSV file
- tbl_match:
    tbl_compare:
      python: |
        pl.read_csv("reference_data.csv")
    brief: "Matches reference CSV"

# Compare with preprocessing on target table only
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |
      lambda df: df.select(["id", "name", "value"])
    brief: "Selected columns match reference"
```
### Special Validation Methods

`conjointly`: do row data pass multiple validations jointly?
@@ -514,6 +723,121 @@ For Pandas DataFrames (when using `df_library: pandas`):
      expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"
```

### AI-Powered Validation

`prompt`: validate rows using AI/LLM-powered analysis

```yaml
- prompt:
    prompt: "Values should be positive and realistic"  # REQUIRED: Natural language criteria
    model: "anthropic:claude-sonnet-4"      # REQUIRED: Model identifier
    columns_subset: [column1, column2]      # OPTIONAL: Columns to validate
    batch_size: 1000                        # OPTIONAL: Rows per batch (default: 1000)
    max_concurrent: 3                       # OPTIONAL: Concurrent API requests (default: 3)
    pre: |                                  # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                             # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "AI validation"                  # OPTIONAL: Step description
```

This validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Each row becomes a test unit that either passes or fails the validation criteria, producing binary True/False results that integrate with standard Pointblank reporting.

**Supported models:**

- **Anthropic**: `"anthropic:claude-sonnet-4"`, `"anthropic:claude-opus-4"`
- **OpenAI**: `"openai:gpt-4"`, `"openai:gpt-4-turbo"`, `"openai:gpt-3.5-turbo"`
- **Ollama**: `"ollama:<model-name>"` (e.g., `"ollama:llama3"`)
- **Bedrock**: `"bedrock:<model-name>"`

**Authentication**: API keys are automatically loaded from environment variables or `.env` files:

- **OpenAI**: Set the `OPENAI_API_KEY` environment variable or add it to a `.env` file
- **Anthropic**: Set the `ANTHROPIC_API_KEY` environment variable or add it to a `.env` file
- **Ollama**: No API key required (runs locally)
- **Bedrock**: Configure AWS credentials through standard AWS methods

Example `.env` file:

```plaintext
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
```

**Performance optimization**: The validation process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This dramatically reduces API costs and processing time for datasets with repetitive patterns.
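The memoization idea can be sketched as follows (an illustration of the technique only, assuming a per-row `validate` callable standing in for the LLM call; `validate_with_memo` is a hypothetical name, not the library's code):

```python
def validate_with_memo(rows, validate):
    """Validate rows (tuples of selected column values), calling
    `validate` only once per distinct row signature."""
    cache = {}
    results = []
    for row in rows:
        signature = tuple(row)   # identical rows share one signature
        if signature not in cache:
            # Only the first occurrence triggers a (costly) validation call
            cache[signature] = validate(signature)
        results.append(cache[signature])
    return results
```

With many duplicate rows, the number of expensive calls collapses to the number of distinct signatures rather than the number of rows.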
Examples:

```yaml
# Basic AI validation
- prompt:
    prompt: "Email addresses should look realistic and professional"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [email]

# Complex semantic validation
- prompt:
    prompt: "Product descriptions should mention the product category and include at least one benefit"
    model: "openai:gpt-4"
    columns_subset: [product_name, description, category]
    batch_size: 500
    max_concurrent: 5

# Sentiment analysis
- prompt:
    prompt: "Customer feedback should express positive sentiment"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [feedback_text, rating]

# Context-dependent validation
- prompt:
    prompt: "For high-value transactions (amount > 1000), a detailed justification should be provided"
    model: "openai:gpt-4"
    columns_subset: [amount, justification, approver]
    thresholds:
      warning: 0.05
      error: 0.15

# Local model with Ollama
- prompt:
    prompt: "Transaction descriptions should be clear and professional"
    model: "ollama:llama3"
    columns_subset: [description]
```

**Best practices for AI validation:**

- Be specific and clear in your prompt criteria
- Include only the necessary columns in `columns_subset` to reduce API costs
- Start with a smaller `batch_size` for testing; increase it for production
- Adjust `max_concurrent` based on API rate limits
- Use thresholds appropriate for probabilistic validation results
- Consider cost implications for large datasets
- Test prompts on sample data before full deployment

**When to use AI validation:**

- Semantic checks (e.g., "does the description match the category?")
- Context-dependent validation (e.g., "is the justification appropriate for the amount?")
- Subjective quality assessment (e.g., "is the text professional?")
- Pattern recognition that's hard to express programmatically
- Natural language understanding tasks

**When NOT to use AI validation:**

- Simple numeric comparisons (use `col_vals_gt`, `col_vals_lt`, etc.)
- Exact pattern matching (use `col_vals_regex`)
- Schema validation (use `col_schema_match`)
- Performance-critical validations with large datasets
- When deterministic results are required
## Column Selection Patterns

All validation methods that accept a `columns` parameter support these selection patterns: