@@ -355,6 +355,57 @@ Template variables available for action strings:
    brief: "Values match pattern"  # OPTIONAL: Step description
```

`col_vals_within_spec`: do column data conform to a specification (email, URL, postal codes, etc.)?

```yaml
- col_vals_within_spec:
    columns: [column_name]      # REQUIRED: Column(s) to validate
    spec: "email"               # REQUIRED: Specification type
    na_pass: false              # OPTIONAL: Pass NULL values
    pre: |                      # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                 # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                    # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values match spec"  # OPTIONAL: Step description
```

Available specification types:

- `"email"` - Email addresses
- `"url"` - Internet URLs
- `"phone"` - Phone numbers
- `"ipv4"` - IPv4 addresses
- `"ipv6"` - IPv6 addresses
- `"mac"` - MAC addresses
- `"isbn"` - International Standard Book Numbers (10- or 13-digit)
- `"vin"` - Vehicle Identification Numbers
- `"credit_card"` - Credit card numbers (validated with the Luhn algorithm; see the sketch after this list)
- `"swift"` - Business Identifier Codes (SWIFT-BIC)
- `"postal_code[<country_code>]"` - Postal codes for specific countries (e.g., `"postal_code[US]"`, `"postal_code[CA]"`)
- `"zip"` - Alias for US ZIP codes (`"postal_code[US]"`)
- `"iban[<country_code>]"` - International Bank Account Numbers (e.g., `"iban[DE]"`, `"iban[FR]"`)

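For the `"credit_card"` spec, a generic Luhn checksum looks like the Python sketch below. This illustrates the algorithm named above; it is not Pointblank's own implementation:

```python
# Generic Luhn checksum: double every second digit from the right, subtract 9
# from any doubled digit above 9, and require the total to be divisible by 10.
# Illustrative sketch only; not Pointblank's implementation.
def luhn_valid(number: str) -> bool:
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4539 1488 0343 6467"))  # True (a commonly cited test number)
```
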
Examples:

```yaml
# Email validation
- col_vals_within_spec:
    columns: user_email
    spec: "email"

# US postal codes
- col_vals_within_spec:
    columns: zip_code
    spec: "postal_code[US]"

# German IBAN
- col_vals_within_spec:
    columns: account_number
    spec: "iban[DE]"
```

### Custom Expression Methods

`col_vals_expr`: do column data agree with a predicate expression?
@@ -375,6 +426,104 @@ Template variables available for action strings:
    brief: "Custom validation rule"  # OPTIONAL: Step description
```

### Trend Validation Methods

`col_vals_increasing`: are column data increasing row-by-row?

```yaml
- col_vals_increasing:
    columns: [column_name]         # REQUIRED: Column(s) to validate
    allow_stationary: false        # OPTIONAL: Allow consecutive equal values (default: false)
    decreasing_tol: 0.5            # OPTIONAL: Tolerance for negative movement (default: null)
    na_pass: false                 # OPTIONAL: Pass NULL values
    pre: |                         # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                    # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                       # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must increase"  # OPTIONAL: Step description
```

This validation checks whether values in a column increase as you move down the rows. It is useful
for validating time-series data, sequence numbers, or any other monotonically increasing values.

Parameters:

- `allow_stationary`: If `true`, consecutive equal values (stationary phases) are allowed. For
  example, `[1, 2, 2, 3]` passes when `true` but fails at the third value when `false`.
- `decreasing_tol`: Absolute tolerance for negative movement. Setting this to `0.5` means values may
  decrease by up to 0.5 units and still pass. Setting any tolerance also sets `allow_stationary` to
  `true` (see the sketch below).

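The interplay of these two parameters can be pictured with a small Python sketch; the function name
and signature here are illustrative, not Pointblank's API:

```python
# Per-pair pass/fail results for an increasing check, mirroring the parameter
# semantics described above. Illustrative only.
def increasing_checks(values, allow_stationary=False, decreasing_tol=None):
    # Setting a tolerance implies that stationary (equal) values are acceptable
    if decreasing_tol is not None:
        allow_stationary = True
    results = []
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if diff > 0:
            results.append(True)              # strictly increasing: always passes
        elif diff == 0:
            results.append(allow_stationary)  # ties pass only if allowed
        else:
            # Negative movement passes only within the stated tolerance
            results.append(decreasing_tol is not None and -diff <= decreasing_tol)
    return results

print(increasing_checks([1, 2, 2, 3]))                         # [True, False, True]
print(increasing_checks([1, 2, 2, 3], allow_stationary=True))  # [True, True, True]
```

`col_vals_decreasing` (below) mirrors this logic with the sign flipped and `increasing_tol` in place
of `decreasing_tol`.
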
Examples:

```yaml
# Strict increasing validation
- col_vals_increasing:
    columns: timestamp_seconds
    brief: "Timestamps must strictly increase"

# Allow stationary values
- col_vals_increasing:
    columns: version_number
    allow_stationary: true
    brief: "Version numbers should increase (ties allowed)"

# With tolerance for small decreases
- col_vals_increasing:
    columns: temperature
    decreasing_tol: 0.1
    brief: "Temperature trend (small drops allowed)"
```

`col_vals_decreasing`: are column data decreasing row-by-row?

```yaml
- col_vals_decreasing:
    columns: [column_name]         # REQUIRED: Column(s) to validate
    allow_stationary: false        # OPTIONAL: Allow consecutive equal values (default: false)
    increasing_tol: 0.5            # OPTIONAL: Tolerance for positive movement (default: null)
    na_pass: false                 # OPTIONAL: Pass NULL values
    pre: |                         # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                    # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                       # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Values must decrease"  # OPTIONAL: Step description
```

This validation checks whether values in a column decrease as you move down the rows. It is useful
for countdown timers, inventory depletion, or any other monotonically decreasing values.

Parameters:

- `allow_stationary`: If `true`, consecutive equal values (stationary phases) are allowed. For
  example, `[10, 8, 8, 5]` passes when `true` but fails at the third value when `false`.
- `increasing_tol`: Absolute tolerance for positive movement. Setting this to `0.5` means values may
  increase by up to 0.5 units and still pass. Setting any tolerance also sets `allow_stationary` to
  `true`.

Examples:

```yaml
# Strict decreasing validation
- col_vals_decreasing:
    columns: countdown_timer
    brief: "Timer must strictly decrease"

# Allow stationary values
- col_vals_decreasing:
    columns: priority_score
    allow_stationary: true
    brief: "Priority scores should decrease (ties allowed)"

# With tolerance for small increases
- col_vals_decreasing:
    columns: stock_level
    increasing_tol: 5
    brief: "Stock levels decrease (small restocks allowed)"
```

## Row-based Validations

`rows_distinct`: are row data distinct?
@@ -468,6 +617,66 @@ Template variables available for action strings:
    brief: "Expected column count"  # OPTIONAL: Step description
```

`tbl_match`: does the table match a comparison table?

```yaml
- tbl_match:
    tbl_compare:                      # REQUIRED: Comparison table
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |                            # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                       # OPTIONAL: Step-level thresholds
      warning: 0.0
    actions:                          # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "Table structure matches"  # OPTIONAL: Step description
```

This validation performs a comprehensive comparison between the target table and a comparison table,
using progressively stricter checks:

1. **Column count match**: both tables have the same number of columns
2. **Row count match**: both tables have the same number of rows
3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order)
4. **Schema match (order)**: columns are in the correct order (case-insensitive names)
5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order)
6. **Data match**: values in corresponding cells are identical

The validation fails at the first check that doesn't pass, which makes mismatches easy to diagnose.
It operates over a single test unit (pass/fail for the complete table match).

**Cross-backend validation**: `tbl_match()` supports automatic backend coercion when comparing tables
from different backends (e.g., Polars vs. Pandas, DuckDB vs. SQLite). The comparison table is
automatically converted to match the target table's backend. The staged logic is sketched below.

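As a mental model, the staged comparison could be sketched in Python roughly as follows
(hypothetical, assuming both tables are Polars DataFrames; not Pointblank's actual internals):

```python
import polars as pl

# Hypothetical sketch of the six progressively stricter checks described above.
def staged_table_match(target: pl.DataFrame, compare: pl.DataFrame) -> bool:
    if target.width != compare.width:    # 1. column count
        return False
    if target.height != compare.height:  # 2. row count
        return False

    def loose(df: pl.DataFrame) -> dict:
        # Case-insensitive name -> dtype mapping, ignoring column order
        return {name.lower(): dtype for name, dtype in df.schema.items()}

    if loose(target) != loose(compare):  # 3. schema, loose (any order)
        return False
    if [c.lower() for c in target.columns] != [c.lower() for c in compare.columns]:
        return False                     # 4. schema, correct order
    if target.columns != compare.columns:
        return False                     # 5. schema, exact (case-sensitive)
    return target.equals(compare)        # 6. cell-by-cell data match
```

For a cross-backend comparison, a coercion step such as `pl.from_pandas(compare)` would run before
the first check.
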
Examples:

```yaml
# Compare against reference dataset
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("expected_output", tbl_type="polars")
    brief: "Output matches expected results"

# Compare against CSV file
- tbl_match:
    tbl_compare:
      python: |
        pl.read_csv("reference_data.csv")
    brief: "Matches reference CSV"

# Compare with preprocessing on target table only
- tbl_match:
    tbl_compare:
      python: |
        pb.load_dataset("reference_table", tbl_type="polars")
    pre: |
      lambda df: df.select(["id", "name", "value"])
    brief: "Selected columns match reference"
```

## Special Validation Methods

`conjointly`: do multiple validations hold jointly?
@@ -514,6 +723,121 @@ For Pandas DataFrames (when using `df_library: pandas`):
    expr: "lambda df: df.assign(is_valid=df['a'] + df['d'] > 0)"
```

## AI-Powered Validation

`prompt`: validate rows using AI/LLM-powered analysis

```yaml
- prompt:
    prompt: "Values should be positive and realistic"  # REQUIRED: Natural language criteria
    model: "anthropic:claude-sonnet-4"                 # REQUIRED: Model identifier
    columns_subset: [column1, column2]                 # OPTIONAL: Columns to validate
    batch_size: 1000                                   # OPTIONAL: Rows per batch (default: 1000)
    max_concurrent: 3                                  # OPTIONAL: Concurrent API requests (default: 3)
    pre: |                                             # OPTIONAL: Data preprocessing
      lambda df: df.filter(condition)
    thresholds:                                        # OPTIONAL: Step-level thresholds
      warning: 0.1
    actions:                                           # OPTIONAL: Step-level actions
      warning: "Custom message"
    brief: "AI validation"                             # OPTIONAL: Step description
```

This validation method uses Large Language Models (LLMs) to validate rows of data against natural
language criteria. Each row becomes a test unit that either passes or fails the criteria, producing
binary True/False results that integrate with standard Pointblank reporting.

**Supported models:**

- **Anthropic**: `"anthropic:claude-sonnet-4"`, `"anthropic:claude-opus-4"`
- **OpenAI**: `"openai:gpt-4"`, `"openai:gpt-4-turbo"`, `"openai:gpt-3.5-turbo"`
- **Ollama**: `"ollama:<model-name>"` (e.g., `"ollama:llama3"`)
- **Bedrock**: `"bedrock:<model-name>"`

**Authentication**: API keys are loaded automatically from environment variables or `.env` files:

- **OpenAI**: Set the `OPENAI_API_KEY` environment variable or add it to a `.env` file
- **Anthropic**: Set the `ANTHROPIC_API_KEY` environment variable or add it to a `.env` file
- **Ollama**: No API key required (runs locally)
- **Bedrock**: Configure AWS credentials through the standard AWS methods

Example `.env` file:

```plaintext
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
```
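
Pointblank picks these values up for you; purely for reference, a manual equivalent using the
python-dotenv package (an assumption about your setup, not a required step) would be:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads KEY=value pairs from ./.env into the process environment
print("key present:", os.environ.get("ANTHROPIC_API_KEY") is not None)
```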

**Performance optimization**: The validation process uses row-signature memoization to avoid
redundant LLM calls. When multiple rows have identical values in the selected columns, only one
representative row is validated and its result is applied to all matching rows. This dramatically
reduces API costs and processing time for datasets with repetitive patterns (see the sketch below).

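A hypothetical Python sketch of that memoization scheme (function and parameter names are invented
for illustration; this is not Pointblank's actual code):

```python
# Cache LLM verdicts by row "signature": the tuple of values in the selected
# columns. Identical rows trigger only one LLM call. Illustrative only.
def validate_rows(rows, columns_subset, ask_llm):
    cache = {}
    results = []
    for row in rows:  # each row is a dict-like record
        signature = tuple(row[col] for col in columns_subset)
        if signature not in cache:
            cache[signature] = ask_llm(signature)  # one call per unique signature
        results.append(cache[signature])
    return results
```
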
Examples:

```yaml
# Basic AI validation
- prompt:
    prompt: "Email addresses should look realistic and professional"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [email]

# Complex semantic validation
- prompt:
    prompt: "Product descriptions should mention the product category and include at least one benefit"
    model: "openai:gpt-4"
    columns_subset: [product_name, description, category]
    batch_size: 500
    max_concurrent: 5

# Sentiment analysis
- prompt:
    prompt: "Customer feedback should express positive sentiment"
    model: "anthropic:claude-sonnet-4"
    columns_subset: [feedback_text, rating]

# Context-dependent validation
- prompt:
    prompt: "For high-value transactions (amount > 1000), a detailed justification should be provided"
    model: "openai:gpt-4"
    columns_subset: [amount, justification, approver]
    thresholds:
      warning: 0.05
      error: 0.15

# Local model with Ollama
- prompt:
    prompt: "Transaction descriptions should be clear and professional"
    model: "ollama:llama3"
    columns_subset: [description]
```

**Best practices for AI validation:**

- Be specific and clear in your prompt criteria
- Include only the necessary columns in `columns_subset` to reduce API costs
- Start with a smaller `batch_size` for testing; increase it for production
- Adjust `max_concurrent` to match your API rate limits
- Use thresholds appropriate for probabilistic validation results
- Consider the cost implications for large datasets
- Test prompts on sample data before full deployment

**When to use AI validation:**

- Semantic checks (e.g., "does the description match the category?")
- Context-dependent validation (e.g., "is the justification appropriate for the amount?")
- Subjective quality assessment (e.g., "is the text professional?")
- Pattern recognition that's hard to express programmatically
- Natural language understanding tasks

**When NOT to use AI validation:**

- Simple numeric comparisons (use `col_vals_gt`, `col_vals_lt`, etc.)
- Exact pattern matching (use `col_vals_regex`)
- Schema validation (use `col_schema_match`)
- Performance-critical validations over large datasets
- When deterministic results are required

# Column Selection Patterns

All validation methods that accept a `columns` parameter support these selection patterns: