@@ -26,6 +26,7 @@ to handle diverse data quality requirements. These are grouped into three main c
 1. Column Value Validations
 2. Row-based Validations
 3. Table Structure Validations
+4. AI-Powered Validations
 
 Within each of these categories, we'll walk through several examples showing how each validation
 method creates steps in your validation plan.
@@ -105,8 +106,11 @@ validating values against predefined sets
 - **Null value checks** (`~~Validate.col_vals_null()`, `~~Validate.col_vals_not_null()`) for testing
   presence or absence of null values
 
-- **Pattern matching checks** (`~~Validate.col_vals_regex()`) for validating text patterns with
-  regular expressions
+- **Pattern matching checks** (`~~Validate.col_vals_regex()`, `~~Validate.col_vals_within_spec()`)
+  for validating text patterns with regular expressions or against standard specifications
+
+- **Trending value checks** (`~~Validate.col_vals_increasing()`, `~~Validate.col_vals_decreasing()`)
+  for verifying that values increase or decrease as you move down the rows
 
 - **Custom expression checks** (`~~Validate.col_vals_expr()`) for complex validations using custom
   expressions
@@ -185,6 +189,62 @@ each checking text values in a column:
 )
 ```
 
+### Checking Strings Against Specifications
+
+The `~~Validate.col_vals_within_spec()` method validates column values against common data
+specifications like email addresses, URLs, postal codes, credit card numbers, ISBNs, VINs, and
+IBANs. This is particularly useful when you need to validate that text data conforms to standard
+formats:
+
+```{python}
+import polars as pl
+
+# Create a sample table with various data types
+sample_data = pl.DataFrame({
+    "isbn": ["978-0-306-40615-7", "0-306-40615-2", "invalid"],
+    "email": ["[email protected]", "[email protected]", "not-an-email"],
+    "zip": ["12345", "90210", "invalid"]
+})
+
+(
+    pb.Validate(data=sample_data)
+    .col_vals_within_spec(columns="isbn", spec="isbn")
+    .col_vals_within_spec(columns="email", spec="email")
+    .col_vals_within_spec(columns="zip", spec="postal_code[US]")
+    .interrogate()
+)
+```
+
+### Checking for Trending Values
+
+The `~~Validate.col_vals_increasing()` and `~~Validate.col_vals_decreasing()` validation methods
+check whether column values are increasing or decreasing as you move down the rows. These are useful
+for validating time series data, sequential identifiers, or any data where you expect monotonic
+trends:
+
+```{python}
+import polars as pl
+
+# Create a sample table with increasing and decreasing values
+trend_data = pl.DataFrame({
+    "id": [1, 2, 3, 4, 5],
+    "temperature": [20, 22, 25, 28, 30],
+    "countdown": [100, 80, 60, 40, 20]
+})
+
+(
+    pb.Validate(data=trend_data)
+    .col_vals_increasing(columns="id")
+    .col_vals_increasing(columns="temperature")
+    .col_vals_decreasing(columns="countdown")
+    .interrogate()
+)
+```
+
+The `allow_stationary=` parameter lets you control whether consecutive identical values should pass
+validation. By default, stationary values (e.g., `[1, 2, 2, 3]`) will fail the increasing check,
+but setting `allow_stationary=True` will allow them to pass.
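+
+For a quick illustration, here is a minimal sketch of that behavior (the `plateau_data` table and
+`reading` column are illustrative names, not part of the examples above): the repeated `2` would
+fail a plain increasing check but passes once `allow_stationary=True` is set.
+
+```{python}
+import polars as pl
+
+# A column with a stationary stretch: the value 2 appears twice in a row
+plateau_data = pl.DataFrame({"reading": [1, 2, 2, 3]})
+
+(
+    pb.Validate(data=plateau_data)
+    # Without allow_stationary=True, the repeated 2 would count as a failing test unit
+    .col_vals_increasing(columns="reading", allow_stationary=True)
+    .interrogate()
+)
+```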
+
 ### Handling Missing Values with `na_pass=`
 
 When validating columns containing Null/None/NA values, you can control how these missing values are
@@ -269,6 +329,7 @@ These structural checks form a foundation for more detailed data quality assessm
 - `~~Validate.col_schema_match()`: ensures table matches a defined schema
 - `~~Validate.col_count_match()`: confirms the table has the expected number of columns
 - `~~Validate.row_count_match()`: verifies the table has the expected number of rows
+- `~~Validate.tbl_match()`: validates that the target table matches a comparison table
 
 These structural validations provide essential checks on the fundamental organization of your data
 tables, ensuring they have the expected dimensions and components needed for reliable data analysis.
@@ -347,6 +408,36 @@ These parameters all default to `True`, providing strict schema validation. Sett
 relaxes the validation requirements, making the checks more flexible when exact matching isn't
 necessary or practical for your use case.
 
+### Comparing Tables with `tbl_match()`
+
+The `~~Validate.tbl_match()` validation method provides a comprehensive way to verify that two
+tables are identical. It performs a progressive series of checks, from least to most stringent:
+
+1. Column count match
+2. Row count match
+3. Schema match (loose - case-insensitive, any order)
+4. Schema match (order - columns in correct order)
+5. Schema match (exact - case-sensitive, correct order)
+6. Data match (cell-by-cell comparison)
+
+This progressive approach helps identify exactly where tables differ. Here's an example comparing
+the `small_table` dataset with itself:
+
+```{python}
+(
+    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
+    .tbl_match(tbl_compare=pb.load_dataset(dataset="small_table", tbl_type="polars"))
+    .interrogate()
+)
+```
+
+This validation method is especially useful for:
+
+- Verifying that data transformations preserve expected properties
+- Comparing production data against a golden dataset
+- Ensuring data consistency across different environments
+- Validating that imported data matches source data
+
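+To see the progressive checks surface a difference, here is a small sketch that compares
+`small_table` against a truncated copy of itself (the `head()` call is just an illustrative way to
+produce a table with fewer rows); the comparison fails once the row counts diverge.
+
+```{python}
+# Compare the full table against a copy containing only the first five rows
+full_tbl = pb.load_dataset(dataset="small_table", tbl_type="polars")
+truncated_tbl = full_tbl.head(5)
+
+(
+    pb.Validate(data=full_tbl)
+    .tbl_match(tbl_compare=truncated_tbl)
+    .interrogate()
+)
+```
+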
 ### Checking Counts of Rows and Columns
 
 Row and column count validations check the number of rows and columns in a table.
@@ -376,10 +467,65 @@ matches a specified count.
 Expectations on column and row counts can be useful in certain situations and they align nicely with
 schema checks.
 
+## 4. AI-Powered Validations
+
+AI-powered validations use Large Language Models (LLMs) to validate data based on natural language
+criteria. This opens up new possibilities for complex validation rules that are difficult to express
+with traditional programmatic methods.
+
+### Validating with Natural Language Prompts
+
+The `~~Validate.prompt()` validation method allows you to describe validation criteria in plain
+language. The LLM interprets your prompt and evaluates each row, producing pass/fail results just
+like other Pointblank validation methods.
+
+This is particularly useful for:
+
+- Semantic checks (e.g., "descriptions should mention a product name")
+- Context-dependent validation (e.g., "prices should be reasonable for the product category")
+- Subjective quality assessments (e.g., "comments should be professional and constructive")
+- Complex rules that would require extensive regex patterns or custom functions
+
+Here's a simple example that validates whether text descriptions contain specific information:
+
+```{python}
+#| eval: false
+import polars as pl
+
+# Create sample data with product descriptions
+products = pl.DataFrame({
+    "product": ["Widget A", "Gadget B", "Tool C"],
+    "description": [
+        "High-quality widget made in USA",
+        "Innovative gadget with warranty",
+        "Professional tool"
+    ],
+    "price": [29.99, 49.99, 19.99]
+})
+
+# Validate that descriptions mention quality or features
+(
+    pb.Validate(data=products)
+    .prompt(
+        prompt="Each description should mention either quality, features, or warranty",
+        columns_subset=["description"],
+        model="anthropic:claude-sonnet-4-5"
+    )
+    .interrogate()
+)
+```
+
+The `columns_subset=` parameter lets you specify which columns to include in the validation,
+improving performance and reducing API costs by only sending relevant data to the LLM.
+
+**Note:** To use `~~Validate.prompt()`, you need to have the appropriate API credentials configured
+for your chosen LLM provider (Anthropic, OpenAI, Ollama, or AWS Bedrock).
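+
+As a minimal sketch of one common setup (the `ANTHROPIC_API_KEY` variable name follows the Anthropic
+convention; other providers read their own variable names, so check your provider's documentation),
+the key can be supplied through an environment variable before the validation plan is run:
+
+```{python}
+#| eval: false
+import os
+
+# Set the provider API key for the current process; in practice this is usually
+# configured in the shell environment or a secrets manager rather than in code.
+os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
+```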
+
 ## Conclusion
 
 In this article, we've explored the various types of validation methods that Pointblank offers for
 ensuring data quality. These methods provide a framework for validating column values, checking row
-properties, and verifying table structures. By combining these validation methods into comprehensive
-plans, you can systematically test your data against business rules and quality expectations. And
-this all helps to ensure your data remains reliable and trustworthy.
+properties, verifying table structures, and even using AI for complex semantic validations. By
+combining these validation methods into comprehensive plans, you can systematically test your data
+against business rules and quality expectations. And this all helps to ensure your data remains
+reliable and trustworthy.