Commit 353fb3f

Merge pull request #311 from posit-dev/docs-update-validation-methods
docs: update documentation with newer validation methods
2 parents e9aef07 + 6f0f376 commit 353fb3f

File tree: 5 files changed (+247 additions, −5 deletions)


docs/llms-full.txt

Lines changed: 32 additions & 0 deletions
@@ -6907,9 +6907,12 @@ col(exprs: 'str | ColumnSelector | ColumnSelectorNarwhals') -> 'Column | ColumnL
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 If specifying a single column with certainty (you have the exact name), `col()` is not necessary
@@ -7191,9 +7194,12 @@ starts_with(text: 'str', case_sensitive: 'bool' = False) -> 'StartsWith'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `starts_with()` selector function doesn't need to be used in isolation. Read the next
@@ -7341,9 +7347,12 @@ ends_with(text: 'str', case_sensitive: 'bool' = False) -> 'EndsWith'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `ends_with()` selector function doesn't need to be used in isolation. Read the next section
@@ -7492,9 +7501,12 @@ contains(text: 'str', case_sensitive: 'bool' = False) -> 'Contains'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `contains()` selector function doesn't need to be used in isolation. Read the next section
@@ -7643,9 +7655,12 @@ matches(pattern: 'str', case_sensitive: 'bool' = False) -> 'Matches'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `matches()` selector function doesn't need to be used in isolation. Read the next section
@@ -7776,9 +7791,12 @@ everything() -> 'Everything'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `everything()` selector function doesn't need to be used in isolation. Read the next section
@@ -7919,9 +7937,12 @@ first_n(n: 'int', offset: 'int' = 0) -> 'FirstN'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `first_n()` selector function doesn't need to be used in isolation. Read the next section
@@ -8066,9 +8087,12 @@ last_n(n: 'int', offset: 'int' = 0) -> 'LastN'
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_exists()`](`pointblank.Validate.col_exists`)

 The `last_n()` selector function doesn't need to be used in isolation. Read the next section for
@@ -8699,11 +8723,15 @@ get_step_report(self, i: 'int', columns_subset: 'str | list[str] | Column | None
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`)
 - [`conjointly()`](`pointblank.Validate.conjointly`)
+- [`prompt()`](`pointblank.Validate.prompt`)
 - [`rows_complete()`](`pointblank.Validate.rows_complete`)

 The [`rows_distinct()`](`pointblank.Validate.rows_distinct`) validation step will produce a
@@ -9040,11 +9068,15 @@ get_data_extracts(self, i: 'int | list[int] | None' = None, frame: 'bool' = Fals
 - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
 - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
 - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
+- [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
+- [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
+- [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
 - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`)
 - [`conjointly()`](`pointblank.Validate.conjointly`)
+- [`prompt()`](`pointblank.Validate.prompt`)

 An extracted row for these validation methods means that a test unit failed for that row in
 the validation step.

docs/user-guide/validation-methods.qmd

Lines changed: 151 additions & 5 deletions
@@ -26,6 +26,7 @@ to handle diverse data quality requirements. These are grouped into three main c
 1. Column Value Validations
 2. Row-based Validations
 3. Table Structure Validations
+4. AI-Powered Validations

 Within each of these categories, we'll walk through several examples showing how each validation
 method creates steps in your validation plan.
@@ -105,8 +106,11 @@ validating values against predefined sets
 - **Null value checks** (`~~Validate.col_vals_null()`, `~~Validate.col_vals_not_null()`) for testing
   presence or absence of null values

-- **Pattern matching checks** (`~~Validate.col_vals_regex()`) for validating text patterns with
-  regular expressions
+- **Pattern matching checks** (`~~Validate.col_vals_regex()`, `~~Validate.col_vals_within_spec()`)
+  for validating text patterns with regular expressions or against standard specifications
+
+- **Trending value checks** (`~~Validate.col_vals_increasing()`, `~~Validate.col_vals_decreasing()`)
+  for verifying that values increase or decrease as you move down the rows

 - **Custom expression checks** (`~~Validate.col_vals_expr()`) for complex validations using custom
   expressions
@@ -185,6 +189,62 @@ each checking text values in a column:
 )
 ```

+### Checking Strings Against Specifications
+
+The `~~Validate.col_vals_within_spec()` method validates column values against common data
+specifications like email addresses, URLs, postal codes, credit card numbers, ISBNs, VINs, and
+IBANs. This is particularly useful when you need to validate that text data conforms to standard
+formats:
+
+```{python}
+import polars as pl
+
+# Create a sample table with various data types
+sample_data = pl.DataFrame({
+    "isbn": ["978-0-306-40615-7", "0-306-40615-2", "invalid"],
+    "email": ["[email protected]", "[email protected]", "not-an-email"],
+    "zip": ["12345", "90210", "invalid"]
+})
+
+(
+    pb.Validate(data=sample_data)
+    .col_vals_within_spec(columns="isbn", spec="isbn")
+    .col_vals_within_spec(columns="email", spec="email")
+    .col_vals_within_spec(columns="zip", spec="postal_code[US]")
+    .interrogate()
+)
+```
+
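To give a feel for what a specification check such as `spec="isbn"` involves, an ISBN-13 string is governed by a fixed checksum rule. The following is an illustrative plain-Python sketch of that one rule only (it does not handle ISBN-10 forms like `0-306-40615-2`), and it is not Pointblank's actual implementation:

```python
def isbn13_is_valid(isbn: str) -> bool:
    # ISBN-13 checksum: the 13 digits, weighted alternately by 1 and 3,
    # must sum to a multiple of 10. Hyphens and spaces are ignored.
    digits = [int(ch) for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0

print(isbn13_is_valid("978-0-306-40615-7"))  # the valid ISBN-13 from the sample data
print(isbn13_is_valid("invalid"))            # contains no digits, so it fails
```

A library-backed check covers many more formats and edge cases, but the principle is the same: each cell either satisfies the specification or counts as a failing test unit.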
+### Checking for Trending Values
+
+The `~~Validate.col_vals_increasing()` and `~~Validate.col_vals_decreasing()` validation methods
+check whether column values are increasing or decreasing as you move down the rows. These are useful
+for validating time series data, sequential identifiers, or any data where you expect monotonic
+trends:
+
+```{python}
+import polars as pl
+
+# Create a sample table with increasing and decreasing values
+trend_data = pl.DataFrame({
+    "id": [1, 2, 3, 4, 5],
+    "temperature": [20, 22, 25, 28, 30],
+    "countdown": [100, 80, 60, 40, 20]
+})
+
+(
+    pb.Validate(data=trend_data)
+    .col_vals_increasing(columns="id")
+    .col_vals_increasing(columns="temperature")
+    .col_vals_decreasing(columns="countdown")
+    .interrogate()
+)
+```
+
+The `allow_stationary=` parameter lets you control whether consecutive identical values should pass
+validation. By default, stationary values (e.g., `[1, 2, 2, 3]`) will fail the increasing check,
+but setting `allow_stationary=True` will allow them to pass.
+
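The pairwise logic behind `allow_stationary=` can be pictured with a small plain-Python sketch (illustrative only, not Pointblank's implementation): each value is compared with its predecessor, and equal neighbors pass only when stationary values are allowed.

```python
def values_increasing(values, allow_stationary=False):
    # Compare each value with its predecessor; a drop always fails, and an
    # equal pair fails unless stationary values are explicitly allowed.
    for prev, curr in zip(values, values[1:]):
        if curr < prev:
            return False
        if curr == prev and not allow_stationary:
            return False
    return True

print(values_increasing([1, 2, 2, 3]))                         # stationary pair fails
print(values_increasing([1, 2, 2, 3], allow_stationary=True))  # stationary pair passes
```

The decreasing check is the mirror image, with the `<` comparison flipped.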
 ### Handling Missing Values with `na_pass=`

 When validating columns containing Null/None/NA values, you can control how these missing values are
@@ -269,6 +329,7 @@ These structural checks form a foundation for more detailed data quality assessm
 - `~~Validate.col_schema_match()`: ensures table matches a defined schema
 - `~~Validate.col_count_match()`: confirms the table has the expected number of columns
 - `~~Validate.row_count_match()`: verifies the table has the expected number of rows
+- `~~Validate.tbl_match()`: validates that the target table matches a comparison table

 These structural validations provide essential checks on the fundamental organization of your data
 tables, ensuring they have the expected dimensions and components needed for reliable data analysis.
@@ -347,6 +408,36 @@ These parameters all default to `True`, providing strict schema validation. Sett
 relaxes the validation requirements, making the checks more flexible when exact matching isn't
 necessary or practical for your use case.

+### Comparing Tables with `tbl_match()`
+
+The `~~Validate.tbl_match()` validation method provides a comprehensive way to verify that two
+tables are identical. It performs a progressive series of checks, from least to most stringent:
+
+1. Column count match
+2. Row count match
+3. Schema match (loose - case-insensitive, any order)
+4. Schema match (order - columns in correct order)
+5. Schema match (exact - case-sensitive, correct order)
+6. Data match (cell-by-cell comparison)
+
+This progressive approach helps identify exactly where tables differ. Here's an example comparing
+the `small_table` dataset with itself:
+
+```{python}
+(
+    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
+    .tbl_match(tbl_compare=pb.load_dataset(dataset="small_table", tbl_type="polars"))
+    .interrogate()
+)
+```
+
+This validation method is especially useful for:
+
+- Verifying that data transformations preserve expected properties
+- Comparing production data against a golden dataset
+- Ensuring data consistency across different environments
+- Validating that imported data matches source data
+
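The six-stage progression above can be sketched in plain Python to show how each stage gates the next. Tables are modeled here as dicts of column lists, and the schema stages compare column names only (the real checks also consider dtypes); this is an illustrative sketch, not Pointblank's implementation:

```python
def tbl_match_checks(a, b):
    # Run the six stages from least to most stringent, recording each stage
    # that passes and stopping at the first failure. Tables are modeled as
    # dicts mapping column names to lists of values.
    passed = []
    if len(a) != len(b):
        return passed
    passed.append("column count")
    rows_a = len(next(iter(a.values()), []))
    rows_b = len(next(iter(b.values()), []))
    if rows_a != rows_b:
        return passed
    passed.append("row count")
    if sorted(c.lower() for c in a) != sorted(c.lower() for c in b):
        return passed
    passed.append("schema (loose)")
    if [c.lower() for c in a] != [c.lower() for c in b]:
        return passed
    passed.append("schema (order)")
    if list(a) != list(b):
        return passed
    passed.append("schema (exact)")
    if all(a[c] == b[c] for c in a):
        passed.append("data")
    return passed

t1 = {"x": [1, 2], "y": ["a", "b"]}
t2 = {"X": [1, 2], "Y": ["a", "b"]}  # same columns, different case
print(tbl_match_checks(t1, t1))  # every stage passes
print(tbl_match_checks(t1, t2))  # stops before the case-sensitive stage
```

Ordering the stages this way means the first failure already tells you the coarsest respect in which the two tables disagree.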
 ### Checking Counts of Row and Columns

 Row and column count validations check the number of rows and columns in a table.
@@ -376,10 +467,65 @@ matches a specified count.
 Expectations on column and row counts can be useful in certain situations and they align nicely with
 schema checks.

+## 4. AI-Powered Validations
+
+AI-powered validations use Large Language Models (LLMs) to validate data based on natural language
+criteria. This opens up new possibilities for complex validation rules that are difficult to express
+with traditional programmatic methods.
+
+### Validating with Natural Language Prompts
+
+The `~~Validate.prompt()` validation method allows you to describe validation criteria in plain
+language. The LLM interprets your prompt and evaluates each row, producing pass/fail results just
+like other Pointblank validation methods.
+
+This is particularly useful for:
+
+- Semantic checks (e.g., "descriptions should mention a product name")
+- Context-dependent validation (e.g., "prices should be reasonable for the product category")
+- Subjective quality assessments (e.g., "comments should be professional and constructive")
+- Complex rules that would require extensive regex patterns or custom functions
+
+Here's a simple example that validates whether text descriptions contain specific information:
+
+```{python}
+#| eval: false
+import polars as pl
+
+# Create sample data with product descriptions
+products = pl.DataFrame({
+    "product": ["Widget A", "Gadget B", "Tool C"],
+    "description": [
+        "High-quality widget made in USA",
+        "Innovative gadget with warranty",
+        "Professional tool"
+    ],
+    "price": [29.99, 49.99, 19.99]
+})
+
+# Validate that descriptions mention quality or features
+(
+    pb.Validate(data=products)
+    .prompt(
+        prompt="Each description should mention either quality, features, or warranty",
+        columns_subset=["description"],
+        model="anthropic:claude-sonnet-4-5"
+    )
+    .interrogate()
+)
+```
+
+The `columns_subset=` parameter lets you specify which columns to include in the validation,
+improving performance and reducing API costs by only sending relevant data to the LLM.
+
+**Note:** To use `~~Validate.prompt()`, you need to have the appropriate API credentials configured
+for your chosen LLM provider (Anthropic, OpenAI, Ollama, or AWS Bedrock).
+
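Conceptually, a prompt validation behaves like any other row-wise check: each row yields one pass/fail test unit. In the sketch below a simple keyword heuristic stands in for the LLM judgment, purely to illustrate that contract; the function name and logic are hypothetical and not part of Pointblank's API:

```python
def evaluate_rows(rows, required_terms):
    # One boolean per row, mirroring the "one test unit per row" contract
    # of a prompt-based validation. A keyword heuristic replaces the LLM
    # call here, for illustration only.
    results = []
    for row in rows:
        text = row["description"].lower()
        results.append(any(term in text for term in required_terms))
    return results

rows = [
    {"description": "High-quality widget made in USA"},
    {"description": "Innovative gadget with warranty"},
    {"description": "Professional tool"},
]
# Terms drawn from the prompt "quality, features, or warranty"
print(evaluate_rows(rows, ["quality", "feature", "warranty"]))  # third row fails
```

The real method delegates the per-row judgment to the model, but the downstream accounting (pass counts, fail counts, thresholds) is identical to every other validation step.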
 ## Conclusion

 In this article, we've explored the various types of validation methods that Pointblank offers for
 ensuring data quality. These methods provide a framework for validating column values, checking row
-properties, and verifying table structures. By combining these validation methods into comprehensive
-plans, you can systematically test your data against business rules and quality expectations. And
-this all helps to ensure your data remains reliable and trustworthy.
+properties, verifying table structures, and even using AI for complex semantic validations. By
+combining these validation methods into comprehensive plans, you can systematically test your data
+against business rules and quality expectations. And this all helps to ensure your data remains
+reliable and trustworthy.
