Commit 38f95c6

Merge pull request #345 from posit-dev/feat-data-freshness
feat: add the `data_freshness()` validation method
2 parents b7b9753 + 1598824 commit 38f95c6

7 files changed: +2396 -0 lines changed

docs/_quarto.yml

Lines changed: 1 addition & 0 deletions
@@ -194,6 +194,7 @@ quartodoc:
 - name: Validate.rows_complete
 - name: Validate.col_exists
 - name: Validate.col_pct_null
+- name: Validate.data_freshness
 - name: Validate.col_schema_match
 - name: Validate.row_count_match
 - name: Validate.col_count_match

docs/user-guide/validation-methods.qmd

Lines changed: 65 additions & 0 deletions
@@ -357,6 +357,7 @@ These structural checks form a foundation for more detailed data quality assessm
 - `~~Validate.col_count_match()`: confirms the table has the expected number of columns
 - `~~Validate.row_count_match()`: verifies the table has the expected number of rows
 - `~~Validate.tbl_match()`: validates that the target table matches a comparison table
+- `~~Validate.data_freshness()`: checks that data is recent and not stale

 These structural validations provide essential checks on the fundamental organization of your data
 tables, ensuring they have the expected dimensions and components needed for reliable data analysis.
@@ -494,6 +495,70 @@ matches a specified count.
 Expectations on column and row counts can be useful in certain situations and they align nicely with
 schema checks.

+### Validating Data Freshness
+
+Late or missing data is one of the most common (and costly) data quality issues in production
+systems. When data pipelines fail silently or experience delays, downstream analytics and ML models
+can produce stale or misleading results. The `~~Validate.data_freshness()` validation method helps
+catch these issues early by verifying that your data contains recent records.
+
+Data freshness validation works by checking a datetime column against a maximum allowed age. If the
+most recent timestamp in that column is older than the specified threshold, the validation fails.
+This simple check can prevent major downstream problems caused by stale data.
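
The comparison described above can be sketched in plain Python. This is only an illustration of the idea, not Pointblank's implementation: the helper name `is_fresh` is hypothetical, and treating an empty column as stale is an assumption made for the sketch.

```python
from datetime import datetime, timedelta


def is_fresh(timestamps, max_age, now=None):
    """Return True if the newest timestamp is no older than `max_age`.

    Hypothetical helper for illustration; not part of Pointblank.
    """
    if not timestamps:
        # Assumption: with no rows there is nothing recent, so report stale
        return False
    now = now or datetime.now()
    # Age of the most recent record relative to the comparison time
    age = now - max(timestamps)
    return age <= max_age


# Fixed comparison time so the example is reproducible
now = datetime(2025, 1, 10, 12, 0)
feed = [now - timedelta(hours=h) for h in (1, 6, 12, 18)]

print(is_fresh(feed, timedelta(days=2), now=now))      # newest record is 1 hour old
print(is_fresh(feed, timedelta(minutes=30), now=now))  # 1 hour exceeds 30 minutes
```

Only the newest timestamp matters here, so older historical rows in the same column do not affect the outcome.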
+
+Here's an example that validates data is no older than 2 days:
+
+```{python}
+import polars as pl
+from datetime import datetime, timedelta
+
+# Simulate a data feed that should be updated daily
+recent_data = pl.DataFrame({
+    "event": ["login", "purchase", "logout", "signup"],
+    "event_time": [
+        datetime.now() - timedelta(hours=1),
+        datetime.now() - timedelta(hours=6),
+        datetime.now() - timedelta(hours=12),
+        datetime.now() - timedelta(hours=18),
+    ],
+    "user_id": [101, 102, 103, 104]
+})
+
+(
+    pb.Validate(data=recent_data)
+    .data_freshness(column="event_time", max_age="2d")
+    .interrogate()
+)
+```
+
+The `max_age=` parameter accepts a flexible string format: `"30m"` for 30 minutes, `"6h"` for 6
+hours, `"2d"` for 2 days, or `"1w"` for 1 week. You can also combine units: `"1d 12h"` for 1.5 days.
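
To make the string format concrete, here is a minimal sketch of how such strings could map onto `timedelta` values. The function `parse_max_age` and its unit table are hypothetical and cover only the four suffixes listed above. The parser Pointblank actually uses may accept other forms.

```python
import re
from datetime import timedelta

# Unit suffixes taken from the examples in the text (assumed, not exhaustive)
_UNITS = {"m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_max_age(text):
    """Convert a string such as "1d 12h" into a timedelta (illustrative only)."""
    parts = text.split()
    if not parts:
        raise ValueError("empty duration string")
    total = timedelta()
    for part in parts:
        # Each component is an integer followed by a single unit letter
        match = re.fullmatch(r"(\d+)([mhdw])", part)
        if match is None:
            raise ValueError(f"unrecognized duration component: {part!r}")
        amount, unit = match.groups()
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total


print(parse_max_age("1d 12h") == timedelta(hours=36))
```

Rejecting unknown components with an error, rather than ignoring them, keeps a typo such as `"2 d"` from silently changing the threshold.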
+
+When validation succeeds, the report includes details about the data's age in the footer. When it
+fails, you'll see exactly how old the most recent data is and what threshold was exceeded. This
+context helps quickly diagnose whether you're dealing with a minor delay or a major pipeline
+failure.
+
+Data freshness validation is particularly valuable for:
+
+- monitoring ETL pipelines to catch failures before they cascade to reports and dashboards
+- validating data feeds to ensure third-party data sources are delivering as expected
+- including freshness checks in automated data quality tests as part of continuous integration
+- building alerting systems that trigger notifications when critical data becomes stale
+
+You might wonder why not just use `~~Validate.col_vals_gt()` with a datetime threshold. While that
+approach works, `~~Validate.data_freshness()` offers several advantages: the method name clearly
+communicates your intent, the `max_age=` string format (e.g., `"2d"`) is more readable than datetime
+arithmetic, it auto-generates meaningful validation briefs, the report footer shows helpful context
+about actual data age and thresholds, and timezone mismatches between your data and comparison time
+are handled gracefully with informative warnings.
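
One further difference may be worth keeping in mind. The plain-Python sketch below assumes that a `col_vals_gt()`-style check compares every row against the cutoff, while a freshness check looks only at the most recent timestamp, as described earlier. It does not call Pointblank, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Fixed comparison time so the example is reproducible
now = datetime(2025, 1, 10, 12, 0)
cutoff = now - timedelta(days=2)

# A table that keeps history: one recent row and one 30-day-old row
event_times = [now - timedelta(hours=1), now - timedelta(days=30)]

# Row-wise comparison: each value must be newer than the cutoff
rowwise_results = [t > cutoff for t in event_times]

# Freshness-style comparison: only the newest value matters
newest_is_recent = max(event_times) > cutoff

print(rowwise_results)   # one result per row
print(newest_is_recent)  # single result for the whole column
```

Under that assumption, a table that retains historical rows would report failing rows in the row-wise check even though the feed itself is up to date.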
+
+::: {.callout-note}
+When comparing timezone-aware and timezone-naive datetimes, Pointblank will include a warning in the
+validation report. For consistent results, ensure your data and comparison times use compatible
+timezone settings.
+:::
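
The reason this mismatch matters is visible in standard Python, which refuses to order a naive datetime against an aware one. The sketch below uses only the standard library. Treating the naive value as UTC is an assumption made for the example, not something Pointblank is stated to do.

```python
from datetime import datetime, timezone

naive = datetime(2025, 1, 10, 12, 0)                       # no tzinfo attached
aware = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)  # explicit UTC

try:
    naive < aware
    comparable = True
except TypeError:
    # Ordering comparisons between naive and aware datetimes raise TypeError
    comparable = False

# Assumption for this sketch: the naive value was recorded in UTC
normalized = naive.replace(tzinfo=timezone.utc)

print(comparable)
print(normalized == aware)
```

Making the timezone explicit on both sides before validating avoids relying on whatever default the comparison time happens to use.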
+
 ## 4. AI-Powered Validations

 AI-powered validations use Large Language Models (LLMs) to validate data based on natural language

pointblank/_constants.py

Lines changed: 14 additions & 0 deletions
@@ -49,6 +49,7 @@
 "col_schema_match": "col_schema_match",
 "row_count_match": "row_count_match",
 "col_count_match": "col_count_match",
+"data_freshness": "data_freshness",
 "tbl_match": "tbl_match",
 "conjointly": "conjointly",
 "specially": "specially",
@@ -725,6 +726,19 @@
<path d="M11.5931863,12.5146694 C11.3836625,12.5146694 10.212234,12.5646694 10.212234,13.8480027 L10.212234,53.181336 C10.212234,54.4646694 11.3836625,54.5146694 11.5931863,54.5146694 L14.1646149,54.5146694 L14.1646149,12.5146694 L11.5931863,12.5146694 Z M20.1721771,12.5146694 L20.1721771,54.5146694 L16.2522908,54.5146694 L16.2522908,54.5146694 L16.2522908,12.5146694 L16.2522908,12.5146694 L20.1721771,12.5146694 Z M24.8656149,12.5150904 C25.1448786,12.521763 26.212234,12.6230027 26.212234,13.8480027 L26.212234,13.8480027 L26.212234,53.181336 C26.212234,54.4646694 25.0408054,54.5146694 24.8312816,54.5146694 L24.8312816,54.5146694 L22.259853,54.5146694 L22.259853,12.5146694 Z" id="rows_one" fill="#000000" fill-rule="nonzero" transform="translate(18.212234, 33.514669) rotate(-180.000000) translate(-18.212234, -33.514669) "></path>
 </g>
 </g>
+</svg>""",
+"data_freshness": """<?xml version="1.0" encoding="UTF-8"?>
+<svg width="67px" height="67px" viewBox="0 0 67 67" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<title>data_freshness</title>
+<g id="All-Icons" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
+<g id="data_freshness" transform="translate(0.000000, 0.275862)">
+<path d="M56.712234,1 C59.1975153,1 61.4475153,2.00735931 63.076195,3.63603897 C64.7048747,5.26471863 65.712234,7.51471863 65.712234,10 L65.712234,10 L65.712234,65 L10.712234,65 C8.22695259,65 5.97695259,63.9926407 4.34827294,62.363961 C2.71959328,60.7352814 1.71223397,58.4852814 1.71223397,56 L1.71223397,56 L1.71223397,10 C1.71223397,7.51471863 2.71959328,5.26471863 4.34827294,3.63603897 C5.97695259,2.00735931 8.22695259,1 10.712234,1 L10.712234,1 Z" id="rectangle" stroke="#000000" stroke-width="2" fill="#FFFFFF"></path>
+<circle id="clock-face" stroke="#000000" stroke-width="2" cx="33.5" cy="33" r="20"></circle>
+<line x1="33.5" y1="33" x2="33.5" y2="20" id="hour-hand" stroke="#000000" stroke-width="2.5" stroke-linecap="round"></line>
+<line x1="33.5" y1="33" x2="44" y2="33" id="minute-hand" stroke="#000000" stroke-width="2" stroke-linecap="round"></line>
+<circle id="center-dot" fill="#000000" cx="33.5" cy="33" r="2"></circle>
+</g>
+</g>
 </svg>""",
 "tbl_match": """<?xml version="1.0" encoding="UTF-8"?>
 <svg width="67px" height="67px" viewBox="0 0 67 67" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
