|
1 | 1 | --- |
2 | | -title: Introduction |
| 2 | +title: Welcome to Pointblank |
3 | 3 | jupyter: python3 |
4 | | -toc-expand: 2 |
5 | 4 | html-table-processing: none |
6 | 5 | --- |
| 6 | + |
| 7 | +<div style="text-align: center;"> |
| 8 | + |
| 9 | +{width=60%} |
| 10 | + |
| 11 | +**Data validation made beautiful and powerful.** |
| 12 | + |
| 13 | +</div> |
| 14 | + |
| 15 | +Pointblank is a data validation framework for Python that makes data quality checks beautiful, |
| 16 | +powerful, and stakeholder-friendly. Instead of cryptic error messages, get stunning interactive |
| 17 | +reports that turn data issues into conversations. |
| 18 | + |
7 | 19 | ```{python} |
8 | 20 | #| echo: false |
9 | 21 | #| output: false |
10 | 22 | import pointblank as pb |
11 | 23 | pb.config(report_incl_footer=False) |
12 | 24 | ``` |
13 | 25 |
|
14 | | -The Pointblank library is all about assessing the state of data quality for a table. You provide the |
15 | | -validation rules and the library will dutifully interrogate the data and provide useful reporting. |
16 | | -We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various |
17 | | -database tables. Let's walk through what data validation looks like in Pointblank. |
18 | | - |
19 | | -## A Simple Validation Table |
20 | | - |
21 | | -This is a validation report table that is produced from a validation of a Polars DataFrame: |
22 | | - |
23 | 26 | ```{python} |
24 | | -#| code-fold: true |
25 | | -#| code-summary: "Show the code" |
| 27 | +#| echo: false |
26 | 28 | import pointblank as pb |
27 | | -
|
28 | | -( |
29 | | - pb.Validate(data=pb.load_dataset(dataset="small_table"), label="Example Validation") |
30 | | - .col_vals_lt(columns="a", value=10) |
31 | | - .col_vals_between(columns="d", left=0, right=5000) |
32 | | - .col_vals_in_set(columns="f", set=["low", "mid", "high"]) |
33 | | - .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$") |
| 29 | +import polars as pl |
| 30 | +
|
| 31 | +validation = ( |
| 32 | +    pb.Validate( |
| 33 | +        data=pb.load_dataset(dataset="game_revenue", tbl_type="polars"), |
| 34 | +        tbl_name="game_revenue", |
| 35 | +        label="Comprehensive validation of game revenue data", |
| 36 | +        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35), |
| 37 | +        brief=True |
| 38 | +    ) |
| 39 | +    .col_vals_regex(columns="player_id", pattern=r"^[A-Z]{12}[0-9]{3}$")  # STEP 1 |
| 40 | +    .col_vals_gt(columns="session_duration", value=20)  # STEP 2 |
| 41 | +    .col_vals_ge(columns="item_revenue", value=0.20)  # STEP 3 |
| 42 | +    .col_vals_in_set(columns="item_type", set=["iap", "ad"])  # STEP 4 |
| 43 | +    .col_vals_in_set(  # STEP 5 |
| 44 | +        columns="acquisition", |
| 45 | +        set=["google", "facebook", "organic", "crosspromo", "other_campaign"] |
| 46 | +    ) |
| 47 | +    .col_vals_not_in_set(columns="country", set=["Mongolia", "Germany"])  # STEP 6 |
| 48 | +    .col_vals_between(  # STEP 7 |
| 49 | +        columns="session_duration", |
| 50 | +        left=10, right=50, |
| 51 | +        pre=lambda df: df.select(pl.median("session_duration")), |
| 52 | +        brief="Expect that the median of `session_duration` should be between `10` and `50`." |
| 53 | +    ) |
| 54 | +    .rows_distinct(columns_subset=["player_id", "session_id", "time"])  # STEP 8 |
| 55 | +    .row_count_match(count=2000)  # STEP 9 |
| 56 | +    .col_count_match(count=11)  # STEP 10 |
| 57 | +    .col_vals_not_null(columns="item_type")  # STEP 11 |
| 58 | +    .col_exists(columns="start_day")  # STEP 12 |
34 | 59 |     .interrogate() |
35 | 60 | ) |
36 | | -``` |
37 | | - |
38 | | -Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side |
39 | | -outlines the validation rules and the right-hand side provides the results of each validation step. |
40 | | -While simple in principle, there's a lot of useful information packed into this validation table. |
41 | | - |
42 | | -Here's a diagram that describes a few of the important parts of the validation table: |
43 | | - |
44 | | -{width=100%} |
45 | | - |
46 | | -There are three things that should be noted here: |
47 | | - |
48 | | -- validation steps: each step is a separate test on the table, focused on a certain aspect of the |
49 | | -table |
50 | | -- validation rules: the validation type is provided here along with key constraints |
51 | | -- validation results: interrogation results are provided here, with a breakdown of test units |
52 | | -(*total*, *passing*, and *failing*), threshold flags, and more |
53 | | - |
54 | | -The intent is to provide the key information in one place, and have it be interpretable by data |
55 | | -stakeholders. For example, a failure can be seen in the second row (notice there's a CSV button). A |
56 | | -data quality stakeholder could click this to download a CSV of the failing rows for that step. |
57 | | - |
58 | | -## Example Code, Step-by-Step |
59 | | - |
60 | | -This section will walk you through the example code used above. |
61 | | - |
62 | | -```python |
63 | | -import pointblank as pb |
64 | 61 |
|
65 | | -( |
66 | | - pb.Validate(data=pb.load_dataset(dataset="small_table")) |
67 | | - .col_vals_lt(columns="a", value=10) |
68 | | - .col_vals_between(columns="d", left=0, right=5000) |
69 | | - .col_vals_in_set(columns="f", set=["low", "mid", "high"]) |
70 | | - .col_vals_regex(columns="b", pattern=r"^[0-9]-[a-z]{3}-[0-9]{3}$") |
71 | | - .interrogate() |
72 | | -) |
| 62 | +validation.get_tabular_report(title="Game Revenue Validation Report").show("browser") |
73 | 63 | ``` |
74 | 64 |
|
75 | | -Note these three key pieces in the code: |
| 65 | +Ready to validate? Start with our [Installation](user-guide/installation.qmd) guide or jump straight |
| 66 | +to the [User Guide](user-guide/index.qmd). |
76 | 67 |
|
77 | | -- **data**: the `Validate(data=)` argument takes a DataFrame or database table that you want to validate |
78 | | -- **steps**: the methods starting with `col_vals_` specify validation steps that run on specific columns |
79 | | -- **execution**: the `~~Validate.interrogate()` method executes the validation plan on the table |
| 68 | +Pointblank is made with 💙 by [Posit](https://posit.co/). |
80 | 69 |
|
81 | | -This common pattern is used in a validation workflow, where `Validate` and |
82 | | -`~~Validate.interrogate()` bookend a validation plan generated through calling validation methods. |
| 70 | +## What is Data Validation? |
83 | 71 |
|
84 | | -In the next few sections we'll go a bit further by understanding how we can measure data quality and |
85 | | -respond to failures. |
| 72 | +Data validation ensures your data meets quality standards before it's used in analysis, reports, or |
| 73 | +downstream systems. Pointblank provides a structured way to define validation rules, execute them, |
| 74 | +and communicate results to both technical and non-technical stakeholders. |
86 | 75 |
|
87 | | -## Understanding Test Units |
| 76 | +With Pointblank you can: |
88 | 77 |
|
89 | | -Each validation step will execute a type of validation test on the target table. For example, a |
90 | | -`~~Validate.col_vals_lt()` validation step can test that each value in a column is less than a |
91 | | -specified number. And the key finding that's reported in each step is the number of *test units* |
92 | | -that pass or fail. |
| 78 | +- **Validate data** through a fluent, chainable API with [25+ validation methods](reference/index.qmd#validation-steps) |
| 79 | +- **Set thresholds** to define acceptable levels of data quality (warning, error, critical) |
| 80 | +- **Take actions** when thresholds are exceeded (notifications, logging, custom functions) |
| 81 | +- **Generate reports** that make data quality issues immediately understandable |
| 82 | +- **Inspect data** with built-in tools for previewing, summarizing, and finding missing values |
93 | 83 |
|
94 | | -In the validation report table, test unit metrics are displayed under the `UNITS`, `PASS`, and |
95 | | -`FAIL` columns. This diagram explains what the tabulated values signify: |
| 84 | +## Why Pointblank? |
96 | 85 |
|
97 | | -{width=100%} |
| 86 | +Pointblank is designed for the entire data team, not just engineers: |
98 | 87 |
|
99 | | -Test units are dependent on the test being run. Some validation methods might test every value in a |
100 | | -particular column, so each value will be a test unit. Others will only have a single test unit since |
101 | | -they aren't testing individual values but rather if the overall test passes or fails. |
| 88 | +- 🎨 **Beautiful Reports**: Interactive validation reports that stakeholders actually want to read |
| 89 | +- 📊 **Threshold Management**: Define quality standards with warning, error, and critical levels |
| 90 | +- 🔍 **Error Drill-Down**: Inspect failing data to get to root causes quickly |
| 91 | +- 🔗 **Universal Compatibility**: Works with Polars, Pandas, DuckDB, MySQL, PostgreSQL, SQLite, and more |
| 92 | +- 📝 **YAML Support**: Write validations in YAML for version control and team collaboration |
| 93 | +- ⚡ **CLI Tools**: Run validations from the command line for CI/CD pipelines or as quick checks |
| 94 | +- **Rich Inspection**: Preview data, analyze columns, and visualize missing values |
102 | 95 |
|
103 | | -## Setting Thresholds for Data Quality Signals |
| 96 | +## Quick Examples |
104 | 97 |
|
105 | | -Understanding test units is essential because they form the foundation of Pointblank's threshold |
106 | | -system. Thresholds let you define acceptable levels of data quality, triggering different severity |
107 | | -signals ('warning', 'error', or 'critical') when certain failure conditions are met. |
| 98 | +### Interactive Reports |
108 | 99 |
|
109 | | -Here's a simple example that uses a single validation step along with thresholds set using the |
110 | | -`Thresholds` class: |
| 100 | +Validation reports aren't just for engineers. They're designed for data stakeholders and are |
| 101 | +highly customizable and publishable as HTML: |
111 | 102 |
|
112 | | -```{python} |
113 | | -( |
114 | | - pb.Validate(data=pb.load_dataset(dataset="small_table")) |
115 | | - .col_vals_lt( |
116 | | - columns="a", |
117 | | - value=7, |
118 | | -
|
119 | | - # Set the 'warning' and 'error' thresholds --- |
120 | | - thresholds=pb.Thresholds(warning=2, error=4) |
121 | | - ) |
122 | | - .interrogate() |
123 | | -) |
| 103 | +```python |
| 104 | +validation.get_tabular_report().show()  # In a REPL: open the report explicitly |
| 105 | +validation  # In a notebook: the object displays its report automatically |
124 | 106 | ``` |
125 | 107 |
|
126 | | -If you look at the validation report table, we can see: |
127 | | - |
128 | | -- the `FAIL` column shows that 2 tests units have failed |
129 | | -- the `W` column (short for 'warning') shows a filled gray circle indicating those failing test |
130 | | -units reached that threshold value |
131 | | -- the `E` column (short for 'error') shows an open yellow circle indicating that the number of |
132 | | -failing test units is below that threshold |
133 | | - |
134 | | -The one final threshold level, `C` (for 'critical'), wasn't set so it appears on the validation |
135 | | -table as a long dash. |
136 | | - |
137 | | -## Taking Action on Threshold Exceedances |
| 108 | +### Threshold-Based Quality |
138 | 109 |
|
139 | | -Pointblank becomes even more powerful when you combine thresholds with actions. The |
140 | | -`Actions` class lets you trigger responses when validation failures exceed threshold levels, turning |
141 | | -passive reporting into active notifications. |
| 110 | +Set expectations and react when data quality degrades (with alerts, logging, or custom functions): |
142 | 111 |
|
143 | | -Here's a simple example that adds an action to the previous validation: |
144 | | - |
145 | | -```{python} |
146 | | -( |
147 | | - pb.Validate(data=pb.load_dataset(dataset="small_table")) |
148 | | - .col_vals_lt( |
149 | | - columns="a", |
150 | | - value=7, |
151 | | - thresholds=pb.Thresholds(warning=2, error=4), |
152 | | -
|
153 | | - # Set an action for the 'warning' threshold --- |
154 | | - actions=pb.Actions( |
155 | | - warning="WARNING: Column 'a' has values that aren't less than 7." |
156 | | - ) |
157 | | - ) |
| 112 | +```python |
| 113 | +validation = ( |
| 114 | +    pb.Validate(data=sales_data, thresholds=(0.01, 0.02, 0.05))  # Three threshold levels: warning, error, critical |
| 115 | +    .col_vals_not_null(columns="customer_id") |
| 116 | +    .col_vals_in_set(columns="status", set=["pending", "shipped", "delivered"]) |
158 | 117 |     .interrogate() |
159 | 118 | ) |
160 | 119 | ``` |
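
To make the threshold tuple above more concrete, here is a minimal pure-Python sketch of how fractional thresholds could map to a severity level. It is not Pointblank's implementation: `ThresholdLevels` and `severity` are hypothetical names, and the sketch assumes a level counts as reached when the fraction of failing test units is at or above that level's value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdLevels:
    """Hypothetical stand-in for the three threshold levels (not Pointblank's class)."""

    warning: Optional[float] = None
    error: Optional[float] = None
    critical: Optional[float] = None


def severity(n_failed: int, n_units: int, levels: ThresholdLevels) -> Optional[str]:
    """Return the most severe level reached, or None if no level is reached."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    fraction_failed = n_failed / n_units
    reached = None
    # Walk the levels from least to most severe so the last match wins.
    for name in ("warning", "error", "critical"):
        limit = getattr(levels, name)
        # Assumption: a level is reached when the failing fraction is >= its value.
        if limit is not None and fraction_failed >= limit:
            reached = name
    return reached


# Same fractions as the tuple (0.01, 0.02, 0.05) in the example above.
levels = ThresholdLevels(warning=0.01, error=0.02, critical=0.05)
print(severity(n_failed=0, n_units=1000, levels=levels))
print(severity(n_failed=15, n_units=1000, levels=levels))
print(severity(n_failed=60, n_units=1000, levels=levels))
```

In a real workflow the report's `W`, `E`, and `C` indicators carry this information, so a helper like this is only useful for reasoning about which fractions to choose.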
161 | 120 |
|
162 | | -Notice the printed warning message: `"WARNING: Column 'a' has values that aren't less than |
163 | | -7."`. The warning indicator (filled gray circle) visually confirms this threshold was reached and |
164 | | -the action should trigger. |
| 121 | +### YAML Workflows |
| 122 | + |
| 123 | +Validations defined in YAML work well for CI/CD pipelines and team collaboration: |
165 | 124 |
|
166 | | -Actions make your validation workflows more responsive and integrated with your data pipelines. For |
167 | | -example, you can generate console messages, Slack notifications, and more. |
| 125 | +```yaml |
| 126 | +validate: |
| 127 | +  data: sales_data |
| 128 | +  tbl_name: "sales_data" |
| 129 | +  thresholds: [0.01, 0.02, 0.05] |
168 | 130 |
|
169 | | -## Navigating the User Guide |
| 131 | +steps: |
| 132 | +  - col_vals_not_null: |
| 133 | +      columns: "customer_id" |
| 134 | +  - col_vals_in_set: |
| 135 | +      columns: "status" |
| 136 | +      set: ["pending", "shipped", "delivered"] |
| 137 | +``` |
170 | 138 |
|
171 | | -As you continue exploring Pointblank's capabilities, you'll find the **User Guide** organized into |
172 | | -sections that will help you navigate the various features. |
| 139 | +```python |
| 140 | +validation = pb.yaml_interrogate("validation.yaml") |
| 141 | +``` |
173 | 142 |
|
174 | | -### Getting Started |
| 143 | +### Command Line Power |
175 | 144 |
|
176 | | -The *Getting Started* section introduces you to Pointblank: |
| 145 | +Run validations without writing code: |
177 | 146 |
|
178 | | -- [Introduction](index.qmd): Overview of Pointblank and core concepts (**this article**) |
179 | | -- [Installation](user-guide/installation.qmd): How to install and set up Pointblank |
| 147 | +```bash |
| 148 | +# Quick validation |
| 149 | +pb validate sales_data.csv --check col-vals-not-null --column customer_id |
180 | 150 |
|
181 | | -### Validation Plan |
| 151 | +# Run YAML workflows |
| 152 | +pb run validation.yaml --exit-code # <- Great for CI/CD! |
182 | 153 |
|
183 | | -The *Validation Plan* section covers everything you need to know about creating robust |
184 | | -validation plans: |
| 154 | +# Explore your data |
| 155 | +pb scan sales_data.csv |
| 156 | +pb missing sales_data.csv |
| 157 | +``` |
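
If a pipeline step is written in Python rather than shell, the `pb run ... --exit-code` command shown above could be wrapped as in the sketch below. The helper names are hypothetical, and the sketch assumes that `--exit-code` makes the CLI return a non-zero status when the validation fails, so check the CLI help before relying on it.

```python
import subprocess
from typing import List


def build_pb_run_command(yaml_path: str) -> List[str]:
    """Build the argument list for `pb run <yaml_path> --exit-code`."""
    return ["pb", "run", yaml_path, "--exit-code"]


def run_validation_gate(yaml_path: str) -> int:
    """Run the CLI and return its exit status (hypothetical helper).

    Assumption: a non-zero status means the validation did not pass.
    """
    try:
        completed = subprocess.run(build_pb_run_command(yaml_path), check=False)
    except FileNotFoundError:
        # The `pb` executable is not on PATH; 127 is the conventional shell code.
        return 127
    return completed.returncode


# Show the command that would be executed, without running it.
print(" ".join(build_pb_run_command("validation.yaml")))
```

A CI script could then end with `sys.exit(run_validation_gate("validation.yaml"))` so the job fails whenever the command does.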
185 | 158 |
|
186 | | -- [Overview](user-guide/validation-overview.qmd): Survey of validation methods and their shared parameters |
187 | | -- [Validation Methods](user-guide/validation-methods.qmd): A closer look at the more common validation methods |
188 | | -- [Column Selection Patterns](user-guide/column-selection-patterns.qmd): Techniques for targeting specific columns |
189 | | -- [Preprocessing](user-guide/preprocessing.qmd): Transform data before validation |
190 | | -- [Segmentation](user-guide/segmentation.qmd): Apply validations to specific segments of your data |
191 | | -- [Thresholds](user-guide/thresholds.qmd): Set quality standards and trigger severity levels |
192 | | -- [Actions](user-guide/actions.qmd): Respond to threshold exceedances with notifications or custom functions |
193 | | -- [Briefs](user-guide/briefs.qmd): Add context to validation steps |
| 159 | +## Installation |
194 | 160 |
|
195 | | -### Advanced Validation |
| 161 | +Install Pointblank using pip or conda: |
196 | 162 |
|
197 | | -The *Advanced Validation* section explores more specialized validation techniques: |
| 163 | +```bash |
| 164 | +pip install pointblank |
| 165 | +# or |
| 166 | +conda install conda-forge::pointblank |
| 167 | +``` |
198 | 168 |
|
199 | | -- [Expression-Based Validation](user-guide/expressions.qmd): Use column expressions for advanced validation |
200 | | -- [Schema Validation](user-guide/schema-validation.qmd): Enforce table structure and column types |
201 | | -- [Assertions](user-guide/assertions.qmd): Raise exceptions to enforce data quality requirements |
202 | | -- [Draft Validation](user-guide/draft-validation.qmd): Create validation plans from existing data |
| 169 | +For specific backends: |
203 | 170 |
|
204 | | -### Post Interrogation |
| 171 | +```bash |
| 172 | +pip install "pointblank[pl]" # Polars support |
| 173 | +pip install "pointblank[pd]" # Pandas support |
| 174 | +pip install "pointblank[duckdb]" # DuckDB support |
| 175 | +pip install "pointblank[postgres]" # PostgreSQL support |
| 176 | +``` |
205 | 177 |
|
206 | | -After validating your data, the *Post Interrogation* section helps you analyze and respond to |
207 | | -results: |
| 178 | +See the [Installation guide](user-guide/installation.qmd) for more details. |
208 | 179 |
|
209 | | -- [Validation Reports](user-guide/validation-reports.qmd): Understand and customize the validation report table |
210 | | -- [Step Reports](user-guide/step-reports.qmd): View detailed results for individual validation steps |
211 | | -- [Data Extracts](user-guide/extracts.qmd): Extract and analyze failing data |
212 | | -- [Sundering Validated Data](user-guide/sundering.qmd): Split data based on validation results |
| 180 | +## Join the Community |
213 | 181 |
|
214 | | -### Data Inspection |
| 182 | +We'd love to hear from you! Connect with us: |
215 | 183 |
|
216 | | -The *Data Inspection* section provides tools to explore and understand your data: |
| 184 | +- [GitHub Issues](https://github.com/posit-dev/pointblank/issues) for bug reports and feature requests |
| 185 | +- [Discord server](https://discord.com/invite/YH7CybCNCQ) for discussions and help |
| 186 | +- [Contributing guidelines](https://github.com/posit-dev/pointblank/blob/main/CONTRIBUTING.md) if you'd like to contribute |
217 | 187 |
|
218 | | -- [Previewing Data](user-guide/preview.qmd): View samples of your data |
219 | | -- [Column Summaries](user-guide/col-summary-tbl.qmd): Get statistical summaries of your data |
220 | | -- [Missing Values Reporting](user-guide/missing-vals-tbl.qmd): Identify and visualize missing data |
| 188 | +--- |
221 | 189 |
|
222 | | -By following this guide, you'll gain a comprehensive understanding of how to validate, monitor, and |
223 | | -maintain high-quality data with Pointblank. |
| 190 | +**License**: MIT | **© 2024-2025 Posit Software, PBC** |