docs/get-started/index.qmd
@@ -4,14 +4,14 @@ jupyter: python3
html-table-processing: none
---

The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or a selection of DB tables. Let's walk through what table validation looks like in pointblank!
## A Simple Example with the Basics

This is a validation table that was produced from a validation of a Polars DataFrame:

```{python}
#| code-fold: true

import pointblank as pb
@@ -27,25 +27,36 @@ validation_1 = (
validation_1.get_tabular_report()
```
Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step.
While simple in principle, there's a lot of useful information packed into this validation table! The bright green color strip at the left of a validation step indicates that all test units passed validation. The lighter green color in the second step means that there was at least one failing test unit. What are test units? Each validation step could perform one or many atomic tests (e.g., one test per cell in a column). It's quite a bit to take in, so here's a diagram that describes a few of the important parts of the validation table:
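To make the idea of test units concrete, here is a small plain-Python sketch. It's a conceptual illustration only, not how pointblank is implemented: one validation step applies an atomic test to every value in a column, and each value is one test unit.

```python
# Conceptual illustration: one validation step applies an atomic test
# to every value in a column; each value is one "test unit".
def check_column(values, predicate):
    results = [predicate(v) for v in values]
    return {
        "test_units": len(results),
        "passed": sum(results),
        "failed": results.count(False),
    }

# A col_vals_gt()-style rule: every value must be greater than 100.
report = check_column([135.4, 108.3, 50.0, 834.3], lambda v: v > 100)
print(report)  # 4 test units: 3 passed, 1 failed
```

With three of four values above 100, the step has one failing test unit, which is what drives the lighter green strip in the reporting table.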
The code that performs the validation on the Polars table can be revealed by interacting with the `Code` disclosure triangle up above. Here's a rundown of how it all works, in three steps.
#### Step 1
The object that we need for this workflow is created with the `Validate` class. Such an object can handle one input table at any given time and the `data=` argument is where the table is specified. In case you want to see the content of the input table, it's provided just below.

<details>
<summary>The Polars table</summary>

```{python}
#| echo: false

pb.load_dataset(dataset="small_table")
```

</details>
#### Step 2
The validation process needs directives on how exactly to check the tabular data. To this end, we draw upon validation methods to define the validation rules (e.g., `col_vals_gt()`, `col_vals_between()`, etc.). Each invocation translates to a discrete validation step. We can use as many of these as is necessary for testing the table in question---more is usually better.
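The chaining pattern behind this can be sketched in a few lines. `ValidationPlan` here is a hypothetical stand-in, not part of pointblank; it only mimics how each method call records one step and returns the object to allow further chaining:

```python
# Sketch (not the real pointblank internals): each validation-method call
# records one step in a plan instead of checking anything immediately.
class ValidationPlan:
    def __init__(self):
        self.steps = []

    def col_vals_gt(self, columns, value):
        self.steps.append(("col_vals_gt", columns, value))
        return self  # returning self is what enables method chaining

    def col_vals_between(self, columns, left, right):
        self.steps.append(("col_vals_between", columns, (left, right)))
        return self

plan = (
    ValidationPlan()
    .col_vals_gt(columns="d", value=100)
    .col_vals_between(columns="c", left=0, right=10)
)
print(len(plan.steps))  # two discrete validation steps
```

Two method invocations, two validation steps: that one-to-one correspondence is what the reporting table's rows reflect.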
#### Step 3
We conclude this process with the `interrogate()` method. None of the validation methods defined will act on the input table until `interrogate()` is called. During the interrogation phase, the validation plan (the collection of validation steps) will be executed. We then get useful results within the `Validate` object.
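A minimal sketch of this collect-then-execute pattern (assumed behavior for illustration, not pointblank's actual code): the plan only touches the table once an interrogation-style function runs it.

```python
# A toy table and a collected plan of (column, predicate) steps.
table = {"a": [2, 5, 3, 8], "d": [135.4, 108.3, 50.0, 834.3]}

plan = [
    ("a", lambda v: v > 0),    # a col_vals_gt()-style step on column "a"
    ("d", lambda v: v > 100),  # another step, on column "d"
]

def interrogate(table, plan):
    # Only here does the plan act on the data: each step runs its
    # atomic test over every value in its target column.
    results = []
    for column, predicate in plan:
        outcomes = [predicate(v) for v in table[column]]
        results.append(
            {"column": column, "n": len(outcomes), "n_failed": outcomes.count(False)}
        )
    return results

for step_result in interrogate(table, plan):
    print(step_result)
```

Deferring execution like this is what lets the plan be assembled fluently up front and then run as a single unit.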
That's data validation with pointblank in a nutshell! Of course, we do want reporting on how it all went, so using the `get_tabular_report()` method as a fourth step will almost always be part of the process. In the next section we'll go a bit further by introducing a means to gauge data quality with failure thresholds.
@@ -74,10 +85,10 @@ validation_2 = (
validation_2.get_tabular_report()
```
As can be seen, all validation steps have some non-zero number of failing test units. Here's a breakdown of how this validation table can be interpreted:
- steps 1 and 3: failure rate high enough to enter the `WARN` and `STOP` states (more than 20% of test units failed in both steps)
- step 2: failure rate high enough to enter the `WARN` state (more than 10% of test units failed)
- step 4: one failing test unit, but well below the 10%-failing `WARN` threshold
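The threshold logic in this breakdown can be illustrated with a tiny, hypothetical helper (not pointblank API). It assumes the 10% `WARN` and 20% `STOP` levels described above, with a state entered when the failing fraction exceeds its level:

```python
# Hypothetical helper illustrating threshold states: WARN when more than
# 10% of test units fail, STOP when more than 20% fail.
def threshold_states(n_failed, n_units, warn_at=0.10, stop_at=0.20):
    fraction = n_failed / n_units
    return {"WARN": fraction > warn_at, "STOP": fraction > stop_at}

print(threshold_states(3, 13))  # ~23% failing -> WARN and STOP
print(threshold_states(2, 13))  # ~15% failing -> WARN only
print(threshold_states(1, 13))  # ~8% failing  -> neither state entered
```

Note how the states are cumulative: any failure rate high enough to trigger `STOP` has necessarily triggered `WARN` as well.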
The availability of data extracts that show where the failures occurred can help you get at the heart of what caused such failures in the first place. On the flip side, one could modify the rules of the validation steps should the flagged rows in the extracts turn out to be reasonable.
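As a rough illustration of the idea behind such extracts (a hypothetical snippet, not pointblank's actual extract API), the failing rows are simply the rows whose test units did not pass the step's rule:

```python
# Hypothetical sketch: a data extract collects the rows that failed a
# validation step so they can be inspected directly.
rows = [
    {"name": "row_1", "d": 135.4},
    {"name": "row_2", "d": 50.0},
    {"name": "row_3", "d": 834.3},
]

# Extract the rows failing a "d > 100" validation step.
extract = [row for row in rows if not row["d"] > 100]
print(extract)  # only row_2 fails the rule
```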