Commit d4efb2d

Update index.qmd
1 parent 2cff271 commit d4efb2d

1 file changed: +21 -10 lines changed

docs/get-started/index.qmd

Lines changed: 21 additions & 10 deletions
@@ -4,14 +4,14 @@ jupyter: python3
 html-table-processing: none
 ---
 
-To assess the state of data quality for a table, we use the `Validate` class to collect our validation instructions and then perform the interrogation. After interrogation, we have an object that can produce reporting or enable further processing of the input table. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or a selection of DB tables. Let's walk through what a table validation looks like in pointblank!
+The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or a selection of DB tables. Let's walk through what table validation looks like in pointblank!
 
 ## A Simple Example with the Basics
 
-This is a validation table that's checking a Polars DataFrame:
+This is a validation table that was produced from a validation of a Polars DataFrame:
 
 ```{python}
-# | code-fold: true
+#| code-fold: true
 
 import pointblank as pb
 
@@ -27,25 +27,36 @@ validation_1 = (
 validation_1.get_tabular_report()
 ```
 
-Each row is a validation step. The left-hand side outlines the validation rules. The right-hand side provides the results of each validation step.
+Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step.
 
-While simple in principle, there's a lot of useful information packed into this validation table! The bright green color strips at the left of each validation step indicates that all test units passed validation. The lighter green color in the second step means that there was at least one failing unit. What are test units? Each validation step could perform one or many atomic tests (e.g., one test per cell in column). It's quite a bit to take in, so here's a diagram that describes the different parts of the validation table:
+While simple in principle, there's a lot of useful information packed into this validation table! The bright green color strips at the left of each validation step indicate that all test units passed validation. The lighter green color in the second step means that there was at least one failing unit. What are test units? Each validation step could perform one or many atomic tests (e.g., one test per cell in a column). It's quite a bit to take in, so here's a diagram that describes a few of the important parts of the validation table:
 
 ![](/assets/pointblank-validation-table.png){width=100%}
 
-The code that performs the validation on the Polars table can be revealed by interacting with the `Code` disclosure triangle. Here's a rundown of how it all works in three steps.
+The code that performs the validation on the Polars table can be revealed by interacting with the `Code` disclosure triangle up above. Here's a rundown of how it all works, in three steps.
 
 #### Step 1
 
-The object that we need for this workflow is created with the `Validate` class. Such an object can handle one target table at any given time and the `data=` argument is where the table is specified.
+The object that we need for this workflow is created with the `Validate` class. Such an object can handle one input table at any given time and the `data=` argument is where the table is specified. In case you want to see the content of the input table, it's provided just below.
+
+<details>
+<summary>The Polars table</summary>
+
+```{python}
+#| echo: false
+
+pb.load_dataset(dataset="small_table")
+```
+
+</details>
 
 #### Step 2
 
 The validation process needs directives on how exactly to check the tabular data. To this end we draw upon validation methods to define the validation rules (e.g., `col_vals_gt()`, `col_vals_between()`, etc.). Each invocation translates to discrete validation steps. We can use as many of these as are necessary for testing the table in question---more is usually better.
 
 #### Step 3
 
-We conclude this process with the `interrogate()` method. All of the validation methods defined will not act on the target table until `interrogate()` is used. During the interrogation phase, the validation plan (the collection of validation steps) will be executed. We then get useful results within the Validate object.
+We conclude this process with the `interrogate()` method. None of the validation methods defined will act on the input table until `interrogate()` is used. During the interrogation phase, the validation plan (the collection of validation steps) will be executed. We then get useful results within the `Validate` object.
 
 That's data validation with pointblank in a nutshell! Of course, we do want reporting on how it went down, so using the `get_tabular_report()` method as the fourth step will almost always be a thing that's done. In the next section we'll go a bit further by introducing a means to gauge data quality with failure thresholds.
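
Editor's sketch: the three steps in this hunk describe a deferred workflow, where validation methods only record rules and nothing touches the table until `interrogate()` runs. The toy class below illustrates that pattern in plain Python. It is not pointblank's implementation: `MiniValidate`, its dict-of-lists table, and the shape of `results` are hypothetical names invented for this illustration, and only the method names `col_vals_gt`, `col_vals_between`, and `interrogate` are borrowed from the text.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MiniValidate:
    """Toy stand-in for a validation object (hypothetical, not pointblank)."""

    data: dict  # assumed table shape: column name -> list of cell values
    steps: list = field(default_factory=list)    # the validation plan
    results: list = field(default_factory=list)  # filled by interrogate()

    def _add_step(self, label: str, column: str, test: Callable) -> "MiniValidate":
        # Step 2: only record the rule; no cell is checked yet.
        self.steps.append((label, column, test))
        return self

    def col_vals_gt(self, column: str, value: float) -> "MiniValidate":
        return self._add_step(f"{column} > {value}", column, lambda x: x > value)

    def col_vals_between(self, column: str, left: float, right: float) -> "MiniValidate":
        return self._add_step(
            f"{left} <= {column} <= {right}", column, lambda x: left <= x <= right
        )

    def interrogate(self) -> "MiniValidate":
        # Step 3: execute the plan. Each cell in the column is one test unit.
        self.results = []
        for label, column, test in self.steps:
            cells = self.data[column]
            passed = sum(1 for cell in cells if test(cell))
            self.results.append(
                {"step": label, "units": len(cells), "passed": passed,
                 "failed": len(cells) - passed}
            )
        return self


# Step 1: one object wraps one input table (made-up values).
table = {"a": [2, 3, 7, 1], "d": [150.0, 90.0, 300.0, 120.0]}
plan = MiniValidate(data=table).col_vals_gt("d", 100).col_vals_between("a", 1, 5)

results_before = list(plan.results)  # still empty: nothing has run yet
plan.interrogate()
for row in plan.results:
    print(row)
```

Keeping rule collection separate from execution is what lets the plan be built up method by method and then run once against the table; the real library's reporting step (`get_tabular_report()`) has no counterpart in this sketch.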

@@ -74,10 +85,10 @@ validation_2 = (
 validation_2.get_tabular_report()
 ```
 
-As can be seen, all validation steps show some degree of failing test units. Here's a breakdown on how this can be interpreted:
+As can be seen, all validation steps have some non-zero number of failing test units. Here's a breakdown on how this validation table can be interpreted:
 
 - steps 1 and 3: failure rate high enough to enter the `WARN` and `STOP` states (more than 20% of test units failed in both steps)
 - step 2: failure rate high enough to enter the `WARN` state (more than 10% of test units failed)
 - step 4: one failing test unit but well below the 10%-failing `WARN` threshold
 
-The availability of data extracts that show where the failures occurred can serve to get at the heart of what caused such failures in the first place. On the flip side, one could modify the rules of the validation steps should the flagged rows in the extracts turn out to be reasonable.
+The availability of data extracts that show where the failures occurred can help you get at the heart of what caused such failures in the first place. On the flip side, one could modify the rules of the validation steps should the flagged rows in the extracts turn out to be reasonable.
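
Editor's sketch: the breakdown in this hunk maps a step's failure rate onto `WARN` and `STOP` states using 10% and 20% thresholds. The arithmetic can be illustrated with a small stand-alone function. This is an assumption-laden sketch rather than pointblank's code: `threshold_states` and its parameter names are hypothetical, the strict "more than" comparison follows the wording of the list and may differ from the library at the exact boundary, and the unit counts in the usage lines are made up.

```python
def threshold_states(
    n_failed: int,
    n_units: int,
    warn_at: float = 0.10,  # 10% WARN threshold quoted in the text
    stop_at: float = 0.20,  # 20% STOP threshold quoted in the text
) -> list:
    """Return the states entered for one validation step (hypothetical helper)."""
    if n_units <= 0:
        raise ValueError("a validation step needs at least one test unit")
    fraction_failed = n_failed / n_units
    states = []
    # Strict comparison mirrors "more than 10%" / "more than 20%" in the text;
    # boundary behavior in the real library is not confirmed here.
    if fraction_failed > warn_at:
        states.append("WARN")
    if fraction_failed > stop_at:
        states.append("STOP")
    return states


# Made-up counts for a 13-unit step, one per case in the list above.
high_failure = threshold_states(n_failed=4, n_units=13)    # about 31% failing
medium_failure = threshold_states(n_failed=2, n_units=13)  # about 15% failing
single_failure = threshold_states(n_failed=1, n_units=13)  # about 8% failing
print(high_failure, medium_failure, single_failure)
```

Because the states are cumulative, a step that crosses the `STOP` threshold has necessarily crossed `WARN` as well, which matches the description of steps 1 and 3.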
