Commit 2ed5dde

Update api-docs.txt
1 parent 500837f commit 2ed5dde

1 file changed

+82
-17
lines changed

pointblank/data/api-docs.txt

Lines changed: 82 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -60,7 +60,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
6060
[`Thresholds`](`pointblank.Thresholds`) object.
6161
actions
6262
The actions to take when validation steps meet or exceed any set threshold levels. This
63-
should be provided in the form of an `Actions` object. If `None` then no default actions
63+
should be provided in the form of an `Actions` object. If `None` then no global actions
6464
will be set.
6565
brief
6666
A global setting for briefs, which are optional brief descriptions for validation steps
@@ -104,7 +104,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
104104

105105
Examples
106106
--------
107-
## Creating a validation plan and interrogating
107+
### Creating a validation plan and interrogating
108108

109109
Let's walk through a data quality analysis of an extremely small table. It's actually called
110110
`"small_table"` and it's accessible through the [`load_dataset()`](`pointblank.load_dataset`)
@@ -170,11 +170,72 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
170170
[`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method, which contains
171171
options for modifying the display of the table.
172172

173-
Furthermore, post-interrogation methods such as
174-
[`get_step_report()`](`pointblank.Validate.get_step_report`),
175-
[`get_data_extracts()`](`pointblank.Validate.get_data_extracts`), and
176-
[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) allow you to generate
177-
additional reporting or extract useful data for downstream analysis from a `Validate` object.
173+
### Adding briefs
174+
175+
Briefs are short descriptions of the validation steps. While they can be set for each step
176+
individually, they can also be set globally. The global setting is done by using the
177+
`brief=` argument in `Validate`. The global setting can be as simple as `True` to have
178+
automatically-generated briefs for each step. Alternatively, we can use templating elements
179+
like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated
180+
brief). Here's an example of a global setting for briefs:
181+
182+
```python
183+
validation = (
184+
pb.Validate(
185+
data=pb.load_dataset(),
186+
tbl_name="small_table",
187+
label="Validation example with briefs",
188+
brief="Step {step}: {auto}",
189+
)
190+
.col_vals_gt(columns="d", value=100)
191+
.col_vals_between(columns="c", left=3, right=10, na_pass=True)
192+
.col_vals_regex(
193+
columns="b",
194+
pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
195+
brief="Regex check for column {col}"
196+
)
197+
.interrogate()
198+
)
199+
200+
validation
201+
```
202+
203+
We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore,
204+
the global brief's template (`"Step {step}: {auto}"`) is applied to all steps except for the
205+
final step, where the step-level `brief=` argument provided an override.
206+
207+
If you should want to cancel the globally-defined brief for one or more validation steps, you
208+
can set `brief=False` in those particular steps.
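
The precedence rules described above (global template, step-level override, `brief=False` to cancel) can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: `resolve_brief()` and its parameters are hypothetical names, and the auto-generated text is passed in as a placeholder string.

```python
def resolve_brief(global_brief, step_brief, step, auto_text, col=None):
    """Return the brief text for one step, or None when no brief applies.

    Hypothetical helper: mirrors the documented rules, not pointblank internals.
    """
    # A step-level setting (anything other than None) overrides the global one.
    brief = step_brief if step_brief is not None else global_brief

    # No brief configured, or explicitly cancelled with False.
    if brief is None or brief is False:
        return None

    # True requests the automatically generated brief as-is.
    if brief is True:
        return auto_text

    # Otherwise treat the value as a template and fill in the placeholders.
    text = brief.replace("{step}", str(step)).replace("{auto}", auto_text)
    if col is not None:
        text = text.replace("{col}", col)
    return text


global_brief = "Step {step}: {auto}"

# Step 1 has no step-level brief, so the global template is used.
print(resolve_brief(global_brief, None, 1, "placeholder auto text"))

# Step 3 supplies its own brief, which replaces the global template.
print(resolve_brief(global_brief, "Regex check for column {col}", 3, "unused", col="b"))

# brief=False at the step level cancels the global brief for that step.
print(resolve_brief(global_brief, False, 2, "unused"))
```

Using `None` to mean "not set" and `False` to mean "cancel" keeps the two cases distinct, which is why the sketch compares with `is` rather than relying on truthiness.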
209+
210+
### Post-interrogation methods
211+
212+
The `Validate` class has a number of post-interrogation methods that can be used to extract
213+
useful information from the validation results. For example, the
214+
[`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method can be used to get
215+
the data extracts for each validation step.
216+
217+
```python
218+
validation.get_data_extracts()
219+
```
220+
221+
We can also view step reports for each validation step using the
222+
[`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the
223+
type of validation step and shows the relevant information for a step's validation.
224+
225+
```python
226+
validation.get_step_report(i=2)
227+
```
228+
229+
The `Validate` class also has a method for getting the sundered data, which is the data that
230+
passed or failed the validation steps. This can be done using the
231+
[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.
232+
233+
```python
234+
pb.preview(validation.get_sundered_data())
235+
```
236+
237+
The sundered data is a DataFrame that contains the rows that passed or failed the validation.
238+
The default behavior is to return the rows that failed the validation, as shown above.
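
To make the idea of sundering concrete, here is a minimal sketch that does not use the library at all. It assumes that a row counts as passing only when it passes every row-based check, and it uses plain dictionaries in place of a DataFrame; the `sunder()` helper and the sample rows are hypothetical.

```python
def sunder(rows, checks):
    """Split rows into (passed, failed) according to a list of row-level checks.

    Hypothetical helper: a row passes only if every check returns True for it.
    """
    passed, failed = [], []
    for row in rows:
        if all(check(row) for check in checks):
            passed.append(row)
        else:
            failed.append(row)
    return passed, failed


# Made-up rows with the column names used in the examples above.
rows = [
    {"c": 4, "d": 350.0},
    {"c": 2, "d": 900.0},
    {"c": 7, "d": 50.0},
]

# Row-level analogues of col_vals_gt(columns="d", value=100)
# and col_vals_between(columns="c", left=3, right=10).
checks = [
    lambda r: r["d"] > 100,
    lambda r: 3 <= r["c"] <= 10,
]

passed, failed = sunder(rows, checks)
print(len(passed), len(failed))
```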
178239

179240

180241
Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
@@ -4169,7 +4230,7 @@ validation steps, (3) `interrogate()`. After interrogation of the data, we can v
41694230
report table (by printing the object or using `get_tabular_report()`), extract key metrics, or we
41704231
can split the data based on the validation results (with `get_sundered_data()`).
41714232

4172-
interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, sample_limit: 'int' = 5000) -> 'Validate'
4233+
interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, extract_limit: 'int' = 500) -> 'Validate'
41734234

41744235
Execute each validation step against the table and store the results.
41754236

@@ -4179,8 +4240,8 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
41794240

41804241
The interrogation process will collect extracts of failing rows if the `collect_extracts=`
41814242
option is set to `True` (the default). We can control the number of rows collected using the
4182-
`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `sample_limit=` option will
4183-
enforce a hard limit on the number of rows collected when using the `sample_frac=` option.
4243+
`get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will
4244+
enforce a hard limit on the number of rows collected when `collect_extracts=True`.
41844245

41854246
After interrogation is complete, the `Validate` object will have gathered information, and
41864247
we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`),
@@ -4199,9 +4260,9 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
41994260
The processed data frames produced by executing the validation steps is collected and
42004261
stored in the `Validate` object if `collect_tbl_checked=True`. This information is
42014262
necessary for some methods (e.g.,
4202-
[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it potentially
4203-
makes the object grow to a large size. To opt out of attaching this data, set this
4204-
argument to `False`.
4263+
[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can
4264+
potentially make the object grow to a large size. To opt out of attaching this data, set
4265+
this to `False`.
42054266
get_first_n
42064267
If the option to collect rows where test units is chosen, there is the option here to
42074268
collect the first `n` rows. Supply an integer number of rows to extract from the top of
@@ -4215,11 +4276,15 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
42154276
sample_frac
42164277
If the option to collect non-passing rows is chosen, this option allows for the sampling
42174278
of a fraction of those rows. Provide a number in the range of `0` and `1`. The number of
4218-
rows to return could be very large, however, the `sample_limit=` option will apply a
4279+
rows to return could be very large, however, the `extract_limit=` option will apply a
42194280
hard limit to the returned rows.
4220-
sample_limit
4221-
A value that limits the possible number of rows returned when sampling non-passing rows
4222-
using the `sample_frac=` option.
4281+
extract_limit
4282+
A value that limits the possible number of rows returned when extracting non-passing
4283+
rows. The default is `500` rows. This limit is applied after any sampling or limiting
4284+
options are applied. If the number of rows to be returned is greater than this limit,
4285+
then the number of rows returned will be limited to this value. This is useful for
4286+
preventing the collection of too many rows when the number of non-passing rows is very
4287+
large.
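
The ordering described for `extract_limit=` (sampling or limiting first, hard cap last) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the library's code: `collect_extract()` and its `seed` parameter are hypothetical, and the precedence among `get_first_n`, `sample_n`, and `sample_frac` when more than one is given is assumed here rather than taken from the docs.

```python
import random


def collect_extract(failing_rows, get_first_n=None, sample_n=None,
                    sample_frac=None, extract_limit=500, seed=None):
    """Reduce a collection of failing rows, then apply a hard cap.

    Hypothetical helper illustrating the documented order of operations.
    """
    rows = list(failing_rows)
    rng = random.Random(seed)

    # Assumed precedence: first-n, then fixed-size sample, then fractional sample.
    if get_first_n is not None:
        rows = rows[:get_first_n]
    elif sample_n is not None:
        rows = rng.sample(rows, min(sample_n, len(rows)))
    elif sample_frac is not None:
        rows = rng.sample(rows, int(len(rows) * sample_frac))

    # The hard limit is applied after any sampling or limiting option.
    return rows[:extract_limit]


failing = list(range(2000))  # stand-in for 2000 non-passing rows

print(len(collect_extract(failing)))                            # capped by the default limit
print(len(collect_extract(failing, sample_frac=0.5, seed=0)))   # 1000 sampled, then capped
print(len(collect_extract(failing, sample_n=50, seed=0)))       # below the cap, unchanged
```

Applying the cap last means that a large `sample_frac=` value can never produce more rows than `extract_limit=` allows, which matches the stated purpose of the option.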
42234288

42244289
Returns
42254290
-------
