
Commit 34280d9

Update api-docs.txt
1 parent a1047ad commit 34280d9

1 file changed: +219, -8 lines changed

pointblank/data/api-docs.txt

Lines changed: 219 additions & 8 deletions
@@ -42,7 +42,11 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 Parameters
 ----------
 data
-The table to validate, which could be a DataFrame object or an Ibis table object. Read the
+The table to validate, which could be a DataFrame object, an Ibis table object, a CSV
+file path, or a Parquet file path. When providing a CSV or Parquet file path (as a string
+or `pathlib.Path` object), the file will be automatically loaded using an available
+DataFrame library (Polars or Pandas). Parquet input also supports glob patterns,
+directories containing .parquet files, and Spark-style partitioned datasets. Read the
 *Supported Input Table Types* section for details on the supported table types.
 tbl_name
 An optional name to assign to the input table object. If no value is provided, a name will
@@ -113,12 +117,18 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 - PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or Spark-style partitioned dataset)
 
 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
 `ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires
 the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas
 DataFrame, the Ibis library is not required.
 
+To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+provided. The file will be automatically detected and loaded using the best available DataFrame
+library. The loading preference is Polars first, then Pandas as a fallback.
+
 Thresholds
 ----------
 The `thresholds=` parameter is used to set the failure-condition levels for all validation
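
As a supplement to the `thresholds=` description above, here is a minimal sketch that combines it with the `Thresholds` class documented later in this file; the dataset, column, and threshold levels are illustrative only, not taken from the commit:

```python
import pointblank as pb

# Set global failure-condition levels for every validation step:
# a warning at 10% failing test units, an error at 20%, critical at 30%
# (the specific levels here are arbitrary).
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3),
    )
    .col_vals_not_null(columns="a")
    .interrogate()
)

validation
```
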
@@ -275,8 +285,8 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 ```python
 import pointblank as pb
 
-# Load the small_table dataset
-small_table = pb.load_dataset()
+# Load the `small_table` dataset
+small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")
 
 # Preview the table
 pb.preview(small_table)
@@ -342,7 +352,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 brief). Here's an example of a global setting for briefs:
 
 ```python
-validation = (
+validation_2 = (
     pb.Validate(
         data=pb.load_dataset(),
         tbl_name="small_table",
@@ -359,7 +369,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
     .interrogate()
 )
 
-validation
+validation_2
 ```
 
 We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore,
@@ -377,27 +387,130 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 the data extracts for each validation step.
 
 ```python
-validation.get_data_extracts()
+validation_2.get_data_extracts()
 ```
 
 We can also view step reports for each validation step using the
 [`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the
 type of validation step and shows the relevant information for a step's validation.
 
 ```python
-validation.get_step_report(i=2)
+validation_2.get_step_report(i=2)
 ```
 
 The `Validate` class also has a method for getting the sundered data, which is the data that
 passed or failed the validation steps. This can be done using the
 [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.
 
 ```python
-pb.preview(validation.get_sundered_data())
+pb.preview(validation_2.get_sundered_data())
 ```
 
 The sundered data is a DataFrame that contains the rows that passed or failed the validation.
 The default behavior is to return the rows that failed the validation, as shown above.
+
+### Working with CSV Files
+
+The `Validate` class can directly accept CSV file paths, making it easy to validate data stored
+in CSV files without manual loading:
+
+```python
+# Get a path to a CSV file from the package data
+csv_path = pb.get_data_path("global_sales", "csv")
+
+validation_3 = (
+    pb.Validate(
+        data=csv_path,
+        label="CSV validation example"
+    )
+    .col_exists(["customer_id", "product_id", "revenue"])
+    .col_vals_not_null(["customer_id", "product_id"])
+    .col_vals_gt(columns="revenue", value=0)
+    .interrogate()
+)
+
+validation_3
+```
+
+You can also work with the game revenue dataset using a `Path` object:
+
+```python
+from pathlib import Path
+
+csv_file = Path(pb.get_data_path("game_revenue", "csv"))
+
+validation_4 = (
+    pb.Validate(data=csv_file, label="Game Revenue Validation")
+    .col_exists(["player_id", "session_id", "item_name"])
+    .col_vals_regex(
+        columns="session_id",
+        pattern=r"[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}"
+    )
+    .col_vals_gt(columns="item_revenue", value=0, na_pass=True)
+    .interrogate()
+)
+
+validation_4
+```
+
+CSV loading is automatic: when a string or `Path` with a `.csv` extension is provided,
+Pointblank will load the file using the best available DataFrame library (Polars
+preferred, Pandas as fallback). The loaded data can then be used with all validation methods
+just like any other supported table type.
+
+### Working with Parquet Files
+
+The `Validate` class can directly accept Parquet files and datasets in various formats. The
+following examples illustrate how to validate Parquet files:
+
+```python
+# Single Parquet file from package data
+parquet_path = pb.get_data_path("nycflights", "parquet")
+
+validation_5 = (
+    pb.Validate(
+        data=parquet_path,
+        tbl_name="NYC Flights Data"
+    )
+    .col_vals_not_null(["carrier", "origin", "dest"])
+    .col_vals_gt(columns="distance", value=0)
+    .interrogate()
+)
+
+validation_5
+```
+
+You can also use glob patterns and directories. Here are some examples of how to:
+
+1. load multiple Parquet files
+2. load a Parquet-containing directory
+3. load a partitioned Parquet dataset
+
+```python
+# Multiple Parquet files with glob patterns
+validation_6 = pb.Validate(data="data/sales_*.parquet")
+
+# Directory containing Parquet files
+validation_7 = pb.Validate(data="parquet_data/")
+
+# Partitioned Parquet dataset
+validation_8 = (
+    pb.Validate(data="sales_data/")  # Contains year=2023/quarter=Q1/region=US/sales.parquet
+    .col_exists(["transaction_id", "amount", "year", "quarter", "region"])
+    .interrogate()
+)
+```
+
+When you point to a directory that contains a partitioned Parquet dataset (with subdirectories
+like `year=2023/quarter=Q1/region=US/`), Pointblank will automatically:
+
+- discover all Parquet files recursively
+- extract partition column values from directory paths
+- add partition columns to the final DataFrame
+- combine all partitions into a single table for validation
+
+Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with
+either DataFrame library. The loading preference is Polars first, then Pandas as a fallback.
 
 
 Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
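
To make the partition-column behavior described in the *Working with Parquet Files* section above concrete, here is a minimal sketch of the equivalent manual load with Pandas (Polars behaves similarly); the `sales_data/` layout is the hypothetical one from the example above, and Pointblank's internal loader may differ in detail:

```python
import pandas as pd

# Hypothetical layout from the example above:
#   sales_data/year=2023/quarter=Q1/region=US/sales.parquet
# Reading the directory discovers the Parquet files recursively and adds
# the `year`, `quarter`, and `region` partition keys as columns.
sales = pd.read_parquet("sales_data/")

# The partition columns appear alongside the columns stored in the files.
print(sales.columns.tolist())
```
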
@@ -9172,6 +9285,104 @@ load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'glo
 regions: North America, Europe, or Asia.
 
 
+get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', file_type: "Literal['csv', 'parquet', 'duckdb']" = 'csv') -> 'str'
+
+Get the file path to a dataset included with the Pointblank package.
+
+This function provides direct access to the file paths of datasets included with Pointblank.
+These paths can be used in examples and documentation to demonstrate file-based data loading
+without requiring the actual data files. The returned paths can be used with
+`Validate(data=path)` to demonstrate CSV and Parquet file loading capabilities.
+
+Parameters
+----------
+dataset
+The name of the dataset to get the path for. Current options are `"small_table"`,
+`"game_revenue"`, `"nycflights"`, and `"global_sales"`.
+file_type
+The file format to get the path for. Options are `"csv"`, `"parquet"`, or `"duckdb"`.
+
+Returns
+-------
+str
+The file path to the requested dataset file.
+
+Included Datasets
+-----------------
+The available datasets are the same as those in [`load_dataset()`](`pointblank.load_dataset`):
+
+- `"small_table"`: A small dataset with 13 rows and 8 columns. Ideal for testing and examples.
+- `"game_revenue"`: A dataset with 2000 rows and 11 columns. Revenue data for a game company.
+- `"nycflights"`: A dataset with 336,776 rows and 18 columns. Flight data from NYC airports.
+- `"global_sales"`: A dataset with 50,000 rows and 20 columns. Global sales data across regions.
+
+File Types
+----------
+Each dataset is available in multiple formats:
+
+- `"csv"`: Comma-separated values file (`.csv`)
+- `"parquet"`: Parquet file (`.parquet`)
+- `"duckdb"`: DuckDB database file (`.ddb`)
+
+Examples
+--------
+Get the path to a CSV file and use it with `Validate`:
+
+```python
+import pointblank as pb
+
+# Get path to the small_table CSV file
+csv_path = pb.get_data_path("small_table", "csv")
+print(csv_path)
+
+# Use the path directly with Validate
+validation = (
+    pb.Validate(data=csv_path)
+    .col_exists(["a", "b", "c"])
+    .col_vals_gt(columns="d", value=0)
+    .interrogate()
+)
+
+validation
+```
+
+Get a Parquet file path for validation examples:
+
+```python
+# Get path to the game_revenue Parquet file
+parquet_path = pb.get_data_path(dataset="game_revenue", file_type="parquet")
+
+# Validate the Parquet file directly
+validation = (
+    pb.Validate(data=parquet_path, label="Game Revenue Data Validation")
+    .col_vals_not_null(columns=["player_id", "session_id"])
+    .col_vals_gt(columns="item_revenue", value=0)
+    .interrogate()
+)
+
+validation
+```
+
+This is particularly useful for documentation examples where you want to demonstrate
+file-based workflows without requiring users to have specific data files:
+
+```python
+# Example showing CSV file validation
+sales_csv = pb.get_data_path(dataset="global_sales", file_type="csv")
+
+validation = (
+    pb.Validate(data=sales_csv, label="Sales Data Validation")
+    .col_exists(["customer_id", "product_id", "amount"])
+    .col_vals_regex(columns="customer_id", pattern=r"CUST_[0-9]{6}")
+    .interrogate()
+)
+```
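
The `"duckdb"` file type is listed above but not shown in an example, and a `.ddb` path is not documented here as direct input to `Validate()`. A minimal sketch, assuming the database file exposes a table named after the dataset (an assumption this document does not confirm), is to open it with Ibis and validate the resulting table object:

```python
import ibis
import pointblank as pb

# Path to the small_table DuckDB database file
ddb_path = pb.get_data_path("small_table", "duckdb")

# Open the database with Ibis (v9.5.0+ per the requirement noted earlier)
con = ibis.duckdb.connect(ddb_path)

# Assumed table name inside the `.ddb` file; adjust if the database differs
small_table = con.table("small_table")

validation = (
    pb.Validate(data=small_table, tbl_name="small_table")
    .col_exists(["a", "b", "c"])
    .col_vals_gt(columns="d", value=0)
    .interrogate()
)

validation
```
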
+
+See Also
+--------
+[`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects.
+
 
 ## The Utility Functions family
 