@@ -42,7 +42,11 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
Parameters
----------
data
-     The table to validate, which could be a DataFrame object or an Ibis table object. Read the
+     The table to validate, which could be a DataFrame object, an Ibis table object, a CSV
+     file path, or a Parquet file path. When providing a CSV or Parquet file path (as a string
+     or `pathlib.Path` object), the file will be automatically loaded using an available
+     DataFrame library (Polars or Pandas). Parquet input also supports glob patterns,
+     directories containing `.parquet` files, and Spark-style partitioned datasets. Read the
    *Supported Input Table Types* section for details on the supported table types.
tbl_name
    An optional name to assign to the input table object. If no value is provided, a name will
@@ -113,12 +117,18 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
- PySpark table (`"pyspark"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory containing `.parquet` files, or Spark-style partitioned dataset)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of
`ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires
the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas
DataFrame, the Ibis library is not required.
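
A minimal sketch of one way to prepare such a table, assuming Ibis v9.5.0+ is installed: the
`ibis.memtable()` function builds a small in-memory Ibis table that can be passed as `data=`
(the table contents below are illustrative only).

```python
import ibis
import pointblank as pb

# an in-memory Ibis table (type: ibis.expr.types.relations.Table)
tbl = ibis.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})

validation = (
    pb.Validate(data=tbl, tbl_name="memtable_example")
    .col_vals_gt(columns="a", value=0)
    .interrogate()
)
```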

+ To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+ provided. The file will be automatically detected and loaded using the best available DataFrame
+ library. The loading preference is Polars first, then Pandas as a fallback.
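+
+ That preference is roughly equivalent to the following hand-rolled logic (a sketch for
+ illustration only, not Pointblank's actual internals; the file name is hypothetical):
+
+ ```python
+ try:
+     import polars as pl
+     df = pl.read_csv("data.csv")   # preferred: Polars
+ except ImportError:
+     import pandas as pd
+     df = pd.read_csv("data.csv")   # fallback: Pandas
+ ```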
+

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for all validation
@@ -275,8 +285,8 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
```python
import pointblank as pb

- # Load the small_table dataset
- small_table = pb.load_dataset()
+ # Load the `small_table` dataset
+ small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

# Preview the table
pb.preview(small_table)
@@ -342,7 +352,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
brief). Here's an example of a global setting for briefs:

```python
- validation = (
+ validation_2 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
@@ -359,7 +369,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
    .interrogate()
)

- validation
+ validation_2
```

We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore,
@@ -377,27 +387,130 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
the data extracts for each validation step.

```python
- validation.get_data_extracts()
+ validation_2.get_data_extracts()
```

We can also view step reports for each validation step using the
[`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the
type of validation step and shows the relevant information for a step's validation.

```python
- validation.get_step_report(i=2)
+ validation_2.get_step_report(i=2)
```

The `Validate` class also has a method for getting the sundered data, which is the data that
passed or failed the validation steps. This can be done using the
[`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method.

```python
- pb.preview(validation.get_sundered_data())
+ pb.preview(validation_2.get_sundered_data())
```

The sundered data is a DataFrame that contains the rows that passed or failed the validation.
The default behavior is to return the rows that failed the validation, as shown above.
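
If the `type=` argument of `get_sundered_data()` also accepts `"pass"` (an assumption here,
mirroring the R version of pointblank rather than anything stated in this docstring), the
passing rows can be previewed the same way:

```python
# hypothetical: the type="pass" argument is assumed, not confirmed above
pb.preview(validation_2.get_sundered_data(type="pass"))
```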
+
+ ### Working with CSV Files
+
+ The `Validate` class can directly accept CSV file paths, making it easy to validate data stored
+ in CSV files without manual loading:
+
+ ```python
+ # Get a path to a CSV file from the package data
+ csv_path = pb.get_data_path("global_sales", "csv")
+
+ validation_3 = (
+     pb.Validate(
+         data=csv_path,
+         label="CSV validation example"
+     )
+     .col_exists(["customer_id", "product_id", "revenue"])
+     .col_vals_not_null(["customer_id", "product_id"])
+     .col_vals_gt(columns="revenue", value=0)
+     .interrogate()
+ )
+
+ validation_3
+ ```
+
+ You can also work with the game revenue dataset using a Path object:
+
+ ```python
+ from pathlib import Path
+
+ csv_file = Path(pb.get_data_path("game_revenue", "csv"))
+
+ validation_4 = (
+     pb.Validate(data=csv_file, label="Game Revenue Validation")
+     .col_exists(["player_id", "session_id", "item_name"])
+     .col_vals_regex(
+         columns="session_id",
+         pattern=r"[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}"
+     )
+     .col_vals_gt(columns="item_revenue", value=0, na_pass=True)
+     .interrogate()
+ )
+
+ validation_4
+ ```
+
+ The CSV loading is automatic: when a string or Path with a `.csv` extension is provided,
+ Pointblank loads the file using the best available DataFrame library (Polars preferred,
+ Pandas as fallback). The loaded data can then be used with all validation methods just like
+ any other supported table type.
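+
+ Passing the path is therefore effectively the same as loading the file yourself and handing
+ over the resulting DataFrame (a sketch for illustration; the manual route below uses Polars
+ and assumes it is installed):
+
+ ```python
+ import polars as pl
+
+ # equivalent in outcome: path-based loading vs. manual loading
+ v_from_path = pb.Validate(data=csv_path)
+ v_from_df = pb.Validate(data=pl.read_csv(csv_path))
+ ```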
+
+ ### Working with Parquet Files
+
+ The `Validate` class can directly accept Parquet files and datasets in various formats. The
+ following examples illustrate how to validate Parquet files:
+
+ ```python
+ # Single Parquet file from package data
+ parquet_path = pb.get_data_path("nycflights", "parquet")
+
+ validation_5 = (
+     pb.Validate(
+         data=parquet_path,
+         tbl_name="NYC Flights Data"
+     )
+     .col_vals_not_null(["carrier", "origin", "dest"])
+     .col_vals_gt(columns="distance", value=0)
+     .interrogate()
+ )
+
+ validation_5
+ ```
+
+ You can also use glob patterns and directories. Here are examples showing how to:
+
+ 1. load multiple Parquet files
+ 2. load a directory containing Parquet files
+ 3. load a partitioned Parquet dataset
+
+ ```python
+ # Multiple Parquet files via a glob pattern
+ validation_6 = pb.Validate(data="data/sales_*.parquet")
+
+ # Directory containing Parquet files
+ validation_7 = pb.Validate(data="parquet_data/")
+
+ # Partitioned Parquet dataset
+ validation_8 = (
+     pb.Validate(data="sales_data/")  # contains year=2023/quarter=Q1/region=US/sales.parquet
+     .col_exists(["transaction_id", "amount", "year", "quarter", "region"])
+     .interrogate()
+ )
+ ```
+
+ When you point to a directory that contains a partitioned Parquet dataset (with subdirectories
+ like `year=2023/quarter=Q1/region=US/`), Pointblank will automatically:
+
+ - discover all Parquet files recursively
+ - extract partition column values from directory paths
+ - add partition columns to the final DataFrame
+ - combine all partitions into a single table for validation
+
+ Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with
+ either DataFrame library. The loading preference is Polars first, then Pandas as a fallback.
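+
+ To see that partition handling end to end, here is a sketch that writes a tiny Hive-style
+ partitioned dataset with PyArrow and validates it (the paths, column names, and values are
+ illustrative, and PyArrow is assumed to be installed):
+
+ ```python
+ import pyarrow as pa
+ import pyarrow.parquet as pq
+
+ # write sales_data/year=2023/... with "year" encoded in the directory name
+ tbl = pa.table({"transaction_id": [1, 2], "amount": [9.5, 12.0], "year": [2023, 2023]})
+ pq.write_to_dataset(tbl, root_path="sales_data", partition_cols=["year"])
+
+ # "year" is recovered from the directory paths and added back as a column
+ pb.Validate(data="sales_data/").col_exists(["transaction_id", "amount", "year"]).interrogate()
+ ```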


Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
@@ -9172,6 +9285,104 @@ load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'glo
regions: North America, Europe, or Asia.


+ get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', file_type: "Literal['csv', 'parquet', 'duckdb']" = 'csv') -> 'str'
+
+ Get the file path to a dataset included with the Pointblank package.
+
+ This function provides direct access to the file paths of datasets included with Pointblank.
+ These paths can be used in examples and documentation to demonstrate file-based data loading
+ without requiring users to supply their own data files. The returned paths can be used with
+ `Validate(data=path)` to demonstrate CSV and Parquet file loading capabilities.
+
+ Parameters
+ ----------
+ dataset
+     The name of the dataset to get the path for. Current options are `"small_table"`,
+     `"game_revenue"`, `"nycflights"`, and `"global_sales"`.
+ file_type
+     The file format to get the path for. Options are `"csv"`, `"parquet"`, or `"duckdb"`.
+
+ Returns
+ -------
+ str
+     The file path to the requested dataset file.
+
+ Included Datasets
+ -----------------
+ The available datasets are the same as those in [`load_dataset()`](`pointblank.load_dataset`):
+
+ - `"small_table"`: A small dataset with 13 rows and 8 columns. Ideal for testing and examples.
+ - `"game_revenue"`: A dataset with 2000 rows and 11 columns. Revenue data for a game company.
+ - `"nycflights"`: A dataset with 336,776 rows and 18 columns. Flight data from NYC airports.
+ - `"global_sales"`: A dataset with 50,000 rows and 20 columns. Global sales data across regions.
+
+ File Types
+ ----------
+ Each dataset is available in multiple formats:
+
+ - `"csv"`: Comma-separated values file (`.csv`)
+ - `"parquet"`: Parquet file (`.parquet`)
+ - `"duckdb"`: DuckDB database file (`.ddb`)
+
+ Examples
+ --------
+ Get the path to a CSV file and use it with `Validate`:
+
+ ```python
+ import pointblank as pb
+
+ # Get path to the small_table CSV file
+ csv_path = pb.get_data_path("small_table", "csv")
+ print(csv_path)
+
+ # Use the path directly with Validate
+ validation = (
+     pb.Validate(data=csv_path)
+     .col_exists(["a", "b", "c"])
+     .col_vals_gt(columns="d", value=0)
+     .interrogate()
+ )
+
+ validation
+ ```
+
+ Get a Parquet file path for validation examples:
+
+ ```python
+ # Get path to the game_revenue Parquet file
+ parquet_path = pb.get_data_path(dataset="game_revenue", file_type="parquet")
+
+ # Validate the Parquet file directly
+ validation = (
+     pb.Validate(data=parquet_path, label="Game Revenue Data Validation")
+     .col_vals_not_null(columns=["player_id", "session_id"])
+     .col_vals_gt(columns="item_revenue", value=0)
+     .interrogate()
+ )
+
+ validation
+ ```
+
+ This is particularly useful for documentation examples where you want to demonstrate
+ file-based workflows without requiring users to have specific data files:
+
+ ```python
+ # Example showing CSV file validation
+ sales_csv = pb.get_data_path(dataset="global_sales", file_type="csv")
+
+ validation = (
+     pb.Validate(data=sales_csv, label="Sales Data Validation")
+     .col_exists(["customer_id", "product_id", "amount"])
+     .col_vals_regex(columns="customer_id", pattern=r"CUST_[0-9]{6}")
+     .interrogate()
+ )
+ ```
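+
+ For the `"duckdb"` file type, one possible route is to open the returned `.ddb` file through
+ Ibis (a sketch only: it assumes the Ibis DuckDB backend is installed and that the database
+ contains a table named `small_table`, neither of which is stated above):
+
+ ```python
+ import ibis
+
+ ddb_path = pb.get_data_path(dataset="small_table", file_type="duckdb")
+ con = ibis.duckdb.connect(ddb_path)
+ small_table = con.table("small_table")   # assumed table name inside the database
+
+ pb.Validate(data=small_table).col_vals_gt(columns="d", value=0).interrogate()
+ ```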
+
+ See Also
+ --------
+ [`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects.
+
+

## The Utility Functions family
