@@ -43,11 +43,13 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
----------
data
The table to validate, which could be a DataFrame object, an Ibis table object, a CSV
- file path, or a Parquet file path. When providing a CSV or Parquet file path (as a string
- or `pathlib.Path` object), the file will be automatically loaded using an available
- DataFrame library (Polars or Pandas). Parquet input also supports glob patterns,
- directories containing .parquet files, and Spark-style partitioned datasets. Read the
- *Supported Input Table Types* section for details on the supported table types.
+ file path, a Parquet file path, or a database connection string. When providing a CSV or
+ Parquet file path (as a string or `pathlib.Path` object), the file will be automatically
+ loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports
+ glob patterns, directories containing .parquet files, and Spark-style partitioned datasets.
+ Connection strings enable direct database access via Ibis with optional table specification
+ using the `::table_name` suffix. Read the *Supported Input Table Types* section for details
+ on the supported table types.

tbl_name
An optional name to assign to the input table object. If no value is provided, a name will
be generated based on whatever information is available. This table name will be displayed
@@ -120,6 +122,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
extension, or partitioned dataset)
+ - Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of
`ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires
@@ -130,6 +133,20 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
provided. The file will be automatically detected and loaded using the best available DataFrame
library. The loading preference is Polars first, then Pandas as a fallback.

+ Connection strings follow database URL formats and must also specify a table using the
+ `::table_name` suffix. Examples include:
+
+ ```
+ "duckdb:///path/to/database.ddb::table_name"
+ "sqlite:///path/to/database.db::table_name"
+ "postgresql://user:password@localhost:5432/database::table_name"
+ "mysql://user:password@localhost:3306/database::table_name"
+ "bigquery://project/dataset::table_name"
+ "snowflake://user:password@account/database/schema::table_name"
+ ```
+
+ When using connection strings, the Ibis library with the appropriate backend driver is required.
+

Thresholds
----------
The `thresholds=` parameter is used to set the failure-condition levels for all validation
@@ -512,6 +529,33 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None

Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with
either DataFrame library. The loading preference is Polars first, then Pandas as a fallback.
+
+ ### Working with Database Connection Strings
+
+ The `Validate` class supports database connection strings for direct validation of database
+ tables. Connection strings must specify a table using the `::table_name` suffix:
+
+ ```python
+ # Get path to a DuckDB database file from package data
+ duckdb_path = pb.get_data_path("game_revenue", "duckdb")
+
+ validation_9 = (
+     pb.Validate(
+         data=f"duckdb:///{duckdb_path}::game_revenue",
+         label="DuckDB Game Revenue Validation"
+     )
+     .col_exists(["player_id", "session_id", "item_revenue"])
+     .col_vals_gt(columns="item_revenue", value=0)
+     .interrogate()
+ )
+
+ validation_9
+ ```
+
+ For comprehensive documentation on supported connection string formats, error handling, and
+ installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`)
+ function. This function handles all the connection logic and provides helpful error messages
+ when table specifications are missing or backend dependencies are not installed.

Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None
@@ -8802,8 +8846,14 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
Parameters
----------
data
- The table to preview, which could be a DataFrame object or an Ibis table object. Read the
- *Supported Input Table Types* section for details on the supported table types.
+ The table to preview, which could be a DataFrame object, an Ibis table object, a CSV
+ file path, a Parquet file path, or a database connection string. When providing a CSV or
+ Parquet file path (as a string or `pathlib.Path` object), the file will be automatically
+ loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports
+ glob patterns, directories containing .parquet files, and Spark-style partitioned datasets.
+ Connection strings enable direct database access via Ibis with optional table specification
+ using the `::table_name` suffix. Read the *Supported Input Table Types* section for details
+ on the supported table types.
columns_subset
The columns to display in the table, by default `None` (all columns are shown). This can
be a string, a list of strings, a `Column` object, or a `ColumnSelector` object. The latter
@@ -8854,12 +8904,34 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
- PySpark table (`"pyspark"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+ extension, or partitioned dataset)
+ - Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of
`ibis.expr.types.relations.Table`). Furthermore, using `preview()` with these types of tables
requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or
Pandas DataFrame, the availability of Ibis is not needed.

+ To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+ provided. The file will be automatically detected and loaded using the best available DataFrame
+ library. The loading preference is Polars first, then Pandas as a fallback.
+
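The Polars-first, Pandas-fallback preference described above can be pictured with a short sketch (illustrative only, not the library's actual loader):

```python
def load_csv(path):
    # Prefer Polars for CSV loading; fall back to Pandas when Polars is absent
    try:
        import polars as pl
        return pl.read_csv(path)
    except ImportError:
        import pandas as pd
        return pd.read_csv(path)
```

Both libraries return a DataFrame with the same row count, so downstream validation logic does not need to know which one was used.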
+ Connection strings follow database URL formats and must also specify a table using the
+ `::table_name` suffix. Examples include:
+
+ ```
+ "duckdb:///path/to/database.ddb::table_name"
+ "sqlite:///path/to/database.db::table_name"
+ "postgresql://user:password@localhost:5432/database::table_name"
+ "mysql://user:password@localhost:3306/database::table_name"
+ "bigquery://project/dataset::table_name"
+ "snowflake://user:password@account/database/schema::table_name"
+ ```
+
+ When using connection strings, the Ibis library with the appropriate backend driver is required.
+

Examples
--------
It's easy to preview a table using the `preview()` function. Here's an example using the
@@ -8918,6 +8990,39 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
columns_subset=pb.col(pb.starts_with("item") | pb.matches("player"))
)
```
+
+ ### Working with CSV Files
+
+ The `preview()` function can directly accept CSV file paths, making it easy to preview data
+ stored in CSV files without manual loading:
+
+ You can also use a Path object to specify the CSV file:
+
+ ### Working with Parquet Files
+
+ The `preview()` function can directly accept Parquet files and datasets in various formats:
+
+ You can also use glob patterns and directories:
+
+ ```python
+ # Multiple Parquet files with glob patterns
+ pb.preview("data/sales_*.parquet")
+
+ # Directory containing Parquet files
+ pb.preview("parquet_data/")
+
+ # Partitioned Parquet dataset
+ pb.preview("sales_data/")  # Auto-discovers partition columns
+ ```
+
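Under the hood, this kind of input resolution amounts to expanding globs and walking directories. A rough pure-`pathlib` sketch of the idea (a hypothetical helper, not pointblank's actual code):

```python
from pathlib import Path

def discover_parquet(source: str) -> list[Path]:
    """Resolve a glob pattern or directory into a sorted list of .parquet files."""
    p = Path(source)
    if p.is_dir():
        # rglob also descends into Spark-style key=value partition directories
        return sorted(p.rglob("*.parquet"))
    # Otherwise treat the input as a glob pattern relative to its parent directory
    return sorted(p.parent.glob(p.name))
```

Recursive discovery is what makes partitioned layouts like `sales_data/region=eu/part0.parquet` work without listing each file explicitly.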
+ ### Working with Database Connection Strings
+
+ The `preview()` function supports database connection strings for direct preview of database
+ tables. Connection strings must specify a table using the `::table_name` suffix:
+
+ For comprehensive documentation on supported connection string formats, error handling, and
+ installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`)
+ function.

col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'