@@ -6,20 +6,20 @@ PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library
 
 ## What's New in PyDeequ 2.0
 
-PyDeequ 2.0 introduces a new architecture using **Spark Connect**, bringing significant improvements:
+PyDeequ 2.0 introduces a new multi-engine architecture with **DuckDB** and **Spark Connect** backends:
 
 | Feature | PyDeequ 1.x | PyDeequ 2.0 |
 |---------|-------------|-------------|
-| Communication | Py4J (JVM bridge) | Spark Connect (gRPC) |
+| Backends | Spark only (Py4J) | DuckDB, Spark Connect |
+| JVM Required | Yes | No (DuckDB) / Yes (Spark) |
 | Assertions | Python lambdas | Serializable predicates |
-| Spark Session | Local only | Local or Remote |
-| Architecture | Tight JVM coupling | Clean client-server |
+| Remote Execution | No | Yes (Spark Connect) |
 
 **Key Benefits:**
-- **No Py4J dependency** - Uses Spark Connect protocol for communication
+- **DuckDB backend** - Lightweight, no JVM required, perfect for local development and CI/CD
+- **Spark Connect backend** - Production-scale processing with remote cluster support
 - **Serializable predicates** - Replace Python lambdas with predicate objects (`eq`, `gte`, `between`, etc.)
-- **Remote execution** - Connect to remote Spark clusters via Spark Connect
-- **Cleaner API** - Simplified imports and more Pythonic interface
+- **Unified API** - Same code works with both backends
 
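To see why serializable predicates matter for a client-server design, here is a minimal, self-contained sketch of the idea. `Gte` below is a hypothetical stand-in, not PyDeequ's actual class: a predicate object carries plain data that can be serialized for a wire protocol, while a Python lambda cannot.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for illustration only; not PyDeequ's actual class.
@dataclass
class Gte:
    threshold: float

    def evaluate(self, value: float) -> bool:
        # The predicate's logic is fully described by its data...
        return value >= self.threshold

    def to_dict(self) -> dict:
        # ...so it can be serialized and sent to a server, unlike a
        # lambda such as `lambda x: x >= 0.5`, which would require
        # shipping Python bytecode.
        return {"op": "gte", "threshold": self.threshold}

pred = Gte(0.5)
print(pred.evaluate(0.9))          # True
print(json.dumps(pred.to_dict()))  # {"op": "gte", "threshold": 0.5}
```

This is the design trade-off the table above describes: expressiveness of arbitrary lambdas is exchanged for predicates that survive serialization across a process boundary.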
 ### Architecture
 
@@ -46,33 +46,136 @@ flowchart LR
 
 ### Feature Support Matrix
 
-| Feature | PyDeequ 1.x | PyDeequ 2.0 |
-|---------|:-----------:|:-----------:|
-| **Constraint Verification** | | |
-| VerificationSuite | Yes | Yes |
-| Check constraints | Yes | Yes |
-| Custom SQL expressions | Yes | Yes |
-| **Metrics & Analysis** | | |
-| AnalysisRunner | Yes | Yes |
-| All standard analyzers | Yes | Yes |
-| **Column Profiling** | | |
-| ColumnProfilerRunner | Yes | Yes |
-| Numeric statistics | Yes | Yes |
-| KLL sketch profiling | Yes | Yes |
-| Low-cardinality histograms | Yes | Yes |
-| **Constraint Suggestions** | | |
-| ConstraintSuggestionRunner | Yes | Yes |
-| Rule sets (DEFAULT, EXTENDED, etc.) | Yes | Yes |
-| Train/test split evaluation | Yes | Yes |
-| **Metrics Repository** | | |
-| FileSystemMetricsRepository | Yes | Planned |
-| **Execution Mode** | | |
-| Local Spark | Yes | No |
-| Spark Connect (remote) | No | Yes |
+| Feature | PyDeequ 1.x | PyDeequ 2.0 (DuckDB) | PyDeequ 2.0 (Spark) |
+|---------|:-----------:|:--------------------:|:-------------------:|
+| **Constraint Verification** | | | |
+| VerificationSuite | Yes | Yes | Yes |
+| Check constraints | Yes | Yes | Yes |
+| Custom SQL expressions | Yes | Yes | Yes |
+| **Metrics & Analysis** | | | |
+| AnalysisRunner | Yes | Yes | Yes |
+| All standard analyzers | Yes | Yes | Yes |
+| **Column Profiling** | | | |
+| ColumnProfilerRunner | Yes | Yes | Yes |
+| Numeric statistics | Yes | Yes | Yes |
+| KLL sketch profiling | Yes | No | Yes |
+| Low-cardinality histograms | Yes | Yes | Yes |
+| **Constraint Suggestions** | | | |
+| ConstraintSuggestionRunner | Yes | Yes | Yes |
+| Rule sets (DEFAULT, EXTENDED, etc.) | Yes | Yes | Yes |
+| Train/test split evaluation | Yes | No | Yes |
+| **Metrics Repository** | | | |
+| FileSystemMetricsRepository | Yes | Planned | Planned |
+| **Execution Environment** | | | |
+| JVM Required | Yes | No | Yes |
+| Local execution | Yes | Yes | Yes |
+| Remote execution | No | No | Yes |
 
 ---
 
-## PyDeequ 2.0 Beta - Quick Start
+## Installation
+
+PyDeequ 2.0 supports multiple backends. Install only what you need:
+
+**From PyPI (when published):**
+```bash
+# DuckDB backend (lightweight, no JVM required)
+pip install pydeequ[duckdb]
+
+# Spark Connect backend (for production-scale processing)
+pip install pydeequ[spark]
+
+# Both backends
+pip install pydeequ[all]
+
+# Development (includes all backends + test tools)
+pip install pydeequ[dev]
+```
+
+**From GitHub Release (beta):**
+```bash
+# Install beta wheel + DuckDB
+pip install https://github.com/awslabs/python-deequ/releases/download/v2.0.0b1/pydeequ-2.0.0b1-py3-none-any.whl
+pip install duckdb
+
+# For Spark backend, also install:
+pip install pyspark[connect]==3.5.0
+```
+
+---
+
+## Quick Start with DuckDB (Recommended for Getting Started)
+
+The DuckDB backend is the easiest way to get started - no JVM or Spark server required.
+
+### Requirements
+- Python 3.9+
+
+### Installation
+
+```bash
+pip install pydeequ[duckdb]
+```
+
+### Run Your First Check
+
+```python
+import duckdb
+import pydeequ
+from pydeequ.v2.analyzers import Size, Completeness, Mean
+from pydeequ.v2.checks import Check, CheckLevel
+from pydeequ.v2.predicates import eq, gte
+
+# Create a DuckDB connection and load data
+con = duckdb.connect()
+con.execute("""
+    CREATE TABLE users AS SELECT * FROM (VALUES
+        (1, 'Alice', 25),
+        (2, 'Bob', 30),
+        (3, 'Charlie', NULL)
+    ) AS t(id, name, age)
+""")
+
+# Create an engine from the connection
+engine = pydeequ.connect(con, table="users")
+
+# Run analyzers
+metrics = engine.compute_metrics([
+    Size(),
+    Completeness("id"),
+    Completeness("age"),
+    Mean("age"),
+])
+print("Metrics:")
+for m in metrics:
+    print(f"{m.name} ({m.instance}): {m.value}")
+
+# Run constraint checks
+check = (Check(CheckLevel.Error, "Data quality checks")
+    .hasSize(eq(3))
+    .isComplete("id")
+    .isComplete("name")
+    .hasCompleteness("age", gte(0.5)))
+
+results = engine.run_checks([check])
+print("\nConstraint Results:")
+for r in results:
+    print(f"{r.constraint}: {r.constraint_status}")
+
+# Profile columns
+profiles = engine.profile_columns()
+print("\nColumn Profiles:")
+for p in profiles:
+    print(f"{p.column}: completeness={p.completeness}, distinct={p.approx_distinct_values}")
+
+con.close()
+```
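As an aside on what the `Completeness` metric in the example measures: it is the fraction of non-NULL values in a column. The sketch below reproduces that calculation in plain SQL, using `sqlite3` from the Python standard library so it runs without any extra dependencies (the same query works in DuckDB).

```python
import sqlite3

# Recreate the quick-start dataset: one of the three `age` values is NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT, age INTEGER)")
con.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", None)])

# COUNT(age) counts only non-NULL values; COUNT(*) counts all rows.
row = con.execute(
    "SELECT CAST(COUNT(age) AS REAL) / COUNT(*) FROM users"
).fetchone()
print(row[0])  # roughly 0.667: 2 of 3 values are non-NULL
con.close()
```

This is why `hasCompleteness("age", gte(0.5))` passes in the quick start: the computed completeness of 2/3 satisfies the `gte(0.5)` predicate.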
+
+---
+
+## Quick Start with Spark Connect (Production Scale)
+
+For production workloads and large-scale data processing, use the Spark Connect backend.
 
 ### Requirements
 
@@ -142,6 +245,11 @@ pip install pyspark[connect]==3.5.0
 pip install setuptools
 ```
 
+Or using the extras syntax (once published to PyPI):
+```bash
+pip install pydeequ[spark]
+```
+
 ### Step 5: Run Your First Check
 
 ```python
@@ -444,7 +552,8 @@ The legacy PyDeequ API uses Py4J for JVM communication. It is still available fo
 ### Installation
 
 ```bash
-pip install pydeequ
+# Install with Spark backend (required for 1.x API)
+pip install pydeequ[spark]
 ```
 
 **Note:** Set the `SPARK_VERSION` environment variable to match your Spark version.
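For example, with the Spark 3.5.0 version used elsewhere in this README, the setting would be:

```shell
export SPARK_VERSION=3.5.0
```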
@@ -638,15 +747,26 @@ sdk install spark 3.5.0
 ### Poetry
 
 ```bash
-poetry install
+# Install all dependencies (including dev tools and both backends)
+poetry install --with dev --all-extras
+
+# Or install specific extras
+poetry install --extras duckdb  # DuckDB only
+poetry install --extras spark   # Spark only
+poetry install --extras all     # Both backends
+
 poetry update
 poetry show -o
 ```
 
 ### Running Tests Locally
 
 ```bash
+# Run all tests (requires Spark Connect server for comparison tests)
 poetry run pytest
+
+# Run DuckDB-only tests (no Spark required)
+poetry run pytest tests/engines/test_duckdb*.py tests/engines/test_operators.py
 ```
 
 ### Running Tests (Docker)