Commit b8a3244 ("fix readme")
1 parent: cbf2f3f

1 file changed: README.md (153 additions, 33 deletions)
@@ -6,20 +6,20 @@ PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library
 
 ## What's New in PyDeequ 2.0
 
-PyDeequ 2.0 introduces a new architecture using **Spark Connect**, bringing significant improvements:
+PyDeequ 2.0 introduces a new multi-engine architecture with **DuckDB** and **Spark Connect** backends:
 
 | Feature | PyDeequ 1.x | PyDeequ 2.0 |
 |---------|-------------|-------------|
-| Communication | Py4J (JVM bridge) | Spark Connect (gRPC) |
+| Backends | Spark only (Py4J) | DuckDB, Spark Connect |
+| JVM Required | Yes | No (DuckDB) / Yes (Spark) |
 | Assertions | Python lambdas | Serializable predicates |
-| Spark Session | Local only | Local or Remote |
-| Architecture | Tight JVM coupling | Clean client-server |
+| Remote Execution | No | Yes (Spark Connect) |
 
 **Key Benefits:**
-- **No Py4J dependency** - Uses Spark Connect protocol for communication
+- **DuckDB backend** - Lightweight, no JVM required, perfect for local development and CI/CD
+- **Spark Connect backend** - Production-scale processing with remote cluster support
 - **Serializable predicates** - Replace Python lambdas with predicate objects (`eq`, `gte`, `between`, etc.)
-- **Remote execution** - Connect to remote Spark clusters via Spark Connect
-- **Cleaner API** - Simplified imports and more Pythonic interface
+- **Unified API** - Same code works with both backends
 
 ### Architecture
 
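An aside on the "serializable predicates" point above: a predicate object carries its comparison operator and threshold as plain data, so it can be serialized and evaluated by a remote engine, which a Python lambda cannot. A minimal illustrative sketch of the idea (this is not the actual `pydeequ.v2.predicates` implementation; the `Gte` class and its methods here are hypothetical):

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for a serializable predicate such as gte(0.5).
# Unlike a lambda, the operator and threshold are plain data, so the
# predicate can be serialized, shipped to a server, and evaluated there.
@dataclass(frozen=True)
class Gte:
    threshold: float

    def to_json(self) -> str:
        # Encode the predicate as data rather than code.
        return json.dumps({"op": "gte", "value": self.threshold})

    def evaluate(self, metric: float) -> bool:
        return metric >= self.threshold

p = Gte(0.5)
print(p.to_json())       # {"op": "gte", "value": 0.5}
print(p.evaluate(0.75))  # True
```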
@@ -46,33 +46,136 @@ flowchart LR
 
 ### Feature Support Matrix
 
-| Feature | PyDeequ 1.x | PyDeequ 2.0 |
-|---------|:-----------:|:-----------:|
-| **Constraint Verification** | | |
-| VerificationSuite | Yes | Yes |
-| Check constraints | Yes | Yes |
-| Custom SQL expressions | Yes | Yes |
-| **Metrics & Analysis** | | |
-| AnalysisRunner | Yes | Yes |
-| All standard analyzers | Yes | Yes |
-| **Column Profiling** | | |
-| ColumnProfilerRunner | Yes | Yes |
-| Numeric statistics | Yes | Yes |
-| KLL sketch profiling | Yes | Yes |
-| Low-cardinality histograms | Yes | Yes |
-| **Constraint Suggestions** | | |
-| ConstraintSuggestionRunner | Yes | Yes |
-| Rule sets (DEFAULT, EXTENDED, etc.) | Yes | Yes |
-| Train/test split evaluation | Yes | Yes |
-| **Metrics Repository** | | |
-| FileSystemMetricsRepository | Yes | Planned |
-| **Execution Mode** | | |
-| Local Spark | Yes | No |
-| Spark Connect (remote) | No | Yes |
+| Feature | PyDeequ 1.x | PyDeequ 2.0 (DuckDB) | PyDeequ 2.0 (Spark) |
+|---------|:-----------:|:--------------------:|:-------------------:|
+| **Constraint Verification** | | | |
+| VerificationSuite | Yes | Yes | Yes |
+| Check constraints | Yes | Yes | Yes |
+| Custom SQL expressions | Yes | Yes | Yes |
+| **Metrics & Analysis** | | | |
+| AnalysisRunner | Yes | Yes | Yes |
+| All standard analyzers | Yes | Yes | Yes |
+| **Column Profiling** | | | |
+| ColumnProfilerRunner | Yes | Yes | Yes |
+| Numeric statistics | Yes | Yes | Yes |
+| KLL sketch profiling | Yes | No | Yes |
+| Low-cardinality histograms | Yes | Yes | Yes |
+| **Constraint Suggestions** | | | |
+| ConstraintSuggestionRunner | Yes | Yes | Yes |
+| Rule sets (DEFAULT, EXTENDED, etc.) | Yes | Yes | Yes |
+| Train/test split evaluation | Yes | No | Yes |
+| **Metrics Repository** | | | |
+| FileSystemMetricsRepository | Yes | Planned | Planned |
+| **Execution Environment** | | | |
+| JVM Required | Yes | No | Yes |
+| Local execution | Yes | Yes | Yes |
+| Remote execution | No | No | Yes |
 
 ---
 
-## PyDeequ 2.0 Beta - Quick Start
+## Installation
+
+PyDeequ 2.0 supports multiple backends. Install only what you need:
+
+**From PyPI (when published):**
+```bash
+# DuckDB backend (lightweight, no JVM required)
+pip install pydeequ[duckdb]
+
+# Spark Connect backend (for production-scale processing)
+pip install pydeequ[spark]
+
+# Both backends
+pip install pydeequ[all]
+
+# Development (includes all backends + test tools)
+pip install pydeequ[dev]
+```
+
+**From GitHub Release (beta):**
+```bash
+# Install beta wheel + DuckDB
+pip install https://github.com/awslabs/python-deequ/releases/download/v2.0.0b1/pydeequ-2.0.0b1-py3-none-any.whl
+pip install duckdb
+
+# For Spark backend, also install:
+pip install pyspark[connect]==3.5.0
+```
+
+---
+
+## Quick Start with DuckDB (Recommended for Getting Started)
+
+The DuckDB backend is the easiest way to get started - no JVM or Spark server required.
+
+### Requirements
+- Python 3.9+
+
+### Installation
+
+```bash
+pip install pydeequ[duckdb]
+```
+
+### Run Your First Check
+
+```python
+import duckdb
+import pydeequ
+from pydeequ.v2.analyzers import Size, Completeness, Mean
+from pydeequ.v2.checks import Check, CheckLevel
+from pydeequ.v2.predicates import eq, gte
+
+# Create a DuckDB connection and load data
+con = duckdb.connect()
+con.execute("""
+    CREATE TABLE users AS SELECT * FROM (VALUES
+        (1, 'Alice', 25),
+        (2, 'Bob', 30),
+        (3, 'Charlie', NULL)
+    ) AS t(id, name, age)
+""")
+
+# Create an engine from the connection
+engine = pydeequ.connect(con, table="users")
+
+# Run analyzers
+metrics = engine.compute_metrics([
+    Size(),
+    Completeness("id"),
+    Completeness("age"),
+    Mean("age"),
+])
+print("Metrics:")
+for m in metrics:
+    print(f"  {m.name}({m.instance}): {m.value}")
+
+# Run constraint checks
+check = (Check(CheckLevel.Error, "Data quality checks")
+    .hasSize(eq(3))
+    .isComplete("id")
+    .isComplete("name")
+    .hasCompleteness("age", gte(0.5)))
+
+results = engine.run_checks([check])
+print("\nConstraint Results:")
+for r in results:
+    print(f"  {r.constraint}: {r.constraint_status}")
+
+# Profile columns
+profiles = engine.profile_columns()
+print("\nColumn Profiles:")
+for p in profiles:
+    print(f"  {p.column}: completeness={p.completeness}, distinct={p.approx_distinct_values}")
+
+con.close()
+```
+
+---
+
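For intuition, the metrics the DuckDB quick start computes over its three sample rows can be checked by hand. Here is a plain-Python rendition of the same Size, Completeness, and Mean arithmetic (no pydeequ or DuckDB required; this only illustrates the expected numbers, assuming Mean skips NULLs as SQL AVG does):

```python
# The quick start's sample rows: (id, name, age); one age is NULL (None).
rows = [(1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", None)]

size = len(rows)                                    # Size() -> 3
ages = [age for _, _, age in rows]
non_null_ages = [a for a in ages if a is not None]
completeness_age = len(non_null_ages) / size        # Completeness("age") -> 2/3
mean_age = sum(non_null_ages) / len(non_null_ages)  # Mean("age") -> 27.5

print(size, round(completeness_age, 3), mean_age)   # 3 0.667 27.5

# So the hasCompleteness("age", gte(0.5)) constraint passes:
print(completeness_age >= 0.5)  # True
```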
+## Quick Start with Spark Connect (Production Scale)
+
+For production workloads and large-scale data processing, use the Spark Connect backend.
 
 ### Requirements
 
@@ -142,6 +245,11 @@ pip install pyspark[connect]==3.5.0
 pip install setuptools
 ```
 
+Or using the extras syntax (once published to PyPI):
+```bash
+pip install pydeequ[spark]
+```
+
 ### Step 5: Run Your First Check
 
 ```python
@@ -444,7 +552,8 @@ The legacy PyDeequ API uses Py4J for JVM communication. It is still available fo
 ### Installation
 
 ```bash
-pip install pydeequ
+# Install with Spark backend (required for 1.x API)
+pip install pydeequ[spark]
 ```
 
 **Note:** Set the `SPARK_VERSION` environment variable to match your Spark version.
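For example (the version value below is illustrative; use whatever Spark you actually installed):

```shell
# Illustrative: point PyDeequ 1.x at your installed Spark version
export SPARK_VERSION=3.5
```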
@@ -638,15 +747,26 @@ sdk install spark 3.5.0
 ### Poetry
 
 ```bash
-poetry install
+# Install all dependencies (including dev tools and both backends)
+poetry install --with dev --all-extras
+
+# Or install specific extras
+poetry install --extras duckdb   # DuckDB only
+poetry install --extras spark    # Spark only
+poetry install --extras all      # Both backends
+
 poetry update
 poetry show -o
 ```
 
 ### Running Tests Locally
 
 ```bash
+# Run all tests (requires Spark Connect server for comparison tests)
 poetry run pytest
+
+# Run DuckDB-only tests (no Spark required)
+poetry run pytest tests/engines/test_duckdb*.py tests/engines/test_operators.py
 ```
 
 ### Running Tests (Docker)
