Commit 30375bb

feat(Spark3Support): Adding support to pyspark 3.0 (#41)

* feat(Spark3Support): Adding support to pyspark 3.0
* feat(Spark3Support): Adding poetry lock file
* Adding docs on dev setup

1 parent bda44da commit 30375bb

29 files changed: +3998, -2108 lines

README.md

Lines changed: 104 additions & 23 deletions
@@ -1,19 +1,19 @@
# PyDeequ

PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Coverage](https://img.shields.io/badge/coverage-90%25-green)

There are four main components of Deequ:

- Metrics Computation:
  - `Profiles` leverages Analyzers to analyze each column of a dataset.
  - `Analyzers` serve here as a foundational module that computes metrics for data profiling and validation at scale.
- Constraint Suggestion:
  - Specify rules for various groups of Analyzers to be run over a dataset to return a collection of constraints suggested to run in a Verification Suite.
- Constraint Verification:
  - Perform data validation on a dataset with respect to various constraints set by you.
- Metrics Repository:
  - Allows for persistence and tracking of Deequ runs over time.

![](imgs/pydeequ_architecture.jpg)

@@ -32,9 +32,9 @@ You can install [PyDeequ via pip](https://pypi.org/project/pydeequ/).

```
pip install pydeequ
```

### Set up a PySpark session

```python
from pyspark.sql import SparkSession, Row
import pydeequ
@@ -51,7 +51,7 @@ df = spark.sparkContext.parallelize([
    Row(a="baz", b=3, c=None)]).toDF()
```
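
The session construction itself is elided by the hunk above. As a rough sketch, a PyDeequ-ready session is typically configured with the Deequ JAR on the classpath; this assumes the `pydeequ.deequ_maven_coord` and `pydeequ.f2j_maven_coord` helpers exposed by the package, and the first two rows of `df` are illustrative:

```python
from pyspark.sql import SparkSession, Row
import pydeequ

# Assumption: pydeequ exposes the Deequ Maven coordinate and its
# artifact exclusions as module-level helpers.
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),   # illustrative row
    Row(a="bar", b=2, c=6),   # illustrative row
    Row(a="baz", b=3, c=None)]).toDF()
```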

### Analyzers

```python
from pydeequ.analyzers import *
@@ -61,12 +61,12 @@ analysisResult = AnalysisRunner(spark) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```
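
Beyond `Size` and `Completeness`, the same runner chain accepts any analyzer from `pydeequ.analyzers`; a hedged sketch (the analyzer choice here is illustrative, not part of this diff):

```python
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Mean, ApproxCountDistinct

# Illustrative only: chain a couple of extra analyzers over the same df.
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Mean("b")) \
    .addAnalyzer(ApproxCountDistinct("b")) \
    .run()

AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()
```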

### Profile

```python
from pydeequ.profiles import *
@@ -79,7 +79,7 @@ for col, profile in result.profiles.items():
    print(profile)
```
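
Each `profile` above is a per-column profile object; a hedged sketch of reading individual fields from it (the attribute names are an assumption about the profiles API, not shown in this diff):

```python
# Assumption: column profiles expose completeness, an approximate distinct
# count, and the inferred data type as attributes.
for col, profile in result.profiles.items():
    print(col, profile.completeness, profile.approximateNumDistinctValues, profile.dataType)
```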

### Constraint Suggestions

```python
from pydeequ.suggestions import *
@@ -90,10 +90,10 @@ suggestionResult = ConstraintSuggestionRunner(spark) \
    .run()

# Constraint Suggestions in JSON format
print(suggestionResult)
```
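
Because the result comes back as JSON, individual suggestions can be walked programmatically; a sketch assuming the payload is a dict with a `constraint_suggestions` list (the key names are an assumption, not confirmed by this diff):

```python
# Assumption: suggestionResult looks like
# {"constraint_suggestions": [{"column_name": ..., "description": ...,
#                              "code_for_constraint": ...}, ...]}
for s in suggestionResult.get("constraint_suggestions", []):
    print(f"{s['column_name']}: {s['description']}")
```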

### Constraint Verification

```python
from pydeequ.checks import *
@@ -111,14 +111,14 @@ checkResult = VerificationSuite(spark) \
    .isContainedIn("a", ["foo", "bar", "baz"]) \
    .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```
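
The hunk elides the `Check` that feeds this suite; a minimal sketch of the missing setup, assuming the `Check`/`CheckLevel` classes in `pydeequ.checks` and the `VerificationSuite`/`VerificationResult` pair in `pydeequ.verification` (the check level and description are illustrative):

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Illustrative check; the constraints mirror the tail of the hunk above.
check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.isContainedIn("a", ["foo", "bar", "baz"]) \
             .isNonNegative("b")) \
    .run()

VerificationResult.checkResultsAsDataFrame(spark, checkResult).show()
```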

### Repository

Save to a Metrics Repository by adding the `useRepository()` and `saveOrAppendResult()` calls to your Analysis Runner.

```python
from pydeequ.repository import *
from pydeequ.analyzers import *
@@ -136,19 +136,100 @@ analysisResult = AnalysisRunner(spark) \
    .run()
```
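
The hunk elides the repository wiring that `useRepository()` and `saveOrAppendResult()` need; a hedged sketch of the full runner, assuming `FileSystemMetricsRepository` and `ResultKey` from `pydeequ.repository` (the file name and tag values are illustrative):

```python
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pydeequ.analyzers import AnalysisRunner, ApproxCountDistinct

# Illustrative wiring: persist metrics to a JSON file on the filesystem.
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
result_key = ResultKey(spark, ResultKey.current_milli_time(), {'tag': 'pydeequ-example'})

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(ApproxCountDistinct('b')) \
    .useRepository(repository) \
    .saveOrAppendResult(result_key) \
    .run()
```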

To load previous runs, use the `repository` object to load the results back in.

```python
result_metrep_df = repository.load() \
    .before(ResultKey.current_milli_time()) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
```
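
Runs can also be filtered by the tags attached to their `ResultKey`; a hedged sketch assuming the loader mirrors Deequ's `withTagValues` filter (not shown in this diff):

```python
# Assumption: the loader supports tag-based filtering like Deequ's
# repository API does.
result_df = repository.load() \
    .withTagValues({'tag': 'pydeequ-example'}) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
result_df.show()
```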

## [Contributing](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md)

Please refer to the [contributing doc](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md) for how to contribute to PyDeequ.

## [License](https://github.com/awslabs/python-deequ/blob/master/LICENSE)

This library is licensed under the Apache 2.0 License.

## Getting Started

1. Setup [SDKMAN](#setup-sdkman)
1. Setup [Java](#setup-java)
1. Setup [Apache Spark](#setup-apache-spark)
1. Install [Poetry](#poetry)
1. Install pre-commit and [follow the instructions here](PreCommit.MD)
1. Run [tests locally](#running-tests-locally)

### Setup SDKMAN

SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based
system. It provides a convenient command-line interface for installing, switching, removing, and listing
candidates. SDKMAN! installs smoothly on macOS, Linux, WSL, Cygwin, etc., and supports the Bash and Zsh shells. See the
documentation on the [SDKMAN! website](https://sdkman.io).

Open your favourite terminal and enter the following:

```bash
$ curl -s https://get.sdkman.io | bash

# If the environment needs tweaking for SDKMAN to be installed,
# the installer will prompt you accordingly and ask you to restart.
# Next, open a new terminal or enter:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Lastly, run the following to confirm that the installation succeeded:

$ sdk version
```

### Setup Java

To install Java, open your favourite terminal and enter the following:

```bash
# List the AdoptOpenJDK versions
$ sdk list java

# To install Java 11
$ sdk install java 11.0.10.hs-adpt

# To install Java 8
$ sdk install java 8.0.292.hs-adpt
```

### Setup Apache Spark

To install Apache Spark, open your favourite terminal and enter the following:

```bash
# List the Apache Spark versions
$ sdk list spark

# To install Spark 3
$ sdk install spark 3.0.2
```

### Poetry

See the Poetry [command reference](https://python-poetry.org/docs/cli/#search) for details on the commands below.

```bash
poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o
```

## Running Tests Locally

Take a look at the tests in `tests/dataquality` and `tests/jobs`, then run them with:

```bash
$ poetry run pytest
```
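
As a rough sketch of the style used there, a minimal self-contained test might look like this (the fixture, test name, and data are illustrative assumptions, not taken from the repo):

```python
import pytest
from pyspark.sql import SparkSession, Row

import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness


@pytest.fixture(scope="module")
def spark():
    # Assumption: the Deequ JAR coordinates are exposed by the pydeequ module.
    return (SparkSession.builder
            .config("spark.jars.packages", pydeequ.deequ_maven_coord)
            .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
            .getOrCreate())


def test_completeness(spark):
    # One of the two rows has a null, so completeness of "a" should be 0.5.
    df = spark.sparkContext.parallelize([Row(a="foo"), Row(a=None)]).toDF()
    result = AnalysisRunner(spark) \
        .onData(df) \
        .addAnalyzer(Completeness("a")) \
        .run()
    metrics = AnalyzerContext.successMetricsAsDataFrame(spark, result)
    assert metrics.filter("name = 'Completeness'").first()["value"] == 0.5
```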
