Commit 30375bb

feat(Spark3Support): Adding support to pyspark 3.0 (#41)

* feat(Spark3Support): Adding support to pyspark 3.0
* feat(Spark3Support): Adding poetry lock file
* Adding docs on dev setup

1 parent bda44da commit 30375bb

29 files changed: +3998, -2108 lines

README.md

Lines changed: 104 additions & 23 deletions
@@ -1,19 +1,19 @@
# PyDeequ

PyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. PyDeequ is written to support usage of Deequ in Python.

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Coverage](https://img.shields.io/badge/coverage-90%25-green)

There are four main components of Deequ:

- Metrics Computation:
  - `Profiles` leverages Analyzers to analyze each column of a dataset.
  - `Analyzers` serve here as a foundational module that computes metrics for data profiling and validation at scale.
- Constraint Suggestion:
  - Specify rules for various groups of Analyzers to be run over a dataset to return a collection of constraints suggested to run in a Verification Suite.
- Constraint Verification:
  - Perform data validation on a dataset with respect to various constraints set by you.
- Metrics Repository:
  - Allows for persistence and tracking of Deequ runs over time.

![](imgs/pydeequ_architecture.jpg)

@@ -32,9 +32,9 @@ You can install [PyDeequ via pip](https://pypi.org/project/pydeequ/).

```
pip install pydeequ
```

### Set up a PySpark session

```python
from pyspark.sql import SparkSession, Row
import pydeequ
@@ -51,7 +51,7 @@ df = spark.sparkContext.parallelize([
    Row(a="baz", b=3, c=None)]).toDF()
```
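
The session construction itself is elided by the hunk above. As a rough sketch, a PyDeequ-ready session is typically configured with the Deequ JAR on the classpath; this assumes the `pydeequ.deequ_maven_coord` and `pydeequ.f2j_maven_coord` helpers exposed by the package, and the first two rows of `df` are illustrative:

```python
from pyspark.sql import SparkSession, Row
import pydeequ

# Assumption: pydeequ exposes the Deequ Maven coordinate and its
# artifact exclusions as module-level helpers.
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),   # illustrative row
    Row(a="bar", b=2, c=6),   # illustrative row
    Row(a="baz", b=3, c=None)]).toDF()
```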

### Analyzers

```python
from pydeequ.analyzers import *
@@ -61,12 +61,12 @@ analysisResult = AnalysisRunner(spark) \
    .addAnalyzer(Size()) \
    .addAnalyzer(Completeness("b")) \
    .run()

analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
```
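
Beyond `Size` and `Completeness`, the same runner chain accepts any analyzer from `pydeequ.analyzers`; a hedged sketch (the analyzer choice here is illustrative, not part of this diff):

```python
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Mean, ApproxCountDistinct

# Illustrative only: chain a couple of extra analyzers over the same df.
analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(Mean("b")) \
    .addAnalyzer(ApproxCountDistinct("b")) \
    .run()

AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show()
```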

### Profile

```python
from pydeequ.profiles import *
@@ -79,7 +79,7 @@ for col, profile in result.profiles.items():
    print(profile)
```
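
Each `profile` above is a per-column profile object; a hedged sketch of reading individual fields from it (the attribute names are an assumption about the profiles API, not shown in this diff):

```python
# Assumption: column profiles expose completeness, an approximate distinct
# count, and the inferred data type as attributes.
for col, profile in result.profiles.items():
    print(col, profile.completeness, profile.approximateNumDistinctValues, profile.dataType)
```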

### Constraint Suggestions

```python
from pydeequ.suggestions import *
@@ -90,10 +90,10 @@ suggestionResult = ConstraintSuggestionRunner(spark) \
    .run()

# Constraint Suggestions in JSON format
print(suggestionResult)
```
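
Because the result comes back as JSON, individual suggestions can be walked programmatically; a sketch assuming the payload is a dict with a `constraint_suggestions` list (the key names are an assumption, not confirmed by this diff):

```python
# Assumption: suggestionResult looks like
# {"constraint_suggestions": [{"column_name": ..., "description": ...,
#                              "code_for_constraint": ...}, ...]}
for s in suggestionResult.get("constraint_suggestions", []):
    print(f"{s['column_name']}: {s['description']}")
```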

### Constraint Verification

```python
from pydeequ.checks import *
@@ -111,14 +111,14 @@ checkResult = VerificationSuite(spark) \
    .isContainedIn("a", ["foo", "bar", "baz"]) \
    .isNonNegative("b")) \
    .run()

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()
```
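
The hunk elides the `Check` that feeds this suite; a minimal sketch of the missing setup, assuming the `Check`/`CheckLevel` classes in `pydeequ.checks` and the `VerificationSuite`/`VerificationResult` pair in `pydeequ.verification` (the check level and description are illustrative):

```python
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Illustrative check; the constraints mirror the tail of the hunk above.
check = Check(spark, CheckLevel.Warning, "Review Check")

checkResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        check.isContainedIn("a", ["foo", "bar", "baz"]) \
             .isNonNegative("b")) \
    .run()

VerificationResult.checkResultsAsDataFrame(spark, checkResult).show()
```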

### Repository

Save to a Metrics Repository by adding the `useRepository()` and `saveOrAppendResult()` calls to your Analysis Runner.

```python
from pydeequ.repository import *
from pydeequ.analyzers import *
@@ -136,19 +136,100 @@ analysisResult = AnalysisRunner(spark) \
    .run()
```
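
The hunk elides the repository wiring that `useRepository()` and `saveOrAppendResult()` need; a hedged sketch of the full runner, assuming `FileSystemMetricsRepository` and `ResultKey` from `pydeequ.repository` (the file name and tag values are illustrative):

```python
from pydeequ.repository import FileSystemMetricsRepository, ResultKey
from pydeequ.analyzers import AnalysisRunner, ApproxCountDistinct

# Illustrative wiring: persist metrics to a JSON file on the filesystem.
metrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')
repository = FileSystemMetricsRepository(spark, metrics_file)
result_key = ResultKey(spark, ResultKey.current_milli_time(), {'tag': 'pydeequ-example'})

analysisResult = AnalysisRunner(spark) \
    .onData(df) \
    .addAnalyzer(ApproxCountDistinct('b')) \
    .useRepository(repository) \
    .saveOrAppendResult(result_key) \
    .run()
```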

To load previous runs, use the `repository` object to load the results back in.

```python
result_metrep_df = repository.load() \
    .before(ResultKey.current_milli_time()) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
```
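
Runs can also be filtered by the tags attached to their `ResultKey`; a hedged sketch assuming the loader mirrors Deequ's `withTagValues` filter (not shown in this diff):

```python
# Assumption: the loader supports tag-based filtering like Deequ's
# repository API does.
result_df = repository.load() \
    .withTagValues({'tag': 'pydeequ-example'}) \
    .forAnalyzers([ApproxCountDistinct('b')]) \
    .getSuccessMetricsAsDataFrame()
result_df.show()
```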

## [Contributing](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md)

Please refer to the [contributing doc](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md) for how to contribute to PyDeequ.

## [License](https://github.com/awslabs/python-deequ/blob/master/LICENSE)

This library is licensed under the Apache 2.0 License.

## Getting Started

1. Setup [SDKMAN](#setup-sdkman)
1. Setup [Java](#setup-java)
1. Setup [Apache Spark](#setup-apache-spark)
1. Install [Poetry](#poetry)
1. Install pre-commit and [follow the instructions here](PreCommit.MD)
1. Run [tests locally](#running-tests-locally)

### Setup SDKMAN

SDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based
system. It provides a convenient command-line interface for installing, switching, removing, and listing
candidates. SDKMAN! installs smoothly on macOS, Linux, WSL, Cygwin, etc., and supports the Bash and Zsh shells. See the
documentation on the [SDKMAN! website](https://sdkman.io).

Open your favourite terminal and enter the following:

```bash
$ curl -s https://get.sdkman.io | bash

# If the environment needs tweaking for SDKMAN to be installed,
# the installer will prompt you accordingly and ask you to restart.
# Next, open a new terminal or enter:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

# Lastly, run the following to confirm that the installation succeeded:

$ sdk version
```

### Setup Java

To install Java, open your favourite terminal and enter the following:

```bash
# List the AdoptOpenJDK versions
$ sdk list java

# To install Java 11
$ sdk install java 11.0.10.hs-adpt

# To install Java 8
$ sdk install java 8.0.292.hs-adpt
```

### Setup Apache Spark

To install Apache Spark, open your favourite terminal and enter the following:

```bash
# List the Apache Spark versions
$ sdk list spark

# To install Spark 3
$ sdk install spark 3.0.2
```

### Poetry

See the Poetry [command reference](https://python-poetry.org/docs/cli/#search) for details on the commands below.

```bash
poetry install

poetry update

# --tree: List the dependencies as a tree.
# --latest (-l): Show the latest version.
# --outdated (-o): Show the latest version but only for packages that are outdated.
poetry show -o
```

## Running Tests Locally

Take a look at the tests in `tests/dataquality` and `tests/jobs`, then run them with:

```bash
$ poetry run pytest
```
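
As a rough sketch of the style used there, a minimal self-contained test might look like this (the fixture, test name, and data are illustrative assumptions, not taken from the repo):

```python
import pytest
from pyspark.sql import SparkSession, Row

import pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness


@pytest.fixture(scope="module")
def spark():
    # Assumption: the Deequ JAR coordinates are exposed by the pydeequ module.
    return (SparkSession.builder
            .config("spark.jars.packages", pydeequ.deequ_maven_coord)
            .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
            .getOrCreate())


def test_completeness(spark):
    # One of the two rows has a null, so completeness of "a" should be 0.5.
    df = spark.sparkContext.parallelize([Row(a="foo"), Row(a=None)]).toDF()
    result = AnalysisRunner(spark) \
        .onData(df) \
        .addAnalyzer(Completeness("a")) \
        .run()
    metrics = AnalyzerContext.successMetricsAsDataFrame(spark, result)
    assert metrics.filter("name = 'Completeness'").first()["value"] == 0.5
```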
