diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml
new file mode 100644
index 0000000..9a62dee
--- /dev/null
+++ b/.github/workflows/build-docs.yml
@@ -0,0 +1,40 @@
+# This workflow will install Python dependencies, run tests and lint with a single version of Python
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Build Sphinx docs
+
+on:
+  push:
+    branches: [ "add-further-missing-value-polluters" ]
+  pull_request:
+    branches: [ "add-further-missing-value-polluters" ]
+
+permissions:
+  contents: write
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v3
+      with:
+        python-version: "3.8"
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install .[docs]
+    - name: Build documentation
+      run: |
+        cd docs
+        make html
+    - name: Push documentation
+      run: |
+        git config user.name github-actions
+        git config user.email github-actions@github.com
+        git add build/sphinx/* -f
+        git commit -m "Added documentation HTML files"
+        git push
diff --git a/README.md b/README.md
index 74220b5..105c7ae 100644
--- a/README.md
+++ b/README.md
@@ -2,23 +2,40 @@
 
 ## Overview
 
-__Jenga__ is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
+__Jenga__ is an open source experimentation library that allows data science practitioners and researchers to study
+the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of
+their ML models.
 
 We design Jenga around three core abstractions:
 
- * [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
- * [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing values)
- * [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the test data of the task, and recording the predictive performance of the model on the corrupted test data.
+* [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
+* [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing
+  values)
+* [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the
+  test data of the task, and recording the predictive performance of the model on the corrupted test data.
 
-Jenga's goal is assist data scientists with detecting such errors early, so that they can protected their models against them. We provide a [jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).
+Jenga's goal is assist data scientists with detecting such errors early, so that they can protected their models against
+them. We provide a [jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).
 
-Note that you can implement custom tasks and data corruptions by extending the corresponding provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
+Note that you can implement custom tasks and data corruptions by extending the corresponding
+provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
 
 We additionally provide three advanced usage examples of Jenga:
 
- * [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
- * [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
- * [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)
+* [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
+* [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
+* [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)
+
+## Requirements
+
+To proceed with the installation of Jenga, the following requirements must be met:
+
+* Python between version 3.7 and 3.11 is required.
+* An operating system different from Microsoft Windows is preferable; when using Windows, WSL must be used.
+
+Both requirements are inherited from [TensorFlow](https://www.tensorflow.org/),
+respectively the [`tensorflow-data-validation` package](https://github.com/tensorflow/data-validation),
+which is required for Jenga's `validation` package extra.
 
 ## Installation
 
@@ -31,16 +48,19 @@ pip install jenga[image] # also installs tensorflow ad image corruption/aug
 pip install jenga[validation] # also install tensorflow and tensorflow-data-validation necessary for SchemaStresstest
 ```
 
-
 ## Research
 
 __Jenga__ is based on experiences and code from our ongoing research efforts:
 
- * Sebastian Schelter, Tammo Rukat, Felix Biessmann (2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf) ACM SIGMOD.
- * Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf) ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
- * Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source track)
-
-
+* Sebastian Schelter, Tammo Rukat, Felix Biessmann (
+  2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf)
+  ACM SIGMOD.
+* Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (
+  2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf)
+  ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
+* Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David
+  Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source
+  track)
 
 ## Dependency Management & Reproducibility
 
@@ -57,10 +77,9 @@ __Jenga__ is based on experiences and code from our ongoing research efforts:
    conda env update -f environment.lock.yaml --prune
    ```
 
+## Installation for Development
 
- ## Installation for Development
-
- In order to set up the necessary environment:
+In order to set up the necessary environment:
 
 1. create an environment `jenga` with the help of [conda],
    ```
@@ -86,15 +105,19 @@ Optional and needed only once after `git clone`:
 
 Then take a look into the `notebooks` folder.
 
-
 ## Note
 
 This project has been set up using PyScaffold 3.2.2 and the [dsproject extension] 0.4.
 For details and usage information on PyScaffold see https://pyscaffold.org/.
 
 [conda]: https://docs.conda.io/
+
 [pre-commit]: https://pre-commit.com/
+
 [Jupyter]: https://jupyter.org/
+
 [nbstripout]: https://github.com/kynan/nbstripout
+
 [Google style]: http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings
+
 [dsproject extension]: https://github.com/pyscaffold/pyscaffoldext-dsproject
diff --git a/setup.cfg b/setup.cfg
index d42087e..5c6fcfa 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -6,12 +6,12 @@
 name = jenga
 description = Jenga is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
 author = Sebastian Schelter
-author-email = s.schelter@uva.nl
+author_email = s.schelter@uva.nl
 license = gpl3
-long-description = file: README.md
-long-description-content-type = text/markdown; charset=UTF-8; variant=GFM
+long_description = file: README.md
+long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
 url = https://github.com/schelterlabs/jenga
-project-urls =
+project_urls =
     Documentation = https://github.com/schelterlabs/jenga
 # Change if running only on Windows, Mac or Linux (comma-separated)
 platforms = any
@@ -35,7 +35,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
 # tests_require = pytest; pytest-cov
 # Require a specific Python version, e.g. Python 2.7 or >= 3.4
 # python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
-python_requires = >=3.7.*,<3.10.*
+python_requires = >=3.7,<3.12
 install_requires =
     scikit-learn
     pandas