From a080b327d163cd3d4b26ff47f0b89e17ecb3a0b3 Mon Sep 17 00:00:00 2001
From: Johannes Schrott
Date: Fri, 27 Dec 2024 17:25:41 +0100
Subject: [PATCH 1/5] Update README.md

Reformatted markdown and fixed typos
---
 README.md | 50 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/README.md b/README.md
index 74220b5..4d5a2c3 100644
--- a/README.md
+++ b/README.md
@@ -2,23 +2,29 @@
 
 ## Overview
 
-__Jenga__ is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
+__Jenga__ is an open source experimentation library that allows data science practitioners and researchers to study
+the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of
+their ML models.
 
 We design Jenga around three core abstractions:
 
- * [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
- * [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing values)
- * [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the test data of the task, and recording the predictive performance of the model on the corrupted test data.
+* [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
+* [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing
+  values)
+* [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the
+  test data of the task, and recording the predictive performance of the model on the corrupted test data.
 
-Jenga's goal is assist data scientists with detecting such errors early, so that they can protected their models against them. We provide a [jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).
+Jenga's goal is to assist data scientists with detecting such errors early, so that they can protect their models against
+them. We provide a [jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).
 
-Note that you can implement custom tasks and data corruptions by extending the corresponding provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
+Note that you can implement custom tasks and data corruptions by extending the corresponding
+provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
 
 We additionally provide three advanced usage examples of Jenga:
 
- * [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
- * [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
- * [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)
+* [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
+* [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
+* [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)
 
 ## Installation
 
@@ -31,16 +37,19 @@ pip install jenga[image] # also installs tensorflow ad image corruption/aug
 pip install jenga[validation] # also install tensorflow and tensorflow-data-validation necessary for SchemaStresstest
 ```
 
-
 ## Research
 
 __Jenga__ is based on experiences and code from our ongoing research efforts:
 
- * Sebastian Schelter, Tammo Rukat, Felix Biessmann (2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf) ACM SIGMOD.
- * Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf) ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
- * Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source track)
-
-
+* Sebastian Schelter, Tammo Rukat, Felix Biessmann (
+  2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf)
+  ACM SIGMOD.
+* Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (
+  2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf)
+  ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
+* Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David
+  Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source
+  track)
 
 ## Dependency Management & Reproducibility
 
@@ -57,10 +66,9 @@ __Jenga__ is based on experiences and code from our ongoing research efforts:
    conda env update -f environment.lock.yaml --prune
    ```
 
+## Installation for Development
 
- ## Installation for Development
-
- In order to set up the necessary environment:
+In order to set up the necessary environment:
 
 1. create an environment `jenga` with the help of [conda],
    ```
@@ -86,15 +94,19 @@ Optional and needed only once after `git clone`:
 
 Then take a look into the `notebooks` folder.
 
-
 ## Note
 
 This project has been set up using PyScaffold 3.2.2 and the [dsproject extension] 0.4. For details and usage information on PyScaffold see https://pyscaffold.org/.
 
 [conda]: https://docs.conda.io/
+
 [pre-commit]: https://pre-commit.com/
+
 [Jupyter]: https://jupyter.org/
+
 [nbstripout]: https://github.com/kynan/nbstripout
+
 [Google style]: http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings
+
 [dsproject extension]: https://github.com/pyscaffold/pyscaffoldext-dsproject

From e82c99fda7fe7682f11f5b96f2305c1df911c4de Mon Sep 17 00:00:00 2001
From: Johannes Schrott
Date: Fri, 27 Dec 2024 17:37:07 +0100
Subject: [PATCH 2/5] Update README.md

Added a section on Jenga's software requirements
---
 README.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/README.md b/README.md
index 4d5a2c3..105c7ae 100644
--- a/README.md
+++ b/README.md
@@ -26,6 +26,17 @@ We additionally provide three advanced usage examples of Jenga:
 * [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
 * [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)
 
+## Requirements
+
+To proceed with the installation of Jenga, the following requirements must be met:
+
+* Python between version 3.7 and 3.11 is required.
+* An operating system different from Microsoft Windows is preferable; when using Windows, WSL must be used.
+
+Both requirements are inherited from [TensorFlow](https://www.tensorflow.org/),
+respectively the [`tensorflow-data-validation` package](https://github.com/tensorflow/data-validation),
+which is required for Jenga's `validation` package extra.
+
 ## Installation
 
 The following options are possible:

From 07487643f855fbf71c1646bdca7e508e7ad2f198 Mon Sep 17 00:00:00 2001
From: Johannes Schrott
Date: Fri, 27 Dec 2024 17:38:29 +0100
Subject: [PATCH 3/5] Update setup.cfg

- Adjusted the keys in the config file to match current versions of "setuptools"
- Adjusted the supported Python versions in accordance with the dependencies (--> tensorflow-data-validation!)
---
 setup.cfg | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/setup.cfg b/setup.cfg
index d42087e..425daea 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -6,12 +6,12 @@
 name = jenga
 description = Jenga is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
 author = Sebastian Schelter
-author-email = s.schelter@uva.nl
+author_email = s.schelter@uva.nl
 license = gpl3
-long-description = file: README.md
-long-description-content-type = text/markdown; charset=UTF-8; variant=GFM
+long_description = file: README.md
+long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
 url = https://github.com/schelterlabs/jenga
-project-urls =
+project_urls =
     Documentation = https://github.com/schelterlabs/jenga
 # Change if running only on Windows, Mac or Linux (comma-separated)
 platforms = any
@@ -35,7 +35,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
 # tests_require = pytest; pytest-cov
 # Require a specific Python version, e.g. Python 2.7 or >= 3.4
 # python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
-python_requires = >=3.7.*,<3.10.*
+python_requires = >=3.7,<3.11
 install_requires =
     scikit-learn
     pandas

From 40a13c407ca32ac78346e6902246620845e6dde4 Mon Sep 17 00:00:00 2001
From: Johannes Schrott
Date: Fri, 27 Dec 2024 17:41:59 +0100
Subject: [PATCH 4/5] Update setup.cfg

---
 setup.cfg | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.cfg b/setup.cfg
index 425daea..5c6fcfa 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -35,7 +35,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
 # tests_require = pytest; pytest-cov
 # Require a specific Python version, e.g. Python 2.7 or >= 3.4
 # python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
-python_requires = >=3.7,<3.11
+python_requires = >=3.7,<3.12
 install_requires =
     scikit-learn
     pandas

From 5065a63237c5c3401a804d04ef7badbd0daed230 Mon Sep 17 00:00:00 2001
From: Johannes Schrott <23276756+johannesschrott@users.noreply.github.com>
Date: Mon, 27 Jan 2025 12:59:43 +0100
Subject: [PATCH 5/5] Create build-docs.yml

---
 .github/workflows/build-docs.yml | 40 ++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 .github/workflows/build-docs.yml

diff --git a/.github/workflows/build-docs.yml b/.github/workflows/build-docs.yml
new file mode 100644
index 0000000..9a62dee
--- /dev/null
+++ b/.github/workflows/build-docs.yml
@@ -0,0 +1,40 @@
+# This workflow will install Python dependencies and build the Sphinx documentation with a single version of Python
+# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+
+name: Build Sphinx docs
+
+on:
+  push:
+    branches: [ "add-further-missing-value-polluters" ]
+  pull_request:
+    branches: [ "add-further-missing-value-polluters" ]
+
+permissions:
+  contents: write
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+    - name: Set up Python 3.8
+      uses: actions/setup-python@v3
+      with:
+        python-version: "3.8"
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install .[docs]
+    - name: Build documentation
+      run: |
+        cd docs
+        make html
+    - name: Push documentation
+      run: |
+        git config user.name github-actions
+        git config user.email github-actions@github.com
+        git add build/sphinx/* -f
+        git commit -m "Added documentation HTML files"
+        git push