40 changes: 40 additions & 0 deletions .github/workflows/build-docs.yml
@@ -0,0 +1,40 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Build Sphinx docs

on:
  push:
    branches: [ "add-further-missing-value-polluters" ]
  pull_request:
    branches: [ "add-further-missing-value-polluters" ]

permissions:
  contents: write

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python 3.8
      uses: actions/setup-python@v3
      with:
        python-version: "3.8"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install .[docs]
    - name: Build documentation
      run: |
        cd docs
        make html
    - name: Push documentation
      run: |
        git config user.name github-actions
        git config user.email [email protected]
        git add build/sphinx/* -f
        git commit -m "Added documentation HTML files"
        git push
61 changes: 42 additions & 19 deletions README.md
@@ -2,23 +2,40 @@

## Overview

__Jenga__ is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
__Jenga__ is an open source experimentation library that allows data science practitioners and researchers to study
the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of
their ML models.

We design Jenga around three core abstractions:

* [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
* [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing values)
* [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the test data of the task, and recording the predictive performance of the model on the corrupted test data.
* [Tasks](tasks) contain a raw dataset, an ML model and a prediction task
* [Data corruptions](corruptions) take raw input data and randomly apply certain data errors to them (e.g., missing
values)
* [Evaluators](evaluation) take a task and data corruptions, and execute the evaluation by repeatedly corrupting the
test data of the task, and recording the predictive performance of the model on the corrupted test data.

Jenga's goal is assist data scientists with detecting such errors early, so that they can protected their models against them. We provide a [jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).
Jenga's goal is to assist data scientists with detecting such errors early, so that they can protect their models against
them. We provide a [Jupyter notebook outlining the most basic usage of Jenga](notebooks/basic-example.ipynb).

Note that you can implement custom tasks and data corruptions by extending the corresponding provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
Note that you can implement custom tasks and data corruptions by extending the corresponding
provided [base classes](https://github.com/schelterlabs/jenga/blob/master/jenga/basis.py).
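
The interplay of the three abstractions described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not Jenga's API: the `MissingValues` class, the `transform(rows, rng)` signature, the `evaluate` helper, and the toy `predict` model are all hypothetical names chosen for this example. It corrupts a copy of the test data several times and records the accuracy of a fixed model on each corrupted copy.

```python
import random


class MissingValues:
    """Hypothetical corruption: sets a fraction of one column's cells to None."""

    def __init__(self, column, fraction):
        self.column = column
        self.fraction = fraction

    def transform(self, rows, rng):
        # Copy every row so the clean test data stays untouched.
        corrupted = [dict(row) for row in rows]
        n_missing = round(self.fraction * len(corrupted))
        for i in rng.sample(range(len(corrupted)), n_missing):
            corrupted[i][self.column] = None
        return corrupted


def accuracy(predict, rows, labels):
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def evaluate(predict, rows, labels, corruption, repetitions=5, seed=0):
    """Score the model on clean data, then on repeatedly corrupted copies."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    corrupted_scores = [
        accuracy(predict, corruption.transform(rows, rng), labels)
        for _ in range(repetitions)
    ]
    return baseline, corrupted_scores


# Toy "task": the label is 1 when age >= 30; the model treats a missing age as 0.
rows = [{"age": age} for age in (22, 35, 41, 19, 58, 30, 27, 63)]
labels = [int(row["age"] >= 30) for row in rows]


def predict(row):
    return int(row["age"] is not None and row["age"] >= 30)


baseline, corrupted_scores = evaluate(
    predict, rows, labels, MissingValues("age", fraction=0.5)
)
print(baseline, corrupted_scores)
```

Comparing `baseline` with the entries of `corrupted_scores` shows how much predictive quality the model loses when half of the `age` values go missing. For real experiments, the provided base classes and notebooks are the place to look.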

We additionally provide three advanced usage examples of Jenga:
* [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
* [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
* [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)

* [Studying the impact of missing values](notebooks/example-missing-value-imputation.ipynb)
* [Stress testing a feature schema](notebooks/example-schema-stresstest.ipynb)
* [Evaluating the helpfulness of data augmentation for an image recognition task](notebooks/example-image-augmentation.ipynb)

## Requirements

To proceed with the installation of Jenga, the following requirements must be met:

* Python between version 3.7 and 3.11 is required.
* An operating system different from Microsoft Windows is preferable; when using Windows, WSL must be used.

Both requirements are inherited from [TensorFlow](https://www.tensorflow.org/) and, more specifically,
from the [`tensorflow-data-validation` package](https://github.com/tensorflow/data-validation),
which is required for Jenga's `validation` package extra.
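
The two requirements above can be checked before installation with a short script. This is a sketch under stated assumptions: the `check_requirements` helper and its constants are hypothetical and not part of Jenga, and the version bounds simply mirror the range given in this section. Under WSL, `platform.system()` reports `Linux`, so only native Windows is flagged.

```python
import platform
import sys

# Bounds taken from the requirements listed above: 3.7 up to and including 3.11.
MIN_PYTHON = (3, 7)
MAX_PYTHON_EXCLUSIVE = (3, 12)


def check_requirements(version=None, system=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(sys.version_info[:2]) if version is None else version
    system = platform.system() if system is None else system

    problems = []
    if not (MIN_PYTHON <= version < MAX_PYTHON_EXCLUSIVE):
        problems.append(
            "Python %d.%d is outside the supported range 3.7 to 3.11" % version
        )
    if system == "Windows":
        problems.append("native Windows detected; use WSL instead")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("Requirement not met:", problem)
```

The optional `version` and `system` arguments exist only so that the check can be exercised without changing the interpreter or the machine.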

## Installation

@@ -31,16 +48,19 @@ pip install jenga[image] # also installs tensorflow ad image corruption/aug
pip install jenga[validation] # also install tensorflow and tensorflow-data-validation necessary for SchemaStresstest
```


## Research

__Jenga__ is based on experiences and code from our ongoing research efforts:

* Sebastian Schelter, Tammo Rukat, Felix Biessmann (2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf) ACM SIGMOD.
* Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf) ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
* Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source track)


* Sebastian Schelter, Tammo Rukat, Felix Biessmann (
2020). [Learning to Validate the Predictions of Black Box Classifiers on Unseen Data.](https://ssc.io/pdf/mod0077s.pdf)
ACM SIGMOD.
* Tammo Rukat, Dustin Lange, Sebastian Schelter, Felix Biessmann (
2020): [Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality.](https://ssc.io/pdf/autoops.pdf)
ML Ops workshop at the Conference on Machine Learning and Systems (MLSys).
* Felix Biessmann, Tammo Rukat, Philipp Schmidt, Prathik Naidu, Sebastian Schelter, Andrey Taptunov, Dustin Lange, David
Salinas (2019). [DataWig - Missing Value Imputation for Tables.](https://ssc.io/pdf/datawig.pdf) JMLR (open source
track)

## Dependency Management & Reproducibility

@@ -57,10 +77,9 @@ __Jenga__ is based on experiences and code from our ongoing research efforts:
conda env update -f environment.lock.yaml --prune
```

## Installation for Development

## Installation for Development

In order to set up the necessary environment:
In order to set up the necessary environment:

1. create an environment `jenga` with the help of [conda],
```
@@ -86,15 +105,19 @@ Optional and needed only once after `git clone`:

Then take a look into the `notebooks` folder.


## Note

This project has been set up using PyScaffold 3.2.2 and the [dsproject extension] 0.4.
For details and usage information on PyScaffold see https://pyscaffold.org/.

[conda]: https://docs.conda.io/

[pre-commit]: https://pre-commit.com/

[Jupyter]: https://jupyter.org/

[nbstripout]: https://github.com/kynan/nbstripout

[Google style]: http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings

[dsproject extension]: https://github.com/pyscaffold/pyscaffoldext-dsproject
10 changes: 5 additions & 5 deletions setup.cfg
@@ -6,12 +6,12 @@
name = jenga
description = Jenga is an open source experimentation library that allows data science practititioners and researchers to study the effect of common data corruptions (e.g., missing values, broken character encodings) on the prediction quality of their ML models.
author = Sebastian Schelter
author-email = [email protected]
author_email = [email protected]
license = gpl3
long-description = file: README.md
long-description-content-type = text/markdown; charset=UTF-8; variant=GFM
long_description = file: README.md
long_description_content_type = text/markdown; charset=UTF-8; variant=GFM
url = https://github.com/schelterlabs/jenga
project-urls =
project_urls =
    Documentation = https://github.com/schelterlabs/jenga
# Change if running only on Windows, Mac or Linux (comma-separated)
platforms = any
@@ -35,7 +35,7 @@ setup_requires = pyscaffold>=3.2a0,<3.3a0
# tests_require = pytest; pytest-cov
# Require a specific Python version, e.g. Python 2.7 or >= 3.4
# python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*
python_requires = >=3.7.*,<3.10.*
python_requires = >=3.7,<3.12
install_requires =
    scikit-learn
    pandas