
Commit d138abc

Author: Nabil Fayak (committed)
Commit message: initial commit
Parent: 4157944

File tree

79 files changed: 6232 additions, 0 deletions


.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
.venv/
**/__pycache__/
.DS_Store
CheckMate.egg-info/

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.9.7

Makefile

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
TIMEOUT ?= 300

.PHONY: clean
clean:
	find . -name '*.pyo' -delete
	find . -name '*.pyc' -delete
	find . -name __pycache__ -delete
	find . -name '*~' -delete
	find . -name '.coverage.*' -delete

.PHONY: lint
lint:
	python docs/notebook_version_standardizer.py check-versions
	python docs/notebook_version_standardizer.py check-execution
	black . --check --config=./pyproject.toml
	ruff . --config=./pyproject.toml

.PHONY: lint-fix
lint-fix:
	python docs/notebook_version_standardizer.py standardize
	black . --config=./pyproject.toml
	ruff . --config=./pyproject.toml --fix

.PHONY: installdeps
installdeps:
	pip install --upgrade pip
	pip install -e .

.PHONY: installdeps-min
installdeps-min:
	pip install --upgrade pip -q
	pip install -e . --no-dependencies
	pip install -r tests/dependency_update_check/minimum_test_requirements.txt
	pip install -r tests/dependency_update_check/minimum_requirements.txt

.PHONY: installdeps-prophet
installdeps-prophet:
	pip install -e .[prophet]

.PHONY: installdeps-test
installdeps-test:
	pip install -e .[test]

.PHONY: installdeps-dev
installdeps-dev:
	pip install -e .[dev]
	pre-commit install

.PHONY: installdeps-docs
installdeps-docs:
	pip install -e .[docs]

.PHONY: upgradepip
upgradepip:
	python -m pip install --upgrade pip

.PHONY: upgradebuild
upgradebuild:
	python -m pip install --upgrade build

.PHONY: upgradesetuptools
upgradesetuptools:
	python -m pip install --upgrade setuptools

README.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# CheckMate

CheckMate is an AutoML library which catches and warns of problems with your data and problem setup before modeling.

## Installation

## Start

## Next Steps

Read more about CheckMate on our [documentation page](#).

## Support

The CheckMate community is happy to provide support to users of CheckMate. Project support can be found in four places depending on the type of question:
1. For usage questions, use [Stack Overflow](#) with the `checkmate` tag.
2. For bugs, issues, or feature requests, start a [GitHub issue](#).
3. For discussion regarding development on the core library, use [Slack](#).
4. For everything else, the core developers can be reached by email at [email protected].

## Built at Alteryx

**CheckMate** is an open source project built by [Alteryx](https://www.alteryx.com). To see the other open source projects we’re working on, visit [Alteryx Open Source](https://www.alteryx.com/open-source). If building impactful data science pipelines is important to you or your business, please get in touch.

<p align="center">
    <a href="https://www.alteryx.com/open-source">
        <img src="https://alteryx-oss-web-images.s3.amazonaws.com/OpenSource_Logo-01.png" alt="Alteryx Open Source" width="800"/>
    </a>
</p>

contributing.md

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
## Contributing to the Codebase

#### 0. Look at Open Issues
We currently use GitHub Issues as our project management tool for datachecks. Please do the following:
* Look at our [open issues](#).
* Find an unclaimed issue by looking for an empty `Assignees` field.
* If this is your first time contributing, issues labeled `good first issue` are a good place to start.
* If your issue is labeled `needs design` or `spike`, it is recommended that you provide a design document for your feature prior to submitting a pull request (PR).
* Connect your PR to your issue by adding the following comment in the PR body: `Fixes #<issue-number>`.

#### 1. Clone repo
The code is hosted on GitHub, so you will need to use Git to clone the project and make changes to the codebase. Once you have obtained a copy of the code, create a development environment that is separate from your existing Python environment so that you can make and test changes without compromising your own work environment. Make sure the version of Python you use is at least 3.8. With `conda`, you can run `conda create -n datachecks python=3.8` and `conda activate datachecks` before the following steps.
* Clone with `git clone https://github.com/NabilFayak/datachecks.git`.
* Install in edit mode with:
```bash
# move into the repo
cd datachecks
# installs the repo in edit mode, meaning changes to any files will be picked up in python. also installs all dependencies.
make installdeps-dev
```

<!--- Note that if you're on Mac, there are a few extra steps you'll want to keep track of.
* In order to run on Mac, [LightGBM requires the OpenMP library to be installed](https://datachecks.alteryx.com/en/stable/install.html#Mac), which can be done with HomeBrew by running `brew install libomp`
* We've seen some installs get the following warning when importing datachecks: "UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError". [A known workaround](https://stackoverflow.com/a/61531555/841003) is to run `brew reinstall readline xz` before installing the python version you're using via pyenv. If you've already installed a python version in pyenv, consider deleting it and reinstalling. v3.8.2 is known to work. --->

#### 2. Implement your Pull Request

* Implement your pull request. If needed, add new tests or update the documentation.
* Before submitting to GitHub, verify that the tests run and the code lints properly.
```bash
# runs linting
make lint

# will fix some common linting issues automatically, if the above command failed
make lint-fix

# runs all the unit tests locally
make test
```

* If you made changes to the documentation, build the documentation to view it locally.
```bash
# go to docs and build
cd docs
make html

# view docs locally
open build/html/index.html
```

* Before you commit, a few lint-fixing hooks will run. You can also run these manually.
```bash
# run linting hooks only on changed files
pre-commit run

# run linting hooks on all files
pre-commit run --all-files
```

Note that if you're building docs locally, the warning suppression code at `docs/source/disable-warnings.py` will not run, meaning you'll see Python warnings appear in the docs where applicable. To suppress this, add `warnings.filterwarnings('ignore')` to `docs/source/conf.py`.
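A minimal sketch of that addition (the exact placement inside `docs/source/conf.py` is a judgment call; near the top of the file is a reasonable spot):

```python
# docs/source/conf.py -- sketch of the warning suppression mentioned above.
import warnings

# Silence Python warnings during the docs build; narrow this down
# (e.g. by category or module) if a blanket filter is too aggressive.
warnings.filterwarnings("ignore")
```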
#### 3. Submit your Pull Request

* Once your changes are ready to be submitted, make sure to push your changes to GitHub before creating a pull request. Create a pull request, and our continuous integration will run automatically.

* Be sure to include unit tests (and docstring tests, if applicable) for your changes; the tests you write will also be run as part of the continuous integration.

* If your changes affect any of the following, please fix them as well:
  * Docstrings - if your changes render docstrings invalid
  * API changes - if you change the API, update `docs/source/api_reference.rst`
  * Documentation - run the documentation notebooks locally to ensure everything is logical and works as intended

* Update the "Future Release" section at the top of the release notes (`docs/source/release_notes.rst`) to include an entry for your pull request. Write your entry in past tense, e.g. "added fizzbuzz impl."

* Please create your pull request initially as [a "Draft" PR](https://docs.github.com/en/free-pro-team@latest/github/collaborating-with-issues-and-pull-requests/about-pull-requests#draft-pull-requests). This signals the team to ignore it and allows you to keep developing. When the check-in tests are passing and you're ready to get your pull request reviewed and merged, please convert it to a normal PR for review.

* We use GitHub Actions to run our PR check-in tests. On creation of the PR and for every change you make to your PR, you'll need a maintainer to click "Approve and run" on your PR. This is a change [GitHub made in April 2021](https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/).

* We ask that all contributors sign our contributor license agreement (CLA) the first time they contribute to datachecks. The CLA assistant will place a message on your PR; follow the instructions there to sign the CLA.

Add a description of your PR to the subsection that most closely matches your contribution:
* Enhancements: new features or additions to DataChecks.
* Fixes: things like bugfixes or adding more descriptive error messages.
* Changes: modifications to an existing part of DataChecks.
* Documentation Changes
* Testing Changes

If your work includes a [breaking change](https://en.wiktionary.org/wiki/breaking_change), please add a description of what has been affected in the "Breaking Changes" section below the latest release notes. If no "Breaking Changes" section yet exists, please create one as follows. See past release notes for examples of this.
```
.. warning::

    **Breaking Changes**

    * Description of your breaking change
```

### 4. Updating our conda package

We maintain a conda [package](#) to give users more options for how to install datachecks. Conda packages are created from recipes, which are YAML config files that list a package's dependencies and tests. Here is datachecks's latest published [recipe](#). GitHub repositories containing conda recipes are called `feedstocks`.

If you opened a PR to datachecks that modifies the packages in `dependencies` within `pyproject.toml`, or if the latest dependency bot updates the latest version of one of our packages, you will see a CI job called `build_conda_pkg`. This section describes what `build_conda_pkg` does and what to do if you see it fail in your PR.

#### What is `build_conda_pkg`?
`build_conda_pkg` clones the PR branch and builds the conda package from that branch. Since the conda build process runs our entire suite of unit tests, `build_conda_pkg` checks that our conda package actually supports the proposed change of the PR. We added this check to eliminate surprises. Since the conda package is released after we release to PyPI, it's possible that we released a dependency version that is not compatible with our conda recipe. It would be a pain to try to debug this at release time, since the PyPI release includes many possible PRs that could have introduced that change.

#### How does `build_conda_pkg` work?
`build_conda_pkg` clones the `master` branch of the feedstock as well as your datachecks PR branch. It then replaces the recipe in the `master` branch of the feedstock with the current latest [recipe](#) in datachecks. It also modifies the [source](#) field of the local copy of the recipe and points it at the local datachecks clone of your PR branch. This has the effect of building our conda package against your PR branch!

#### Why does `build_conda_pkg` use a recipe in datachecks as opposed to the recipe in the feedstock `master` branch?
One important fact to know about conda is that any change to the `master` branch of a feedstock will result in a new version of the conda package being published to the world!

With this in mind, let's say your PR requires modifying our dependencies. If we made a change to `master`, an updated version of datachecks's latest conda package would be released. This means people who installed the latest version of datachecks prior to this PR would get different dependency versions than those who installed datachecks after the PR got merged on GitHub. This is not desirable, especially because the PR would not get shipped to PyPI until the next release happens. So there would also be a discrepancy between the PyPI and conda versions.

By using a recipe stored in the datachecks repo, we can keep track of the changes that need to be made for the next release without having to publish a new conda package. Since the recipe is also "unique" to your PR, you are free to make whatever changes you need without disturbing other PRs. This would not be the case if `build_conda_pkg` ran from the `master` branch of the feedstock.

#### What to do if you see `build_conda_pkg` is red on your PR?
It depends on the kind of PR:

**Case 1: You're adding a completely new dependency**

In this case, `build_conda_pkg` is failing simply because a dependency is missing. Adding the dependency to the recipe should make the check green. To add the dependency, modify the recipe located at `.github/meta.yaml`.

If you see that adding the dependency causes the build to fail, possibly because of conflicting versions, then iterate until the build passes. The team will verify whether your changes make sense during PR review.

**Case 2: The latest dependency bot created a PR**

If the latest dependency bot PR fails `build_conda_pkg`, it means our code doesn't support the latest version of one of our dependencies. This means that we either have to cap the maximum allowed version in our requirements file or update our code to support that version. If we opt for the former, then just like in Case 1, make the corresponding change to the recipe located at `.github/meta.yaml`.

#### What about the `check_versions` CI check?
This check verifies that the allowed versions listed in `pyproject.toml` match those listed in the conda recipe, so that the PyPI requirements and conda requirements don't get out of sync.
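For intuition only, a rough sketch of what such a consistency check could do is below. This is not the project's actual `check_versions` job: the `meta.yaml` layout (`requirements: run:`), the plain-YAML assumption (real conda recipes often contain Jinja templating), and the naive string comparison are all simplifications.

```python
"""Illustrative sketch of a pyproject.toml vs. conda-recipe sync check.

Not the real check_versions implementation; the file layouts assumed here
are guesses made for illustration.
"""
import tomllib  # Python 3.11+; older versions can use the tomli backport

import yaml  # requires PyYAML


def pyproject_requirements(path="pyproject.toml"):
    """Return the dependency strings declared under [project].dependencies."""
    with open(path, "rb") as f:
        return set(tomllib.load(f)["project"]["dependencies"])


def conda_run_requirements(path=".github/meta.yaml"):
    """Return the recipe's run requirements, assuming plain YAML (no Jinja)."""
    with open(path) as f:
        return set(yaml.safe_load(f)["requirements"]["run"])


if __name__ == "__main__":
    out_of_sync = pyproject_requirements() ^ conda_run_requirements()
    if out_of_sync:
        raise SystemExit(f"PyPI and conda requirements differ: {sorted(out_of_sync)}")
```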
## Code Style Guide

* Keep things simple. Any complexity must be justified in order to pass code review.
* Be aware that while we love fancy Python magic, there's usually a simpler solution which is easier to understand!
* Make PRs as small as possible! Consider breaking your large changes into separate PRs. This will make code review easier, quicker, less bug-prone, and more effective.
* In the name of every branch you create, include the associated issue number if applicable.
* If new changes are added to the branch you're basing your changes off of, consider using `git rebase -i base_branch` rather than merging the base branch, to keep history clean.
* Always include a docstring for public methods and classes. Consider including docstrings for private methods too. We use the [Google docstring convention](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings), and use the [`sphinx.ext.napoleon`](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html) extension to parse our docstrings (see the sketch after this list).
* Although not explicitly enforced by the Google convention, keep the following stylistic conventions for docstrings in mind:
  - The first letter of each argument description should be capitalized.
  - Docstring sentences should end in periods. This includes descriptions for each argument.
  - Types should be written in lower case. For example, use "bool" instead of "Bool".
  - Always add the default value in the description of the argument, if applicable. For example, "Defaults to 1."
* Use [PascalCase (upper camel case)](https://en.wikipedia.org/wiki/Camel_case#Variations_and_synonyms) for class names, and [snake_case](https://en.wikipedia.org/wiki/Snake_case) for method and class member names.
* To distinguish private methods and class attributes from public ones, prefix the private ones with an underscore.
* Any code which doesn't need to be public should be private. Use `@staticmethod` and `@classmethod` where applicable, to indicate no side effects.
* Only call public methods in unit tests.
* All code must have unit test coverage. Use mocking and monkey-patching when necessary.
* Keep unit tests as fast as possible. In particular, avoid calling `fit`. Mocking can help with this.
* When you're working with code which uses a random number generator, make sure your unit tests set a random seed.
* Use `np.testing.assert_almost_equal` when comparing floating-point numbers, to avoid numerical precision issues, particularly cross-platform.
* Use `os.path` tools to keep file paths cross-platform.
* Our rule of thumb is to favor traditional inheritance over a mixin pattern.
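To make the docstring and naming guidelines above concrete, here is a small illustrative sketch; `ExampleDataCheck` is made up for this guide and is not part of the datachecks API.

```python
import numpy as np


class ExampleDataCheck:  # PascalCase class name
    """Toy check used only to illustrate the style guide above.

    Args:
        threshold (float): Maximum fraction of rows allowed to fail. Defaults to 0.1.
    """

    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self._last_fraction = None  # private attribute: leading underscore

    def failing_fraction(self, failing, total):  # snake_case method name
        """Return the fraction of failing rows.

        Args:
            failing (int): Number of failing rows.
            total (int): Total number of rows.

        Returns:
            float: Fraction of rows that failed.
        """
        self._last_fraction = failing / total
        return self._last_fraction


def test_failing_fraction_is_deterministic():
    """Unit-test sketch: seed randomness and compare floats with np.testing."""
    rng = np.random.default_rng(seed=0)  # always seed random number generators
    failing = int(rng.integers(0, 5))
    fraction = ExampleDataCheck().failing_fraction(failing, total=10)
    np.testing.assert_almost_equal(fraction, failing * 0.1)  # float-safe comparison
```

A real check in this repository would presumably subclass `DataCheck`; the sketch deliberately avoids assuming that interface.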
## GitHub Issue Guide

* Make the title as short and descriptive as possible.
* Make sure the body is concise and gets to the point quickly.
* Check for duplicates before filing.
* For bugs, a good general outline is: problem summary, reproduction steps, symptoms and scope, root cause if known, proposed solution(s), and next steps.
* If the issue write-up or conversation gets too long and hard to follow, consider starting a design document.
* Use the appropriate labels to help your issue get triaged quickly.
* Make your issues as actionable as possible. If they track open discussions, consider prefixing the title with "[Discuss]", or refining the issue further before filing.

datachecks/__init__.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
"""General datachecks directory."""

from datachecks.data_checks.checks.data_check import DataCheck
from datachecks.data_checks.datacheck_meta.data_check_message_code import (
    DataCheckMessageCode,
)
from datachecks.data_checks.datacheck_meta.data_check_action import DataCheckAction
from datachecks.data_checks.datacheck_meta.data_check_action_option import (
    DataCheckActionOption,
    DCAOParameterType,
    DCAOParameterAllowedValuesType,
)
from datachecks.data_checks.datacheck_meta.data_check_action_code import (
    DataCheckActionCode,
)
from datachecks.data_checks.checks.data_checks import DataChecks
from datachecks.data_checks.datacheck_meta.data_check_message import (
    DataCheckMessage,
    DataCheckWarning,
    DataCheckError,
)
from datachecks.data_checks.datacheck_meta.data_check_message_type import (
    DataCheckMessageType,
)
from datachecks.data_checks.checks.id_columns_data_check import IDColumnsDataCheck

from datachecks.problem_types.problem_types import ProblemTypes
from datachecks.problem_types.utils import (
    handle_problem_types,
    detect_problem_type,
    is_regression,
    is_binary,
    is_multiclass,
    is_classification,
    is_time_series,
)

from datachecks.exceptions.exceptions import (
    DataCheckInitError,
    MissingComponentError,
    ValidationErrorCode,
)

from datachecks.utils.gen_utils import classproperty
from datachecks.utils.woodwork_utils import infer_feature_types
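Because these names are re-exported at the package root, downstream code can import them without the deep module paths. A minimal illustrative sketch (no constructor arguments or method signatures are assumed beyond what this file exposes):

```python
# Flat imports enabled by the re-exports above; purely illustrative.
from datachecks import (
    DataCheck,
    DataChecks,
    IDColumnsDataCheck,
    DataCheckWarning,
    ProblemTypes,
    infer_feature_types,
)
```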

datachecks/data_checks/__init__.py

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
"""Base data checks and ID column data check."""

from datachecks.data_checks.checks.data_check import DataCheck
from datachecks.data_checks.datacheck_meta.data_check_message_code import (
    DataCheckMessageCode,
)
from datachecks.data_checks.datacheck_meta.data_check_action import DataCheckAction
from datachecks.data_checks.datacheck_meta.data_check_action_option import (
    DataCheckActionOption,
    DCAOParameterType,
    DCAOParameterAllowedValuesType,
)
from datachecks.data_checks.datacheck_meta.data_check_action_code import (
    DataCheckActionCode,
)
from datachecks.data_checks.checks.data_checks import DataChecks
from datachecks.data_checks.datacheck_meta.data_check_message import (
    DataCheckMessage,
    DataCheckWarning,
    DataCheckError,
)
from datachecks.data_checks.datacheck_meta.data_check_message_type import (
    DataCheckMessageType,
)
from datachecks.data_checks.checks.id_columns_data_check import IDColumnsDataCheck

from datachecks.data_checks.datacheck_meta.utils import handle_data_check_action_code
