Commit 4241814

Merge pull request #535 from yzhao062/development
Development V1.12
2 parents bb53fdc + a7b44f2

File tree

18 files changed: +650 / -586 lines changed


CHANGES.txt

Lines changed: 4 additions & 1 deletion
@@ -179,4 +179,7 @@ v<1.0.8>, <03/08/2023> -- Optimized ECDF and drop Statsmodels dependency (#467).
 v<1.0.9>, <03/19/2023> -- Hot fix for errors in ECOD and COPOD due to the issue of scipy.
 v<1.1.0>, <06/19/2023> -- Further integration of PyThresh.
 v<1.1.1>, <07/03/2023> -- Bump up sklearn requirement and some hot fixes.
-v<1.1.1>, <10/24/2023> -- Add deep isolation forest (#506)
+v<1.1.1>, <10/24/2023> -- Add deep isolation forest (#506).
+v<1.1.2>, <11/17/2023> -- Massive documentation optimization.
+v<1.1.2>, <11/17/2023> -- Fix the issue of contamination.
+v<1.1.2>, <11/17/2023> -- KPCA bug fix (#494).

README.rst

Lines changed: 70 additions & 66 deletions
@@ -58,20 +58,35 @@ Python Outlier Detection (PyOD)
 
 -----
 
-**News**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_.
-The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
 
-**For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.
-**For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.
+Read Me First
+^^^^^^^^^^^^^
+
+Welcome to PyOD, a versatile Python library for detecting anomalies in multivariate data. Whether you're tackling a small-scale project or large datasets, PyOD offers a range of algorithms to suit your needs.
+
+* **For time-series outlier detection**, please use `TODS <https://github.com/datamllab/tods>`_.
+
+* **For graph outlier detection**, please use `PyGOD <https://pygod.org/>`_.
+
+* **Performance Comparison \& Datasets**: We have a 45-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_. The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
+
+* **Learn more about anomaly detection** \@ `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_
+
+* **PyOD on Distributed Systems**: you could also run `PyOD on databricks <https://www.databricks.com/blog/2023/03/13/unsupervised-outlier-detection-databricks.html>`_.
+
+----
+
+About PyOD
+^^^^^^^^^^
 
-PyOD is the most comprehensive and scalable **Python library** for **detecting outlying objects** in
+PyOD, established in 2017, has become a go-to **Python library** for **detecting anomalous/outlying objects** in
 multivariate data. This exciting yet challenging field is commonly referred as
 `Outlier Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_
 or `Anomaly Detection <https://en.wikipedia.org/wiki/Anomaly_detection>`_.
 
-PyOD includes more than 40 detection algorithms, from classical LOF (SIGMOD 2000) to
-the latest ECOD and DIF (TKDE 2022 and 2023). Since 2017, PyOD has been successfully used in numerous academic researches and
-commercial products with more than `10 million downloads <https://pepy.tech/project/pyod>`_.
+PyOD includes more than 50 detection algorithms, from classical LOF (SIGMOD 2000) to
+the cutting-edge ECOD and DIF (TKDE 2022 and 2023). Since 2017, PyOD has been successfully used in numerous academic researches and
+commercial products with more than `17 million downloads <https://pepy.tech/project/pyod>`_.
 It is also well acknowledged by the machine learning community with various dedicated posts/tutorials, including
 `Analytics Vidhya <https://www.analyticsvidhya.com/blog/2019/02/outlier-detection-python-pyod/>`_,
 `KDnuggets <https://www.kdnuggets.com/2019/02/outlier-detection-methods-cheat-sheet.html>`_, and
@@ -80,10 +95,10 @@ It is also well acknowledged by the machine learning community with various dedi
 
 **PyOD is featured for**:
 
-* **Unified APIs, detailed documentation, and interactive examples** across various algorithms.
-* **Advanced models**\, including **classical distance and density estimation**, **latest deep learning methods**, and **emerging algorithms like ECOD**.
-* **Optimized performance with JIT and parallelization** using `numba <https://github.com/numba/numba>`_ and `joblib <https://github.com/joblib/joblib>`_.
-* **Fast training & prediction with SUOD** [#Zhao2021SUOD]_.
+* **Unified, User-Friendly Interface** across various algorithms.
+* **Wide Range of Models**\, from classic techniques to the latest deep learning methods.
+* **High Performance & Efficiency**, leveraging `numba <https://github.com/numba/numba>`_ and `joblib <https://github.com/joblib/joblib>`_ for JIT compilation and parallel processing.
+* **Fast Training & Prediction**, achieved through the SUOD framework [#Zhao2021SUOD]_.
 
 
 **Outlier Detection with 5 Lines of Code**\ :
@@ -92,22 +107,19 @@ It is also well acknowledged by the machine learning community with various dedi
 .. code-block:: python
 
 
-    # train an ECOD detector
+    # Example: Training an ECOD detector
     from pyod.models.ecod import ECOD
     clf = ECOD()
     clf.fit(X_train)
+    y_train_scores = clf.decision_scores_  # Outlier scores for training data
+    y_test_scores = clf.decision_function(X_test)  # Outlier scores for test data
 
-    # get outlier scores
-    y_train_scores = clf.decision_scores_  # raw outlier scores on the train data
-    y_test_scores = clf.decision_function(X_test)  # predict raw outlier scores on test
-
-
-**Personal suggestion on selecting an OD algorithm**. If you do not know which algorithm to try, go with:
+**Selecting the Right Algorithm:** Unsure where to start? Consider these robust and interpretable options:
 
 - `ECOD <https://github.com/yzhao062/pyod/blob/master/examples/ecod_example.py>`_: Example of using ECOD for outlier detection
 - `Isolation Forest <https://github.com/yzhao062/pyod/blob/master/examples/iforest_example.py>`_: Example of using Isolation Forest for outlier detection
 
-They are both fast and interpretable. Or, you could try more data-driven approach `MetaOD <https://github.com/yzhao062/MetaOD>`_.
+Alternatively, explore `MetaOD <https://github.com/yzhao062/MetaOD>`_ for a data-driven approach.
 
 **Citing PyOD**\ :
 
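The quick-start hunk above trains an ECOD detector. For intuition about what ECOD-style scoring does, here is a rough, self-contained sketch of the underlying idea (empirical-CDF tail probabilities aggregated across features); `ecod_scores` is a hypothetical helper for illustration, not the library's implementation, which additionally handles skewness:

```python
import numpy as np

def ecod_scores(X):
    """Simplified ECOD-style outlier scores.

    For each feature, estimate left/right tail probabilities with the
    empirical CDF and sum the negative log of the smaller tail across
    features. Higher scores indicate more abnormal samples.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        order = np.sort(col)
        # Empirical P(x <= value) and P(x >= value) for every sample
        left = np.searchsorted(order, col, side="right") / n
        right = 1.0 - np.searchsorted(order, col, side="left") / n
        tail = np.minimum(left, right)
        scores += -np.log(np.clip(tail, 1.0 / n, 1.0))
    return scores

# A gross outlier at (10, 10) receives the highest score
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])
scores = ecod_scores(X)
```

Because the last sample is the most extreme value in both features, its tail probability is the minimum possible (1/n) in each dimension, so its aggregated score dominates the inliers'.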
@@ -131,29 +143,34 @@ or::
 
     Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of machine learning research (JMLR), 20(96), pp.1-7.
 
-If you want more general insights of anomaly detection and/or algorithm performance comparison, please see our
-NeurIPS 2022 paper `ADBench: Anomaly Detection Benchmark Paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-neurips-adbench.pdf>`_::
+For a broader perspective on anomaly detection, see our NeurIPS papers
+`ADBench: Anomaly Detection Benchmark Paper <https://viterbi-web.usc.edu/~yzhao010/papers/22-neurips-adbench.pdf>`_ \& `ADGym: Design Choices for Deep Anomaly Detection <https://viterbi-web.usc.edu/~yzhao010/papers/23-neurips-adgym.pdf>`_::
 
-    @inproceedings{han2022adbench,
-        title={ADBench: Anomaly Detection Benchmark},
-        author={Songqiao Han and Xiyang Hu and Hailiang Huang and Mingqi Jiang and Yue Zhao},
-        booktitle={Neural Information Processing Systems (NeurIPS)}
-        year={2022},
+    @article{han2022adbench,
+        title={Adbench: Anomaly detection benchmark},
+        author={Han, Songqiao and Hu, Xiyang and Huang, Hailiang and Jiang, Minqi and Zhao, Yue},
+        journal={Advances in Neural Information Processing Systems},
+        volume={35},
+        pages={32142--32159},
+        year={2022}
     }
 
-**Key Links and Resources**\ :
-
+    @article{jiang2023adgym,
+        title={ADGym: Design Choices for Deep Anomaly Detection},
+        author={Jiang, Minqi and Hou, Chaochuan and Zheng, Ao and Han, Songqiao and Huang, Hailiang and Wen, Qingsong and Hu, Xiyang and Zhao, Yue},
+        journal={Advances in Neural Information Processing Systems},
+        volume={36},
+        year={2023}
+    }
 
-* `View the latest codes on Github <https://github.com/yzhao062/pyod>`_
-* `Anomaly Detection Resources <https://github.com/yzhao062/anomaly-detection-resources>`_
 
 
 **Table of Contents**\ :
 
 
 * `Installation <#installation>`_
 * `API Cheatsheet & Reference <#api-cheatsheet--reference>`_
-* `ADBench Benchmark <#adbench-benchmark>`_
+* `ADBench Benchmark and Datasets <#adbench-benchmark-and-datasets>`_
 * `Model Save & Load <#model-save--load>`_
 * `Fast Train with SUOD <#fast-train-with-suod>`_
 * `Thresholding Outlier Scores <#thresholding-outlier-scores>`_
@@ -169,8 +186,8 @@ NeurIPS 2022 paper `ADBench: Anomaly Detection Benchmark Paper <https://www.andr
 Installation
 ^^^^^^^^^^^^
 
-It is recommended to use **pip** or **conda** for installation. Please make sure
-**the latest version** is installed, as PyOD is updated frequently:
+PyOD is designed for easy installation using either **pip** or **conda**.
+We recommend using the latest version of PyOD due to frequent updates and enhancements:
 
 .. code-block:: bash
 
@@ -193,7 +210,7 @@ Alternatively, you could clone and run setup.py file:
 **Required Dependencies**\ :
 
 
-* Python 3.6+
+* Python 3.6 or higher
 * joblib
 * matplotlib
 * numpy>=1.19
@@ -207,19 +224,12 @@ Alternatively, you could clone and run setup.py file:
 
 * combo (optional, required for models/combination.py and FeatureBagging)
 * keras/tensorflow (optional, required for AutoEncoder, and other deep learning models)
-* pandas (optional, required for running benchmark)
 * suod (optional, required for running SUOD model)
 * xgboost (optional, required for XGBOD)
-* pythresh to use thresholding
+* pythresh (optional, required for thresholding)
 
 **Warning**\ :
-PyOD has multiple neural network based models, e.g., AutoEncoders, which are
-implemented in both Tensorflow and PyTorch. However, PyOD does **NOT** install these deep learning libraries for you.
-This reduces the risk of interfering with your local copies.
-If you want to use neural-net based models, please make sure these deep learning libraries are installed.
-Instructions are provided: `neural-net FAQ <https://github.com/yzhao062/pyod/wiki/Setting-up-Keras-and-Tensorflow-for-Neural-net-Based-models>`_.
-Similarly, models depending on **xgboost**, e.g., XGBOD, would **NOT** enforce xgboost installation by default.
-
+PyOD includes several neural network-based models, such as AutoEncoders, implemented in Tensorflow and PyTorch. These deep learning libraries are not automatically installed by PyOD to avoid conflicts with existing installations. If you plan to use neural-net based models, please ensure these libraries are installed. See the `neural-net FAQ <https://github.com/yzhao062/pyod/wiki/Setting-up-Keras-and-Tensorflow-for-Neural-net-Based-models>`_ for guidance. Additionally, xgboost is not installed by default but is required for models like XGBOD.
 
 
 ----
@@ -228,29 +238,27 @@ Similarly, models depending on **xgboost**, e.g., XGBOD, would **NOT** enforce x
 API Cheatsheet & Reference
 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Full API Reference: (https://pyod.readthedocs.io/en/latest/pyod.html). API cheatsheet for all detectors:
-
+The full API Reference is available at `PyOD Documentation <https://pyod.readthedocs.io/en/latest/pyod.html>`_. Below is a quick cheatsheet for all detectors:
 
-* **fit(X)**\ : Fit detector. y is ignored in unsupervised methods.
-* **decision_function(X)**\ : Predict raw anomaly score of X using the fitted detector.
-* **predict(X)**\ : Predict if a particular sample is an outlier or not using the fitted detector.
-* **predict_proba(X)**\ : Predict the probability of a sample being outlier using the fitted detector.
-* **predict_confidence(X)**\ : Predict the model's sample-wise confidence (available in predict and predict_proba) [#Perini2020Quantifying]_.
+* **fit(X)**\ : Fit the detector. The parameter y is ignored in unsupervised methods.
+* **decision_function(X)**\ : Predict raw anomaly scores for X using the fitted detector.
+* **predict(X)**\ : Determine whether a sample is an outlier or not as binary labels using the fitted detector.
+* **predict_proba(X)**\ : Estimate the probability of a sample being an outlier using the fitted detector.
+* **predict_confidence(X)**\ : Assess the model's confidence on a per-sample basis (applicable in predict and predict_proba) [#Perini2020Quantifying]_.
 
 
-Key Attributes of a fitted model:
+**Key Attributes of a fitted model**:
 
 
-* **decision_scores_**\ : The outlier scores of the training data. The higher, the more abnormal.
-  Outliers tend to have higher scores.
-* **labels_**\ : The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.
+* **decision_scores_**\ : Outlier scores of the training data. Higher scores typically indicate more abnormal behavior. Outliers usually have higher scores.
+* **labels_**\ : Binary labels of the training data, where 0 indicates inliers and 1 indicates outliers/anomalies.
 
 
 ----
 
 
-ADBench Benchmark
-^^^^^^^^^^^^^^^^^
+ADBench Benchmark and Datasets
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 We just released a 45-page, the most comprehensive `ADBench: Anomaly Detection Benchmark <https://arxiv.org/abs/2206.09426>`_ [#Han2022ADBench]_.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 57 benchmark datasets.
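The `decision_scores_`/`labels_` relationship described in the cheatsheet above follows a simple convention: binary labels come from thresholding the training scores at a contamination-driven percentile. A minimal numpy sketch of that convention (`labels_from_scores` is an illustrative helper, not PyOD's code):

```python
import numpy as np

def labels_from_scores(decision_scores, contamination=0.1):
    """Flag roughly the top `contamination` fraction of scores as outliers."""
    scores = np.asarray(decision_scores, dtype=float)
    # Threshold at the (1 - contamination) percentile of the training scores
    threshold = np.percentile(scores, 100 * (1 - contamination))
    return (scores > threshold).astype(int)  # 1 = outlier, 0 = inlier

# One sample scores far above the rest; with contamination=0.1 it alone is flagged
scores = np.array([0.10, 0.11, 0.12, 0.13, 0.15, 0.16, 0.17, 0.18, 0.20, 5.0])
labels = labels_from_scores(scores, contamination=0.1)
```

This is also why changing `contamination` never changes the ranking of samples, only where the 0/1 cut is placed.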
@@ -262,16 +270,12 @@ The organization of **ADBench** is provided below:
    :alt: benchmark-fig
 
 
-**The comparison of selected models** is made available below
-(\ `Figure <https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png>`_\ ,
-`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\ ,
-`Interactive Jupyter Notebooks <https://mybinder.org/v2/gh/yzhao062/pyod/master>`_\ ).
-For Jupyter Notebooks, please navigate to **"/notebooks/Compare All Models.ipynb"**.
-
+For a simpler visualization, we make **the comparison of selected models** via
+`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\ .
 
-.. image:: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
-   :target: https://raw.githubusercontent.com/yzhao062/pyod/master/examples/ALL.png
-   :alt: Comparision_of_All
+.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :alt: Comparison_of_All
 
 
 
Threshold.rst

Whitespace-only changes.

docs/api_cc.rst

Lines changed: 9 additions & 9 deletions
@@ -1,20 +1,20 @@
 API CheatSheet
 ==============
 
-The following APIs are applicable for all detector models for easy use.
+The full API Reference is available at `PyOD Documentation <https://pyod.readthedocs.io/en/latest/pyod.html>`_. Below is a quick cheatsheet for all detectors:
 
-* :func:`pyod.models.base.BaseDetector.fit`: Fit detector. y is ignored in unsupervised methods.
-* :func:`pyod.models.base.BaseDetector.decision_function`: Predict raw anomaly score of X using the fitted detector.
-* :func:`pyod.models.base.BaseDetector.predict`: Predict if a particular sample is an outlier or not using the fitted detector.
-* :func:`pyod.models.base.BaseDetector.predict_proba`: Predict the probability of a sample being outlier using the fitted detector.
-* :func:`pyod.models.base.BaseDetector.predict_confidence`: Predict the model's sample-wise confidence (available in predict and predict_proba).
+* :func:`pyod.models.base.BaseDetector.fit`: The parameter y is ignored in unsupervised methods.
+* :func:`pyod.models.base.BaseDetector.decision_function`: Predict raw anomaly scores for X using the fitted detector.
+* :func:`pyod.models.base.BaseDetector.predict`: Determine whether a sample is an outlier or not as binary labels using the fitted detector.
+* :func:`pyod.models.base.BaseDetector.predict_proba`: Estimate the probability of a sample being an outlier using the fitted detector.
+* :func:`pyod.models.base.BaseDetector.predict_confidence`: Assess the model's confidence on a per-sample basis (applicable in predict and predict_proba) [#Perini2020Quantifying]_.
 
 
-Key Attributes of a fitted model:
+**Key Attributes of a fitted model**:
 
-* :attr:`pyod.models.base.BaseDetector.decision_scores_`: The outlier scores of the training data. The higher, the more abnormal.
+* :attr:`pyod.models.base.BaseDetector.decision_scores_`: Outlier scores of the training data. Higher scores typically indicate more abnormal behavior. Outliers usually have higher scores.
   Outliers tend to have higher scores.
-* :attr:`pyod.models.base.BaseDetector.labels_`: The binary labels of the training data. 0 stands for inliers and 1 for outliers/anomalies.
+* :attr:`pyod.models.base.BaseDetector.labels_`: Binary labels of the training data, where 0 indicates inliers and 1 indicates outliers/anomalies.
 
 
 See base class definition below:
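As a companion to the `predict_proba` entry in the cheatsheet above: one common way to turn raw scores into pseudo-probabilities is linear (min-max) scaling of the training scores, returning `[P(inlier), P(outlier)]` columns. The sketch below illustrates that idea only; `scores_to_proba` is a hypothetical helper, not the library's exact method (PyOD also offers other scaling modes):

```python
import numpy as np

def scores_to_proba(scores):
    """Min-max scale raw outlier scores into a two-column probability layout."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        outlier_p = np.zeros_like(s)  # identical scores: no evidence of outliers
    else:
        outlier_p = (s - s.min()) / span
    # Column 0: P(inlier), column 1: P(outlier); rows sum to 1
    return np.column_stack([1.0 - outlier_p, outlier_p])

proba = scores_to_proba([0.0, 1.0, 4.0])
```

Because the scaling is affine, it preserves the ranking from `decision_function` while mapping the extremes to exactly 0 and 1.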

docs/benchmark.rst

Lines changed: 8 additions & 1 deletion
@@ -4,7 +4,7 @@ Benchmarks
 Latest ADBench (2022)
 ---------------------
 
-We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://www.andrew.cmu.edu/user/yuezhao2/papers/22-preprint-adbench.pdf>`_ :cite:`a-han2022adbench`.
+We just released a 36-page, the most comprehensive `anomaly detection benchmark paper <https://arxiv.org/abs/2206.09426>`_ :cite:`a-han2022adbench`.
 The fully `open-sourced ADBench <https://github.com/Minqi824/ADBench>`_ compares 30 anomaly detection algorithms on 55 benchmark datasets.
 
 The organization of **ADBench** is provided below:
@@ -14,6 +14,13 @@ The organization of **ADBench** is provided below:
    :alt: benchmark
 
 
+For a simpler visualization, we make **the comparison of selected models** via
+`compare_all_models.py <https://github.com/yzhao062/pyod/blob/master/examples/compare_all_models.py>`_\ .
+
+.. image:: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :target: https://github.com/yzhao062/pyod/blob/development/examples/ALL.png?raw=true
+   :alt: Comparison_of_All
+
 Old Results (2019)
 ------------------
 
