|
1 | 1 | Outliers |
2 | 2 | ======== |
3 | 3 |
|
4 | | -Simple outlier detection by comparing distances between instances. |
| 4 | +Outlier detection widget. |
5 | 5 |
|
6 | 6 | **Inputs** |
7 | 7 |
|
8 | 8 | - Data: input dataset |
9 | | -- Distances: distance matrix |
10 | 9 |
|
11 | 10 | **Outputs** |
12 | 11 |
|
13 | 12 | - Outliers: instances scored as outliers |
14 | 13 | - Inliers: instances not scored as outliers |
|    | 14 | +- Data: input dataset with an appended *Outlier* variable |
15 | 15 |
|
16 | | -The **Outliers** widget applies one of the two methods for outlier detection. Both methods apply classification to the dataset, one with SVM (multiple kernels) and the other with elliptical envelope. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* works only for data with Gaussian distribution. |
|    | 16 | +The **Outliers** widget applies one of four methods for outlier detection. All methods fit a model to the data and score each instance as an inlier or an outlier. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* works only for data with a Gaussian distribution. An efficient way to perform outlier detection on moderately high-dimensional datasets is the *Local Outlier Factor* algorithm, which computes a score reflecting the degree of abnormality of each observation by measuring the local density deviation of a data point with respect to its neighbors. Another efficient approach for high-dimensional datasets is *Isolation Forest*, an ensemble of randomized trees. |
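The four methods correspond to the scikit-learn estimators linked below. As a minimal sketch (assuming scikit-learn is installed; the synthetic dataset and parameter values here are illustrative, not the widget's defaults), all four share the same fit-and-label interface:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# 95 points from a Gaussian cluster plus 5 scattered points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(95, 2)),
               rng.uniform(-6, 6, size=(5, 2))])

detectors = {
    "One Class SVM": OneClassSVM(nu=0.1, kernel="rbf", gamma="scale"),
    "Covariance Estimator": EllipticEnvelope(contamination=0.1),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.1),
    "Isolation Forest": IsolationForest(contamination=0.1, random_state=0),
}
for name, det in detectors.items():
    labels = det.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(name, "flagged", int((labels == -1).sum()), "outliers")
```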
17 | 17 |
|
18 | 18 |  |
19 | 19 |
|
20 | | -1. Information on the input data, number of inliers and outliers based on the selected model. |
21 | | -2. Select the *Outlier detection method*: |
| 20 | +1. Method for outlier detection: |
| 21 | + - [One Class SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) |
| 22 | + - [Covariance Estimator](http://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html) |
| 23 | + - [Local Outlier Factor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) |
| 24 | + - [Isolation Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) |
| 25 | +2. Set parameters for the method: |
22 | 26 | - **One class SVM with non-linear kernel (RBF)**: classifies data as similar or different from the core class: |
23 | | - - **Nu** is a parameter for the upper bound on the fraction of training errors and a lower bound of the fraction of support vectors |
24 | | - - **Kernel coefficient** is a gamma parameter, which specifies how much influence a single data instance has |
25 | | - - **Covariance estimator**: fits ellipsis to central points with Mahalanobis distance metric |
26 | | - - **Contamination** is the proportion of outliers in the dataset |
27 | | - - **Support fraction** specifies the proportion of points included in the estimate |
28 | | -3. Produce a report. |
29 | | -4. Click *Detect outliers* to output the data. |
|    | 27 | +  - *Nu* is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors |
|    | 28 | +  - *Kernel coefficient* is the gamma parameter of the RBF kernel, which specifies how much influence a single training instance has |
|    | 29 | + - **Covariance estimator**: fits an ellipse to the central points using the Mahalanobis distance metric: |
| 30 | + - *Contamination* is the proportion of outliers in the dataset |
| 31 | + - *Support fraction* specifies the proportion of points included in the estimate |
| 32 | + - **Local Outlier Factor**: obtains local density from the k-nearest neighbors: |
| 33 | + - *Contamination* is the proportion of outliers in the dataset |
|    | 34 | +  - *Neighbors* specifies the number of neighbors used to estimate local density |
| 35 | + - *Metric* is the distance measure |
| 36 | + - **Isolation Forest**: isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature: |
| 37 | + - *Contamination* is the proportion of outliers in the dataset |
|    | 38 | +  - *Replicable training* fixes the random seed |
| 39 | +3. If *Apply automatically* is ticked, changes will be propagated automatically. Alternatively, click *Apply*. |
| 40 | +4. Produce a report. |
|    | 41 | +5. Number of instances on the input, followed by the number of instances scored as inliers. |
| 42 | + |
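The settings above map onto scikit-learn parameters. A small sketch for the *Local Outlier Factor* case, assuming scikit-learn (the dataset and values are illustrative; the widget's *Contamination*, *Neighbors*, and *Metric* controls correspond to the parameters shown):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tight Gaussian cluster plus one obvious outlier at (8, 8)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               [[8.0, 8.0]]])

lof = LocalOutlierFactor(
    contamination=0.05,   # Contamination: expected proportion of outliers
    n_neighbors=20,       # Neighbors
    metric="euclidean",   # Metric
)
labels = lof.fit_predict(X)  # -1 marks outliers, +1 inliers
print("flagged:", int((labels == -1).sum()))
```

The point at (8, 8) has a much lower local density than its neighbors, so it receives a high Local Outlier Factor and is flagged.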
30 | 43 |
|
31 | 44 | Example |
32 | 45 | ------- |
33 | 46 |
|
34 | | -Below, is a simple example of how to use this widget. We used the *Iris* dataset to detect the outliers. We chose the *one class SVM with non-linear kernel (RBF)* method, with Nu set at 20% (less training errors, more support vectors). Then we observed the outliers in the [Data Table](../data/datatable.md) widget, while we sent the inliers to the [Scatter Plot](../visualize/scatterplot.md). |
|    | 47 | +Below is an example of how to use this widget. We used a subset (*versicolor* and *virginica* instances) of the *Iris* dataset to detect the outliers. We chose the *Local Outlier Factor* method with *Euclidean* distance. Then we observed the annotated instances in the [Scatter Plot](../visualize/scatterplot.md) widget. In the next step, we used the *setosa* instances to demonstrate novelty detection with the [Apply Domain](../data/applydomain.md) widget. After concatenating both outputs, we examined the outliers in *Scatter Plot (1)*. |
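A rough script analogue of this workflow, assuming scikit-learn's iris loader (the widget does this through Orange, not this code): fit *Local Outlier Factor* on the two training classes, then score the held-out *setosa* instances as novelties via `novelty=True`:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import LocalOutlierFactor

X, y = load_iris(return_X_y=True)
train = X[y != 0]    # versicolor + virginica
setosa = X[y == 0]   # held out to demonstrate novelty detection

# novelty=True enables predict() on data not seen during fitting
lof = LocalOutlierFactor(n_neighbors=20, metric="euclidean", novelty=True)
lof.fit(train)
pred = lof.predict(setosa)  # -1 = outlier relative to the training classes
print(int((pred == -1).sum()), "of", len(setosa), "setosa instances flagged")
```

Since *setosa* is well separated from the other two species in feature space, most of its instances are flagged as outliers, mirroring the concatenated result examined in *Scatter Plot (1)*.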
35 | 48 |
|
36 | 49 |  |