|
1 | 1 | Outliers |
2 | 2 | ======== |
3 | 3 |
|
4 | | -Simple outlier detection by comparing distances between instances. |
| 4 | +Outlier detection widget. |
5 | 5 |
|
6 | 6 | **Inputs** |
7 | 7 |
|
8 | 8 | - Data: input dataset |
9 | | -- Distances: distance matrix |
10 | 9 |
|
11 | 10 | **Outputs** |
12 | 11 |
|
13 | 12 | - Outliers: instances scored as outliers |
14 | 13 | - Inliers: instances not scored as outliers |
|    | 14 | +- Data: input dataset with an appended *Outlier* variable |
15 | 15 |
|
16 | | -The **Outliers** widget applies one of the two methods for outlier detection. Both methods apply classification to the dataset, one with SVM (multiple kernels) and the other with elliptical envelope. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* works only for data with Gaussian distribution. |
|    | 16 | +The **Outliers** widget applies one of four methods for outlier detection. All methods fit a model to the data and score each instance as an inlier or an outlier. *One-class SVM with non-linear kernels (RBF)* performs well with non-Gaussian distributions, while *Covariance estimator* works only for data with a Gaussian distribution. An efficient way to perform outlier detection on moderately high-dimensional datasets is the *Local Outlier Factor* algorithm, which computes a score reflecting the degree of abnormality of each observation by measuring the local density deviation of a data point with respect to its neighbors. Another efficient approach for high-dimensional datasets is *Isolation Forest*, an ensemble of randomized trees. |
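The four methods correspond to the scikit-learn estimators linked below. As a minimal sketch (assuming scikit-learn is installed; the synthetic dataset and parameter values here are illustrative, not the widget's defaults), all four share the same fit-and-label interface:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

# 95 points from a Gaussian cluster plus 5 scattered points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(95, 2)),
               rng.uniform(-6, 6, size=(5, 2))])

detectors = {
    "One Class SVM": OneClassSVM(nu=0.1, kernel="rbf", gamma="scale"),
    "Covariance Estimator": EllipticEnvelope(contamination=0.1),
    "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, contamination=0.1),
    "Isolation Forest": IsolationForest(contamination=0.1, random_state=0),
}
for name, det in detectors.items():
    labels = det.fit_predict(X)  # +1 = inlier, -1 = outlier
    print(name, "flagged", int((labels == -1).sum()), "outliers")
```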
17 | 17 |
|
18 | 18 |  |
19 | 19 |
|
20 | | -1. Information on the input data, number of inliers and outliers based on the selected model. |
21 | | -2. Select the *Outlier detection method*: |
| 20 | +1. Method for outlier detection: |
| 21 | + - [One Class SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) |
| 22 | + - [Covariance Estimator](http://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html) |
| 23 | + - [Local Outlier Factor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) |
| 24 | + - [Isolation Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) |
| 25 | +2. Set parameters for the method: |
22 | 26 | - **One class SVM with non-linear kernel (RBF)**: classifies data as similar or different from the core class: |
23 | | - - **Nu** is a parameter for the upper bound on the fraction of training errors and a lower bound of the fraction of support vectors |
24 | | - - **Kernel coefficient** is a gamma parameter, which specifies how much influence a single data instance has |
25 | | - - **Covariance estimator**: fits ellipsis to central points with Mahalanobis distance metric |
26 | | - - **Contamination** is the proportion of outliers in the dataset |
27 | | - - **Support fraction** specifies the proportion of points included in the estimate |
28 | | -3. Produce a report. |
29 | | -4. Click *Detect outliers* to output the data. |
|    | 27 | +  - *Nu* is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors |
|    | 28 | +  - *Kernel coefficient* is the gamma parameter of the RBF kernel, which specifies how much influence a single training instance has |
|    | 29 | + - **Covariance estimator**: fits an ellipse to the central points using the Mahalanobis distance metric: |
| 30 | + - *Contamination* is the proportion of outliers in the dataset |
| 31 | + - *Support fraction* specifies the proportion of points included in the estimate |
| 32 | + - **Local Outlier Factor**: obtains local density from the k-nearest neighbors: |
| 33 | + - *Contamination* is the proportion of outliers in the dataset |
|    | 34 | +  - *Neighbors* specifies the number of neighbors used to estimate local density |
| 35 | + - *Metric* is the distance measure |
| 36 | + - **Isolation Forest**: isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature: |
| 37 | + - *Contamination* is the proportion of outliers in the dataset |
|    | 38 | +  - *Replicable training* fixes the random seed |
| 39 | +3. If *Apply automatically* is ticked, changes will be propagated automatically. Alternatively, click *Apply*. |
| 40 | +4. Produce a report. |
|    | 41 | +5. Number of instances on the input, followed by the number of instances scored as inliers. |
| 42 | + |
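The settings above map onto scikit-learn parameters. A small sketch for the *Local Outlier Factor* case, assuming scikit-learn (the dataset and values are illustrative; the widget's *Contamination*, *Neighbors*, and *Metric* controls correspond to the parameters shown):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A tight Gaussian cluster plus one obvious outlier at (8, 8)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               [[8.0, 8.0]]])

lof = LocalOutlierFactor(
    contamination=0.05,   # Contamination: expected proportion of outliers
    n_neighbors=20,       # Neighbors
    metric="euclidean",   # Metric
)
labels = lof.fit_predict(X)  # -1 marks outliers, +1 inliers
print("flagged:", int((labels == -1).sum()))
```

The point at (8, 8) has a much lower local density than its neighbors, so it receives a high Local Outlier Factor and is flagged.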
30 | 43 |
|
31 | 44 | Example |
32 | 45 | ------- |
33 | 46 |
|
34 | | -Below, is a simple example of how to use this widget. We used the *Iris* dataset to detect the outliers. We chose the *one class SVM with non-linear kernel (RBF)* method, with Nu set at 20% (less training errors, more support vectors). Then we observed the outliers in the [Data Table](../data/datatable.md) widget, while we sent the inliers to the [Scatter Plot](../visualize/scatterplot.md). |
|    | 47 | +Below is an example of how to use this widget. We used a subset (*versicolor* and *virginica* instances) of the *Iris* dataset to detect the outliers. We chose the *Local Outlier Factor* method with *Euclidean* distance. Then we observed the annotated instances in the [Scatter Plot](../visualize/scatterplot.md) widget. In the next step, we used the *setosa* instances to demonstrate novelty detection with the [Apply Domain](../data/applydomain.md) widget. After concatenating both outputs, we examined the outliers in *Scatter Plot (1)*. |
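A rough script analogue of this workflow, assuming scikit-learn's iris loader (the widget does this through Orange, not this code): fit *Local Outlier Factor* on the two training classes, then score the held-out *setosa* instances as novelties via `novelty=True`:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import LocalOutlierFactor

X, y = load_iris(return_X_y=True)
train = X[y != 0]    # versicolor + virginica
setosa = X[y == 0]   # held out to demonstrate novelty detection

# novelty=True enables predict() on data not seen during fitting
lof = LocalOutlierFactor(n_neighbors=20, metric="euclidean", novelty=True)
lof.fit(train)
pred = lof.predict(setosa)  # -1 = outlier relative to the training classes
print(int((pred == -1).sum()), "of", len(setosa), "setosa instances flagged")
```

Since *setosa* is well separated from the other two species in feature space, most of its instances are flagged as outliers, mirroring the concatenated result examined in *Scatter Plot (1)*.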
35 | 48 |
|
36 | 49 |  |