| 1 | +# scikit-sos |
| 2 | + |
| 3 | +scikit-sos is a Python module for Stochastic Outlier Selection (SOS). It is compatible with scikit-learn. SOS is an unsupervised outlier selection algorithm. It uses the concept of affinity to compute an outlier probability for each data point. |
| 4 | + |
| 5 | + |
| 6 | + |
| 7 | +For more information about SOS, see the technical report: J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. [Stochastic Outlier Selection](https://github.com/jeroenjanssens/sos/blob/master/doc/sos-ticc-tr-2012-001.pdf?raw=true). Technical Report TiCC TR 2012-001, Tilburg University, Tilburg, the Netherlands, 2012. |
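To make the affinity idea above concrete, here is a minimal, simplified sketch in pure Python. It is not the code in `sksos`: the function name `sos_sketch` and the `variance` parameter are hypothetical, and it assumes one fixed variance for every point, whereas the technical report derives a per-point variance from the perplexity. It only illustrates the chain from affinities to binding probabilities to outlier probabilities.

```python
import math


def sos_sketch(points, variance=1.0):
    """Simplified SOS-style scores; a fixed variance is an assumption of this sketch."""
    n = len(points)
    # affinity[i][j]: affinity that point i has with point j (zero with itself).
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                affinity[i][j] = math.exp(-sq_dist / (2.0 * variance))
    # Binding probabilities: each row of affinities normalized to sum to 1.
    binding = []
    for row in affinity:
        total = sum(row)
        binding.append([a / total for a in row])
    # Outlier probability of point j: probability that no other point binds to it.
    return [
        math.prod(1.0 - binding[i][j] for i in range(n) if i != j)
        for j in range(n)
    ]


# Four points in a tight cluster and one point far away from them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = sos_sketch(points)
print(scores.index(max(scores)))  # index of the most outlying point
```

The isolated point receives almost no binding probability from the others, so its score is close to 1. For real data, use the `SOS` class shown under Usage rather than this sketch.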
| 8 | + |
| 9 | +## Install |
| 10 | + |
| 11 | +Using pip: |
| 12 | + |
| 13 | +```bash |
| 14 | +pip install scikit-sos |
| 15 | +``` |
| 16 | + |
| 17 | +Using uv (recommended for fast installation): |
| 18 | + |
| 19 | +```bash |
| 20 | +# Install uv if not already installed |
| 21 | +curl -LsSf https://astral.sh/uv/install.sh | sh |
| 22 | + |
| 23 | +# Install scikit-sos |
| 24 | +uv pip install scikit-sos |
| 25 | +``` |
| 26 | + |
| 27 | +## Development |
| 28 | + |
| 29 | +This project uses modern Python tooling: |
| 30 | + |
| 31 | +- **uv** for fast package management |
| 32 | +- **ruff** for linting and formatting |
| 33 | +- **mypy** for type checking |
| 34 | +- **pytest** for testing |
| 35 | + |
| 36 | +To set up a development environment: |
| 37 | + |
| 38 | +```bash |
| 39 | +# Clone repository |
| 40 | +git clone https://github.com/jeroenjanssens/scikit-sos.git |
| 41 | +cd scikit-sos |
| 42 | + |
| 43 | +# Create virtual environment and install with dev dependencies |
| 44 | +uv venv |
| 45 | +source .venv/bin/activate # On Windows: .venv\Scripts\activate |
| 46 | +uv pip install -e ".[dev]" |
| 47 | + |
| 48 | +# Install pre-commit hooks |
| 49 | +pre-commit install |
| 50 | +``` |
| 51 | + |
| 52 | +Run tests: |
| 53 | + |
| 54 | +```bash |
| 55 | +pytest |
| 56 | +``` |
| 57 | + |
| 58 | +Run linting: |
| 59 | + |
| 60 | +```bash |
| 61 | +ruff check . |
| 62 | +``` |
| 63 | + |
| 64 | +Run formatting: |
| 65 | + |
| 66 | +```bash |
| 67 | +ruff format . |
| 68 | +``` |
| 69 | + |
| 70 | +Run type checking: |
| 71 | + |
| 72 | +```bash |
| 73 | +mypy sksos |
| 74 | +``` |
| 75 | + |
| 76 | +### Type Hints |
| 77 | + |
| 78 | +This package includes full type hints for better IDE support: |
| 79 | + |
| 80 | +```python |
| 81 | +from sksos import SOS |
| 82 | +import numpy as np |
| 83 | +from numpy.typing import NDArray |
| 84 | + |
| 85 | +# Type hints work automatically |
| 86 | +detector: SOS = SOS(perplexity=20) |
| 87 | +data: NDArray = np.array([[1, 2], [3, 4]]) |
| 88 | +scores: NDArray = detector.predict(data) |
| 89 | +``` |
| 90 | + |
| 91 | +## Usage |
| 92 | + |
| 93 | +```python |
| 94 | +>>> import pandas as pd |
| 95 | +>>> from sksos import SOS |
| 96 | +>>> iris = pd.read_csv("http://bit.ly/iris-csv") |
| 97 | +>>> X = iris.drop("Name", axis=1).values |
| 98 | +>>> detector = SOS() |
| 99 | +>>> iris["score"] = detector.predict(X) |
| 100 | +>>> iris.sort_values("score", ascending=False).head(10) |
| 101 | + SepalLength SepalWidth PetalLength PetalWidth Name score |
| 102 | +41 4.5 2.3 1.3 0.3 Iris-setosa 0.981898 |
| 103 | +106 4.9 2.5 4.5 1.7 Iris-virginica 0.964381 |
| 104 | +22 4.6 3.6 1.0 0.2 Iris-setosa 0.957945 |
| 105 | +134 6.1 2.6 5.6 1.4 Iris-virginica 0.897970 |
| 106 | +24 4.8 3.4 1.9 0.2 Iris-setosa 0.871733 |
| 107 | +114 5.8 2.8 5.1 2.4 Iris-virginica 0.831610 |
| 108 | +62 6.0 2.2 4.0 1.0 Iris-versicolor 0.821141 |
| 109 | +108 6.7 2.5 5.8 1.8 Iris-virginica 0.819842 |
| 110 | +44 5.1 3.8 1.9 0.4 Iris-setosa 0.773301 |
| 111 | +100 6.3 3.3 6.0 2.5 Iris-virginica 0.765657 |
| 112 | +``` |
| 113 | + |
| 114 | +## Command Line Interface |
| 115 | + |
| 116 | +This module also includes a command-line tool called `sos`. To illustrate, we apply SOS with a perplexity of 10 to the Iris dataset: |
| 117 | + |
| 118 | +```bash |
| 119 | +$ curl -sL http://bit.ly/iris-csv | |
| 120 | +> tail -n +2 | cut -d, -f1-4 | |
| 121 | +> sos -p 10 | |
| 122 | +> sort -nr | head |
| 123 | +0.98189840 |
| 124 | +0.96438132 |
| 125 | +0.95794492 |
| 126 | +0.89797043 |
| 127 | +0.87173299 |
| 128 | +0.83161045 |
| 129 | +0.82114072 |
| 130 | +0.81984209 |
| 131 | +0.77330148 |
| 132 | +0.76565738 |
| 133 | +``` |
| 134 | + |
| 135 | +Adding a threshold causes SOS to output 0s and 1s instead of outlier probabilities. With a threshold of 0.8, 8 of the 150 data points are selected as outliers: |
| 136 | + |
| 137 | +```bash |
| 138 | +$ curl -sL http://bit.ly/iris-csv | |
| 139 | +> tail -n +2 | cut -d, -f1-4 | |
| 140 | +> sos -p 10 -t 0.8 | |
| 141 | +> paste -sd+ | bc |
| 142 | +8 |
| 143 | +``` |
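The effect of the threshold can be reproduced by hand on the ten highest scores listed earlier. This is a minimal sketch rather than the tool's own code: whether `sos` compares with `>` or `>=` is an assumption here, and none of these values lies exactly on the boundary.

```python
# The ten highest outlier probabilities from the command-line example above.
top_scores = [
    0.98189840, 0.96438132, 0.95794492, 0.89797043, 0.87173299,
    0.83161045, 0.82114072, 0.81984209, 0.77330148, 0.76565738,
]
threshold = 0.8

# 1 marks a selected outlier, 0 an inlier (strict comparison is an assumption).
labels = [1 if score > threshold else 0 for score in top_scores]
n_outliers = sum(labels)
print(n_outliers)  # 8
```

Every remaining score is lower than the smallest value listed here, so the total for all 150 points is also 8.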
| 144 | + |
| 145 | +## License |
| 146 | + |
| 147 | +All software in this repository is distributed under the terms of the BSD Simplified License. The full license is in the LICENSE file. |