
Conversation


@Powerscore Powerscore commented Jan 3, 2026

Summary

This PR adds dimensional explainability to the Local Outlier Factor (LOF) detector. It implements the same interpretability API pattern proposed for KNN (PR #652), providing consistent explain_outlier() visualization and get_outlier_explainability_scores() programmatic access across PyOD's core detectors.

Motivation

LOF is a density-based algorithm that excels at identifying local outliers. However, a single global LOF score does not indicate which subspace or feature exhibits the density contrast responsible for the anomaly. This PR addresses that gap by:

  1. Consistency: Implementing a method that evaluates 1D density using the original k-nearest neighbors from the full-dimensional space (ensuring the explanation reflects the anomaly that was actually detected).
  2. Visualization: Providing horizontal bar charts of 1D LOF scores with statistical significance bands.
  3. Access: Enabling programmatic access to dimensional scores via get_outlier_explainability_scores() (completing the interface started in COPOD/KNN).

Changes Made

Core Implementation (pyod/models/lof.py)

  1. Store Training Data

    • Added self.X_train_ = X to enable subspace density calculations.
    • Follows the pattern established in COPOD and KNN (PR #652: Add dimensional explainability to KNN detector).
    • Trade-off: Increases memory usage (O(N×D)) but is strictly necessary for dimensional density estimation.
  2. Lazy Neighbor Caching (_ensure_overall_neighbors)

    • Unlike KNN, sklearn's LOF implementation does not readily expose the k-NN graph. This method lazily computes and caches the global k-NN graph on the first request for an explanation (see the sketch after this list).
  3. Vectorized Subspace Calculation (_compute_lof_subspace_with_neighbors)

    • Implements the 1D LOF formula using the global neighbor set.
    • Uses a fully vectorized approach (numpy array operations) rather than loops to ensure performance.
    • Includes a caching strategy (_cached_1d_k_distances, _cached_1d_lof_scores) to prevent re-computing distances for the same dimension multiple times.
  4. Main Methods

    • explain_outlier(): Visualization with statistical cutoffs (percentiles across training data).
    • get_outlier_explainability_scores(): Returns the raw 1D LOF scores for specific dimensions.
  5. Added Imports (Lines ~7-9)

    • import warnings
    • import matplotlib.pyplot as plt
    • import numpy as np
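
The sketch below illustrates items 2 and 3 in standalone form. It is a simplified approximation rather than the code added in this PR: the helper names (overall_neighbors, lof_1d_with_global_neighbors) are illustrative, and the 1D k-distance is taken as the largest in-dimension gap to the fixed global neighbor set.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def overall_neighbors(X, k):
    """Global k-NN graph: indices of each point's k neighbors in the full feature space."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    return idx[:, 1:]  # drop the query point itself

def lof_1d_with_global_neighbors(X, neighbor_idx, d):
    """1D LOF scores in dimension d, holding the global neighbor set fixed."""
    x = X[:, d]                                                     # (N,)
    dist_to_neighbors = np.abs(x[:, None] - x[neighbor_idx])        # (N, k)
    # 1D "k-distance" of each point: largest in-dimension gap to its fixed neighbors
    k_dist_1d = dist_to_neighbors.max(axis=1)                       # (N,)
    # reach-dist(p, o) = max(k-distance(o), |p_d - o_d|)
    reach = np.maximum(k_dist_1d[neighbor_idx], dist_to_neighbors)  # (N, k)
    # Local reachability density and LOF, restricted to dimension d
    lrd = 1.0 / (reach.mean(axis=1) + 1e-10)                        # (N,)
    return lrd[neighbor_idx].mean(axis=1) / lrd                     # (N,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[0, 2] += 6.0                                  # make point 0 an outlier in dimension 2
    idx = overall_neighbors(X, k=20)
    scores = [lof_1d_with_global_neighbors(X, idx, d)[0] for d in range(X.shape[1])]
    print(np.round(scores, 2))                      # dimension 2 should dominate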

Example (examples/lof_interpretability.py)

Created a clean example using cardio.mat that mirrors the KNN example:

  • Demonstrates basic usage on high-dimensional data.
  • Shows custom cutoff bands.
  • Explains the difference between global scores and dimensional contributions.

Tests (pyod/test/test_lof.py)

Added test_get_outlier_explainability_scores which validates the math on a synthetic 2D dataset where outliers are obvious in specific dimensions (e.g., verifying that an X-axis outlier has a higher X-dimension LOF score).
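
For illustration, a minimal sketch of the kind of check this test performs (the data construction and the exact assertion are simplified, and it assumes the returned scores are indexed by feature):

import numpy as np
from pyod.models.lof import LOF

# Dense 2D Gaussian cluster plus one point pushed far out along the X axis only.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
X[0, 0] += 8.0                      # outlier in X, normal in Y

clf = LOF(n_neighbors=10)
clf.fit(X)

scores = clf.get_outlier_explainability_scores(ind=0)
assert scores[0] > scores[1]        # the X-dimension 1D LOF score should dominate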

API Design

The API mirrors the explain_outlier() interface established in COPOD and the recent KNN submission (PR #652):

Feature         KNN (PR #652)                             LOF (This PR)
Method name     explain_outlier()                         explain_outlier()
Parameters      ind, columns, cutoffs, feature_names...   Same
Metric          Avg Distance to k-NN                      1D LOF Score (using global neighbors)
Programmatic    get_outlier_explainability_scores()       get_outlier_explainability_scores()

Usage Example:

from pyod.models.lof import LOF
from pyod.utils.data import generate_data

X_train, _, _, _ = generate_data(n_train=200, n_features=5)
clf = LOF(n_neighbors=20)
clf.fit(X_train)

# Visualize explanation
clf.explain_outlier(ind=0, feature_names=['F1', 'F2', 'F3', 'F4', 'F5'])

# Get raw scores
scores = clf.get_outlier_explainability_scores(ind=0)
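
A hedged variant with custom cutoff bands, assuming the cutoffs argument follows the same percentile-list convention as COPOD's explain_outlier():

# Custom percentile bands for the significance lines (values shown are illustrative)
clf.explain_outlier(ind=0, cutoffs=[0.95, 0.99])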

Technical Details

Algorithm:

To explain an outlier $p$ in dimension $d$:

  1. Retrieve the set of $k$-nearest neighbors $\mathcal{N}_k(p)$ found in the full feature space.
  2. Calculate 1D Reachability Distance in dimension $d$ using these fixed neighbors:
    $$reach\text{-}dist_k^{(d)}(p, o) = \max( \text{k-distance}^{(d)}(o), |p_d - o_d| )$$
  3. Compute 1D Local Reachability Density (LRD) and the resulting 1D LOF score.
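    For reference, these are the standard LRD and LOF definitions restricted to dimension $d$ and evaluated over the fixed neighbor set $\mathcal{N}_k(p)$ (no symbols beyond those defined above):
    $$lrd_k^{(d)}(p) = \left( \frac{1}{|\mathcal{N}_k(p)|} \sum_{o \in \mathcal{N}_k(p)} reach\text{-}dist_k^{(d)}(p, o) \right)^{-1}, \qquad LOF_k^{(d)}(p) = \frac{1}{|\mathcal{N}_k(p)|} \sum_{o \in \mathcal{N}_k(p)} \frac{lrd_k^{(d)}(o)}{lrd_k^{(d)}(p)}$$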

Complexity:

  • Space: O(N×D) to store training data.
  • Time:
    • First call (with cutoffs): O(N×k×D) to compute statistical bands across the full training set.
    • Subsequent calls: O(k×D) per explanation. Results are cached (self._cached_1d_lof_scores), making interactive exploration nearly instant after the initial computation.
  • Safety: Emits a ResourceWarning, suggesting that compute_cutoffs be disabled, when computing cutoffs on very large datasets (>1GB estimated memory).
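
A rough sketch of how such a guard might look (the memory heuristic, threshold, and helper name are illustrative, not the exact code in this PR):

import warnings

def _warn_if_cutoffs_expensive(n_samples, n_features, limit_bytes=1 << 30):
    """Illustrative guard: warn when computing cutoff bands is likely to be heavy."""
    # Rough float64 estimate for the per-dimension distance work across the training set.
    estimated_bytes = n_samples * n_samples * n_features * 8
    if estimated_bytes > limit_bytes:
        warnings.warn(
            f"Estimated ~{estimated_bytes / 1e9:.1f} GB to compute cutoff bands; "
            "consider disabling compute_cutoffs.",
            ResourceWarning,
        )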

Testing

  • Unit Tests: Added test_get_outlier_explainability_scores in pyod/test/test_lof.py.
  • Manual Validation: Tested against synthetic 2D cases (Inlier, X-Outlier, Y-Outlier, Sparse Cluster Inlier).
  • Quantitative Evaluation: Performed a perturbation test on the Pima Indians Diabetes dataset (see below).

Quantitative Evaluation (Perturbation Test)

To validate that the features identified by this method are truly responsible for the anomaly, we conducted a perturbation test on the top 20 outliers of the Pima dataset.

  • Method: We identified the top feature via 1D LOF, removed it, and re-calculated the score, then compared this to removing a random feature (see the sketch after this list).
  • Result: Removing the explained feature reduced the LOF score by an average of 0.5737, while removing a random feature reduced it by only 0.0433.
  • Significance: This difference is statistically significant ($p < 0.001$), confirming the explanation method correctly identifies causal features.
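
A condensed sketch of the perturbation protocol referenced above (helper names and the refit-based scoring are illustrative; the evaluation script itself is not part of this PR):

import numpy as np
from pyod.models.lof import LOF

def lof_score_of(X, row_idx, n_neighbors=20):
    """Fit LOF on X and return the training decision score of one row."""
    return LOF(n_neighbors=n_neighbors).fit(X).decision_scores_[row_idx]

def score_drop_after_removal(X, row_idx, drop_dim, n_neighbors=20):
    """How much the row's LOF score falls when one feature column is removed."""
    full = lof_score_of(X, row_idx, n_neighbors)
    reduced = lof_score_of(np.delete(X, drop_dim, axis=1), row_idx, n_neighbors)
    return full - reduced

# For each top outlier, compare dropping the feature ranked first by the 1D LOF
# explanation against dropping a randomly chosen feature, then average the drops.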

Research Foundation

This implementation is based on the framework proposed in:

Krenmayr, Lucas and Goldstein, Markus (2023). "Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders." In 15th International Conference on Agents and Artificial Intelligence (ICAART).

BibTeX:

@inproceedings{Lucas2023xodknn,
  author = {Krenmayr, Lucas and Goldstein, Markus},
  year = {2023},
  month = {02},
  pages = {245-253},
  title = {Explainable Outlier Detection Using Feature Ranking for k-Nearest Neighbors, Gaussian Mixture Model and Autoencoders},
  doi = {10.5220/0011631900003411}
}

Screenshots/Examples

2D Validation Examples

Figure 1: 2D LOF Inlier
2d_xlof_inlier

Standard inlier in a dense cluster.

Figure 2: 2D LOF X-Dimension Outlier
2d_xlof_xoutlier

Point is an outlier in X (density contrast) but normal in Y.

Figure 3: 2D LOF Y-Dimension Outlier
2d_xlof_youtlier

Point is an outlier in Y (density contrast) but normal in X.

Figure 4: 2D LOF Outlier
2d_xlof_outlier

Standard outlier relative to a dense cluster.

Figure 5: 2D LOF Sparse Inlier
2d_xlof_sparse_inlier

A critical case for LOF: this point is far from others (a KNN outlier) but fits the density of its local sparse cluster (an LOF inlier). The explanation correctly shows low scores across dimensions.

Real-World Dataset (Pima Indians Diabetes)

Note: We performed Min-Max Scaling on the dataset prior to generating these examples.

Figure 6: Pima LOF Outlier 1
pima_lof_outlier1
Top outlier driven by specific density deviations.

Figure 7: Pima LOF Outlier 2
pima_lof_outlier2

Figure 8: Pima LOF Inlier
pima_lof_inlier
Normal sample showing low dimensional LOF scores.

Checklist

All Submissions Basics:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you checked all Issues to tie the PR to a specific one?

All Submissions Cores:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
    • Added unit test test_get_outlier_explainability_scores in test_lof.py.
    • Visualization methods use # pragma: no cover.
  • Have you successfully run tests with your changes locally?
    • All LOF tests pass.
    • Example script runs successfully.
  • Does your submission pass tests, including CircleCI, Travis CI, and AppVeyor?
  • Does your submission have appropriate code coverage?
    • Core logic covered.
    • Visualization excluded via pragma.

Files Changed

  • pyod/models/lof.py - Added explainability methods
  • examples/lof_interpretability.py - New example file
  • pyod/test/test_lof.py - Added unit test for get_outlier_explainability_scores() method

Note to Reviewer: This PR builds upon the explainability effort started in PR #652 (KNN Explainability).
