Add rrcf outlier #846

Merged
ianhelle merged 18 commits into microsoft:main from Tatsuya-hasegawa:add_rrcf_outlier
Jun 6, 2025

Conversation

@Tatsuya-hasegawa
Contributor

Hello Ian,

I implemented a Robust Random Cut Forest (RRCF) class and the corresponding outlier function.
As you know, RRCF is more useful for detecting anomalies in time series data.

The implementation imports the Robust Random Cut Forest Python library:
MIT License Copyright (c) 2018 kLabUM
https://klabum.github.io/rrcf/

Best regards,
Tatsuya

@Tatsuya-hasegawa
Contributor Author

Tatsuya-hasegawa commented Apr 23, 2025

The PoC notebooks are located in https://github.com/Tatsuya-hasegawa/MSTICPy_utils/tree/main/analysis_rrcf_outliers

RRCF can also use the "from msticpy.analysis.outliers import plot_outlier_results" function; however, it is very slow with RRCF.
On the other hand, I think RRCF's anomaly detection accuracy for time series data makes it worthwhile.

For instance, here is a DNS traffic anomaly case.

The Isolation Forest result is shown below.
[Image: IsolationForest]

The Robust Random Cut Forest result, shown below, is smarter.
[Image: RRCF]

Since scikit-learn does not support Robust Random Cut Forest, I implemented RRCF in msticpy in plain Python rather than as natively compiled code.
Although it is slower, it is certainly a better anomaly detection algorithm for time series event data.

Contributor

@ianhelle ianhelle left a comment

I'll have to trust your expertise on how this works and is implemented (which I do).
You might want to include one of the notebooks and add it to the notebooks folder.

I have a couple of lightweight comments - more about format than functionality.

@Tatsuya-hasegawa
Contributor Author

Hi, thanks for all your advice.
I think I have addressed all of your points except for adding one of the notebooks to the notebooks folder.
I will do that in a separate PR when I have time.
Best regards,

@Tatsuya-hasegawa Tatsuya-hasegawa requested a review from ianhelle May 23, 2025 21:22
@ianhelle
Contributor

Hi Tatsuya,
There are a bunch of typing warnings from mypy for outliers. In many cases it may be as simple as declaring params as int | float.
Use:

from __future__ import annotations

at the top of the imports to ensure that Py3.8 supports this syntax.

Also - not a hard requirement - if you have any documentation to add about this (now impressive) module, it would def make it more visible to others. Also if you want to publish an article somewhere (we have a msticpy blog that hasn't been used for ages), we can include that in the release and linkedin/X posts.
Or just include the notebooks in the PR.
Great stuff, thx for the submission.

@Tatsuya-hasegawa Tatsuya-hasegawa force-pushed the add_rrcf_outlier branch 2 times, most recently from c503721 to f237dbc Compare May 26, 2025 23:22
@Tatsuya-hasegawa
Contributor Author

Hi Ian,

Thank you for the advice.
I fixed the typing warnings, and my local checks passed.
FYI, I accidentally did a force push, but I changed it back, pulled, and then pushed again.

(base) hacket@hackeTlab msticpy % git commit -m "fixed CI/CD errors"  
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
black....................................................................Passed
pylint...................................................................Passed
flake8...................................................................Passed
isort (python)...........................................................Passed
pydocstyle...............................................................Passed
ruff.....................................................................Passed
check_reqs_all...........................................................Passed
[add_rrcf_outlier c5037219] fixed CI/CD errors
 1 file changed, 19 insertions(+), 16 deletions(-)

On the other hand, I'm not sure about compatibility with Python 3.8.
After some research with AI, I found that the type OR operator (short for Union), as in "int | str", which msticpy is trying to migrate to, is syntax introduced in Python 3.10, and it doesn't seem to be supported in 3.8 even if you use from __future__ import annotations.
So, please run the CI/CD workflow again to check it.

Also, regarding the notebook documentation and article, I'm preparing notebooks for both the current IsolationForest outlier function and this RRCF one. I'm examining how the results differ on the same datasets.

For time series data, Isolation Forest is known to be effective at detecting simple spikes, while RRCF is effective at detecting trend changes and correlations in multidimensional features. For non-time series data, Isolation Forest is also found to be faster and more accurate.

@ianhelle
Contributor

On the other hand, I'm not sure about compatibility with Python 3.8. After some research with AI, I found that the type OR operator (short for Union), as in "int | str", which msticpy is trying to migrate to, is syntax introduced in Python 3.10, and it doesn't seem to be supported in 3.8 even if you use from __future__ import annotations. So, please run the CI/CD workflow again to check it.

This should work fine - we're using future annotations in multiple places. Am building now.
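
To make the distinction in this exchange concrete, here is a minimal sketch of where the `|` syntax is and is not safe on Python 3.8. With the future import, annotations are stored as strings and are not evaluated when the function is defined, so `int | float` is accepted there. Expressions evaluated at runtime, such as the first argument to `typing.cast`, still need `typing.Union` or a string on 3.8 and 3.9. The names `scale`, `to_number`, and `Number` are hypothetical and do not come from msticpy.

```python
from __future__ import annotations  # must be the first statement in the module

from typing import Union, cast


def scale(value: int | float, factor: int | float = 2) -> float:
    """Multiply two numbers; the annotations are kept as unevaluated strings."""
    return float(value * factor)


# Outside annotations the expression is evaluated at runtime, so writing
# `int | float` here would raise TypeError on Python 3.8/3.9.
Number = Union[int, float]


def to_number(raw: object) -> int | float:
    """Tell the type checker that `raw` is numeric without converting it."""
    return cast(Number, raw)  # cast("int | float", raw) is also safe on 3.8
```

Inspecting `scale.__annotations__` shows the string `"int | float"` rather than a type object, which is why the interpreter version does not matter for the annotations themselves.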

@Tatsuya-hasegawa
Contributor Author

Tatsuya-hasegawa commented May 27, 2025

OMG, I'm sorry.
Six typing errors still remain. I'll fix them the day after tomorrow and also add the Jupyter notebook documents.

In addition, I'll add the joblib package to the pip requirements, the same as the rrcf package.
Missing packages:
joblib:analysis.outliers.py

@ianhelle
Contributor

No worries, take your time. :-)

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Tatsuya-hasegawa
Contributor Author

Hi Ian,

I have finished my tasks and added some Jupyter notebooks!

Best regards,

(base) hacket@hackeTlab msticpy % git commit -m "fixed the rest typing errors and add jupyter notebooks for comparing the outlier algorisms"
check yaml...........................................(no files to check)Skipped
check json...............................................................Passed
trim trailing whitespace.................................................Passed
black....................................................................Passed
pylint...................................................................Passed
flake8...................................................................Passed
isort (python)...........................................................Passed
pydocstyle...............................................................Passed
ruff.....................................................................Passed
check_reqs_all...........................................................Passed
[add_rrcf_outlier 2d66adab] fixed the rest typing errors and add jupyter notebooks for comparing the outlier algorisms
 8 files changed, 5673 insertions(+), 6 deletions(-)
 create mode 100644 docs/notebooks/Outliers-IsolationForest.ipynb
 create mode 100644 docs/notebooks/Outliers-IsolationForest_timeseries.ipynb
 create mode 100644 docs/notebooks/Outliers-RobustRandomCutForest.ipynb
 create mode 100644 docs/notebooks/Outliers-RobustRandomCutForest_timeseries.ipynb

@Tatsuya-hasegawa
Contributor Author

Weird...
What's that? Many of the errors are probably out of scope.

@ianhelle
Contributor

ianhelle commented Jun 4, 2025

Yeah - looks like something changed with maybe a mypy update and possibly a respx update.
I will check out your PR and fix these.

@ianhelle ianhelle merged commit 05806aa into microsoft:main Jun 6, 2025
10 checks passed
raj-axe pushed a commit to raj-axe/msticpy that referenced this pull request Aug 11, 2025
* add robust_random_cut_forest to outliers

* modified docstrings, typing to builtin and rrcf module install

* fixed max_samples parameter pass to RRCF class

* fixed CI/CD errors

* fixed the rest typing errors and add jupyter notebooks for comparing the outlier algorisms

* Fixing some mypy and test errors

* Fixing and/or supressing mypy warnings - I think it doesn't have a good understanding of numpy

* pylint fixes

* Fixing type annotation in cast for Py3.8 in sentinel_utils.py

* Fixing version checking logic

* Fixing typo in nbinit

---------

Co-authored-by: Ian Hellen <ianhelle@microsoft.com>

2 participants