Correlating Anomalies via Temporal Overlap Similarity #1641

kaituo merged 2 commits into opensearch-project:main
Conversation
Codecov Report ❌

Additional details and impacted files:

@@ Coverage Diff @@
## main #1641 +/- ##
============================================
- Coverage 81.36% 81.29% -0.08%
- Complexity 6151 6237 +86
============================================
Files 542 544 +2
Lines 24986 25323 +337
Branches 2543 2621 +78
============================================
+ Hits 20331 20587 +256
- Misses 3383 3424 +41
- Partials 1272 1312 +40
Should we disallow grouping anomalies from the same detector if they overlap or are adjacent?

It seems like the current implementation uses a brute-force pairwise comparison in nested loops, comparing every anomaly with every other anomaly, so the time complexity is O(n^2). Could we sort anomalies by start time and only compare active overlapping intervals, making it more efficient for large datasets?
We don't allow grouping anomalies from the same entity (e.g., the same model id). But grouping anomalies from the same detector is allowed since we may have high-cardinality detectors. Also, for single-stream detectors, if anomalies are adjacent, it may make sense to combine them.

Changed to use active overlapping intervals. See diff: https://github.com/opensearch-project/anomaly-detection/compare/fca7de0c05300a4322cff6d97c8644fe85df5d0b..82a03a2bf142310a23a8b9a1f662e739e3c35cd5
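The active-interval idea suggested above can be sketched as follows. This is an illustrative sweep-line version, not the PR's actual code: sort anomalies by start time and compare each one only against intervals still "active" at its (dilated) start, instead of all O(n^2) pairs. The `Interval` record and `delta` parameter are stand-ins for the real classes.

```java
import java.util.*;

public class SweepLinePairs {
    record Interval(String id, long start, long end) {}

    // Returns id pairs whose delta-dilated intervals overlap.
    static List<String[]> overlappingPairs(List<Interval> input, long delta) {
        List<Interval> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingLong(Interval::start));
        List<String[]> pairs = new ArrayList<>();
        // Active set: intervals whose dilated end has not yet passed the sweep point.
        Deque<Interval> active = new ArrayDeque<>();
        for (Interval cur : sorted) {
            // Drop intervals that can no longer overlap cur, even after dilation.
            active.removeIf(a -> a.end() + delta < cur.start() - delta);
            for (Interval a : active) {
                pairs.add(new String[] { a.id(), cur.id() });
            }
            active.add(cur);
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Interval> xs = List.of(
            new Interval("a", 0, 10),
            new Interval("b", 5, 15),    // overlaps a
            new Interval("c", 100, 110)  // far away, pairs with nothing
        );
        System.out.println(overlappingPairs(xs, 2).size()); // prints 1
    }
}
```

With anomalies typically clustered in time, the active set stays small, so the cost approaches O(n log n) for the sort plus output size.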
I mean more like excluding groups that only have one entity in them. Results with only one entity are not presentable to customers on the dashboard. Should we stop generating those?

Added a parameter to exclude groups that only have one entity: https://github.com/opensearch-project/anomaly-detection/compare/82a03a2bf142310a23a8b9a1f662e739e3c35cd5..31ed352793d65e8bbcbe219da48bbe76ddce76f2
    LinkedHashMap<String, Anomaly> deduped = new LinkedHashMap<>();
    for (Anomaly anomaly : anomalies) {
        Objects.requireNonNull(anomaly, "anomaly");
        String id = Objects.requireNonNull(anomaly.getId(), "anomaly.id");
Will anomaly id always be unique? Should we dedupe the input list by detector id, entity, start, and end before calling AnomalyCorrelation?
Anomaly id is the model id, which is unique. I will rename id to model id to be explicit. Yes, I can dedupe by start and end; model id uniquely determines detector id and entity.
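The dedupe discussed here could look like the following sketch: keep the first anomaly per (modelId, start, end) key in insertion order, relying on model id determining detector id and entity. The `Anomaly` record and method names are illustrative stand-ins, not the PR's actual classes.

```java
import java.util.*;

public class DedupeSketch {
    record Anomaly(String modelId, long start, long end) {}

    static List<Anomaly> dedupe(List<Anomaly> anomalies) {
        // LinkedHashMap preserves stable insertion order, matching the PR.
        LinkedHashMap<List<Object>, Anomaly> seen = new LinkedHashMap<>();
        for (Anomaly a : anomalies) {
            Objects.requireNonNull(a, "anomaly");
            // Model id uniquely determines detector id and entity, so
            // (modelId, start, end) identifies a duplicate record.
            seen.putIfAbsent(List.of(a.modelId(), a.start(), a.end()), a);
        }
        return new ArrayList<>(seen.values());
    }

    public static void main(String[] args) {
        List<Anomaly> in = List.of(
            new Anomaly("m1", 0, 10),
            new Anomaly("m1", 0, 10), // duplicate, dropped
            new Anomaly("m2", 5, 15)
        );
        System.out.println(dedupe(in).size()); // prints 2
    }
}
```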
After running the algorithm on different data, I noticed the clustering results tend to be fragmented: multiple clusters, each with a small number (usually 3-5) of entities. The time gaps between clusters are usually 30-60 minutes. Is there any improvement we can make to bridge across short quiet periods?

Added more dilation at the start of an anomaly so that if two anomalies don't overlap (even after the base dilation), they still have a chance to be correlated: b3eb632
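The bridging idea from this exchange can be illustrated with a minimal sketch (parameter names like `bridgeMs` are assumptions, not the PR's): widening each interval's start by an extra bridge gap lets two anomalies separated by a short quiet period still count as overlapping.

```java
public class BridgeDilation {
    // True iff [s1,e1] and [s2,e2] overlap once each start is pulled back by bridgeMs.
    static boolean overlapsWithBridge(long s1, long e1, long s2, long e2, long bridgeMs) {
        return Math.max(s1 - bridgeMs, s2 - bridgeMs) < Math.min(e1, e2);
    }

    public static void main(String[] args) {
        long min = 60_000L; // one minute in milliseconds
        // Two anomalies with a 40-minute quiet gap between them:
        // no raw overlap, but a 45-minute bridge lets them correlate.
        System.out.println(overlapsWithBridge(0, 10 * min, 50 * min, 60 * min, 0));        // prints false
        System.out.println(overlapsWithBridge(0, 10 * min, 50 * min, 60 * min, 45 * min)); // prints true
    }
}
```

Only the starts are widened here, so the bridge closes a gap that follows an anomaly without inflating how long any single anomaly appears to last.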
OpenSearch anomalies such as service degradation, job delays, and incident bursts are represented as time intervals, not isolated points. If two detectors fire on the same incident, their anomaly intervals will substantially overlap in time (perhaps with a little timestamp jitter due to differing detection intervals, detector start times, and causal relationships). Our similarity therefore measures:
* how much the time windows overlap (after a small tolerance δ to account for jitter),
* optionally, whether the durations are consistent.
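The two overlap terms above can be sketched as follows. This is a minimal illustration of dilated IoU (Jaccard over time) and the overlap coefficient, not the PR's exact code; method names are assumptions.

```java
public class TemporalSimilarity {
    // Intersection-over-union of [s1,e1] and [s2,e2], each dilated by +/- delta.
    static double iou(long s1, long e1, long s2, long e2, long delta) {
        long a1 = s1 - delta, b1 = e1 + delta, a2 = s2 - delta, b2 = e2 + delta;
        long inter = Math.max(0, Math.min(b1, b2) - Math.max(a1, a2));
        long union = (b1 - a1) + (b2 - a2) - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    // Overlap coefficient: intersection / shorter dilated length.
    // Close to 1 when one interval is (nearly) contained in the other.
    static double overlapCoeff(long s1, long e1, long s2, long e2, long delta) {
        long a1 = s1 - delta, b1 = e1 + delta, a2 = s2 - delta, b2 = e2 + delta;
        long inter = Math.max(0, Math.min(b1, b2) - Math.max(a1, a2));
        long minLen = Math.min(b1 - a1, b2 - a2);
        return minLen == 0 ? 0.0 : (double) inter / minLen;
    }

    public static void main(String[] args) {
        // [0,10] vs [2,8] with delta=0: intersection 6, union 10.
        System.out.println(iou(0, 10, 2, 8, 0));          // prints 0.6
        System.out.println(overlapCoeff(0, 10, 2, 8, 0)); // prints 1.0 (containment)
    }
}
```

IoU punishes a short anomaly contained in a long one, which is why a containment-aware overlap coefficient is worth computing alongside it.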
This PR implements a threshold graph plus connected-components clustering based on this similarity.
Major algorithm:
- De-dupe input anomalies by id (stable insertion order).
- For every pair (i,j):
- Dilate both time intervals by ±delta to tolerate bucket alignment drift.
- Require dilated overlap >= minOverlap (cheap early filter).
- Compute temporal overlap:
- IoU (Jaccard over time) on dilated intervals
- Overlap coefficient (overlap / min(lenA,lenB)) for containment cases
- Detect strong containment (ovl >= tauContain and duration ratio <= rhoMax).
- Pick temporal term by mode:
- IOU: use IoU
- OVL: use overlap coefficient
- HYBRID: if strong containment, blend ((1-lam)*IoU + lam*OVL); else use IoU
- Compute duration penalty exp(-|durA-durB|/kappa).
- If strong containment, relax the penalty via pow(basePen, containmentRelax)
(or disable penalty entirely when containmentRelax == 0).
- Similarity = temporalTerm * penalty; add an undirected edge if similarity >= alpha.
- Run DFS connected-components on the threshold graph to form clusters.
- Output deterministically: sort members in each cluster by anomaly id.
- Attach an event window per cluster as [min(start), max(end)] across its members.
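The final two steps above (threshold graph, DFS connected components, deterministic member order) can be sketched like this. It is an illustrative standalone version, not the PR's code; the similarity matrix is assumed to be precomputed by the pairwise steps.

```java
import java.util.*;

public class ThresholdGraphClusters {
    // ids[i] is the anomaly (model) id for row/column i of sim.
    static List<List<String>> cluster(List<String> ids, double[][] sim, double alpha) {
        int n = ids.size();
        List<List<Integer>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        // Add an undirected edge whenever similarity clears the threshold alpha.
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (sim[i][j] >= alpha) { adj.get(i).add(j); adj.get(j).add(i); }
        boolean[] seen = new boolean[n];
        List<List<String>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (seen[i]) continue;
            // Iterative DFS over one connected component.
            List<String> members = new ArrayList<>();
            Deque<Integer> stack = new ArrayDeque<>(List.of(i));
            seen[i] = true;
            while (!stack.isEmpty()) {
                int u = stack.pop();
                members.add(ids.get(u));
                for (int v : adj.get(u)) if (!seen[v]) { seen[v] = true; stack.push(v); }
            }
            Collections.sort(members); // deterministic order within each cluster
            clusters.add(members);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] sim = {
            { 1.0, 0.9, 0.0 },
            { 0.9, 1.0, 0.0 },
            { 0.0, 0.0, 1.0 }
        };
        // m1 and m2 connect (0.9 >= 0.5); m3 forms a singleton cluster.
        System.out.println(cluster(List.of("m1", "m2", "m3"), sim, 0.5));
        // prints [[m1, m2], [m3]]
    }
}
```

A cluster's event window would then be [min(start), max(end)] over its members, and singleton clusters can be dropped when the exclude-single-entity parameter is set.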
Testing done:
1. UT
2. Tests on real world data
Signed-off-by: Kaituo Li <kaituo@amazon.com>
Signed-off-by: kaituo <kaituo@amazon.com>
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
- Commits are signed per the DCO using --signoff. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.