
Correlating Anomalies via Temporal Overlap Similarity #1641

Merged
kaituo merged 2 commits into opensearch-project:main from kaituo:correlation
Jan 22, 2026

Conversation

@kaituo
Collaborator

@kaituo kaituo commented Jan 6, 2026

Description

OpenSearch anomalies such as service degradation, job delays, and incident bursts are represented as time intervals, not isolated points. If two detectors fire on the same incident, their anomaly intervals will substantially overlap in time (possibly with some timestamp jitter due to differing detection intervals, detector start times, and causal relationships). Our similarity therefore measures:

  • how much the time windows overlap (after a small tolerance δ to account for jitter),
  • optionally, whether the duration is consistent.

This PR implements clustering via a threshold graph plus connected components over the pairwise similarity.

Major algorithm:

  • De-dupe input anomalies by id (stable insertion order).
  • For every pair (i,j):
    • Dilate both time intervals by ±delta to tolerate bucket alignment drift.
    • Require dilated overlap >= minOverlap (cheap early filter).
    • Compute temporal overlap:
      • IoU (Jaccard over time) on dilated intervals
      • Overlap coefficient (overlap / min(lenA,lenB)) for containment cases
    • Detect strong containment (ovl >= tauContain and duration ratio <= rhoMax).
    • Pick temporal term by mode:
      • IOU: use IoU
      • OVL: use overlap coefficient
      • HYBRID: if strong containment, blend ((1-lam)*IoU + lam*OVL); else use IoU
    • Compute duration penalty exp(-|durA-durB|/kappa).
      • If strong containment, relax the penalty via pow(basePen, containmentRelax) (or disable penalty entirely when containmentRelax == 0).
    • Similarity = temporalTerm * penalty; add an undirected edge if similarity >= alpha.
  • Run DFS connected-components on the threshold graph to form clusters.
  • Output deterministically: sort members in each cluster by anomaly id.
  • Attach an event window per cluster as [min(start), max(end)] across its members.

Testing done:

  1. UT
  2. Tests on real world data

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@opensearch-trigger-bot opensearch-trigger-bot bot added documentation Improvements or additions to documentation backport 2.x labels Jan 6, 2026
@kaituo kaituo added feature new feature and removed documentation Improvements or additions to documentation labels Jan 6, 2026
@kaituo kaituo force-pushed the correlation branch 2 times, most recently from 09dad76 to 53bebcc on January 6, 2026 at 20:51
@codecov

codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 75.07418% with 84 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.29%. Comparing base (0199c1c) to head (b3eb632).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../opensearch/ad/correlation/AnomalyCorrelation.java 74.36% 44 Missing and 37 partials ⚠️
...in/java/org/opensearch/ad/correlation/Anomaly.java 85.71% 1 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##               main    #1641      +/-   ##
============================================
- Coverage     81.36%   81.29%   -0.08%     
- Complexity     6151     6237      +86     
============================================
  Files           542      544       +2     
  Lines         24986    25323     +337     
  Branches       2543     2621      +78     
============================================
+ Hits          20331    20587     +256     
- Misses         3383     3424      +41     
- Partials       1272     1312      +40     
Flag Coverage Δ
plugin 81.29% <75.07%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...in/java/org/opensearch/ad/correlation/Anomaly.java 85.71% <85.71%> (ø)
.../opensearch/ad/correlation/AnomalyCorrelation.java 74.36% <74.36%> (ø)

... and 13 files with indirect coverage changes


@jackiehanyang
Collaborator

should we not allow grouping anomalies from the same detector if they overlap or are adjacent?

@jackiehanyang
Collaborator

It seems like the current implementation uses brute-force pairwise comparison in nested loops, comparing every anomaly with every other anomaly, so the time complexity is O(n^2). Could we sort anomalies by start time and only compare actively overlapping intervals, making it more efficient for large datasets?

@kaituo
Collaborator Author

kaituo commented Jan 8, 2026

should we not allow grouping anomalies from the same detector if they overlap or are adjacent?

We don't allow grouping anomalies from the same entity (e.g., model id). But grouping anomalies from the same detector is allowed, since we may have high-cardinality detectors. Also, for single-stream detectors, if anomalies are adjacent, it may make sense to combine them.

@kaituo
Collaborator Author

kaituo commented Jan 8, 2026

It seems like the current implementation uses brute-force pairwise comparison in nested loops, comparing every anomaly with every other anomaly, so the time complexity is O(n^2). Could we sort anomalies by start time and only compare actively overlapping intervals, making it more efficient for large datasets?

Changed to use active overlapping intervals. See diff: https://github.com/opensearch-project/anomaly-detection/compare/fca7de0c05300a4322cff6d97c8644fe85df5d0b..82a03a2bf142310a23a8b9a1f662e739e3c35cd5
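The active-interval sweep can be sketched as below: sort by start time, then compare each anomaly only against a working set of intervals whose dilated end has not yet passed the current dilated start. `Interval`, `candidatePairs`, and `DELTA` are stand-in names for illustration, not the committed API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the sweep: O(n log n) sort plus work proportional to the number
// of overlapping candidate pairs, instead of all n^2 pairs.
public final class SweepSketch {
    record Interval(String id, long start, long end) {}

    static final long DELTA = 60_000L; // same dilation tolerance as the similarity step

    /** Returns id pairs whose dilated intervals overlap, each pair emitted once. */
    static List<String[]> candidatePairs(List<Interval> in) {
        List<Interval> sorted = new ArrayList<>(in);
        sorted.sort(Comparator.comparingLong(Interval::start));
        List<Interval> active = new ArrayList<>();
        List<String[]> pairs = new ArrayList<>();
        for (Interval cur : sorted) {
            // Drop intervals that can no longer overlap cur (or anything after it,
            // since later intervals start even later).
            active.removeIf(a -> a.end() + DELTA < cur.start() - DELTA);
            for (Interval a : active) {
                pairs.add(new String[] { a.id(), cur.id() });
            }
            active.add(cur);
        }
        return pairs;
    }
}
```

Each surviving candidate pair would then go through the full similarity computation; the expired intervals are never revisited.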

@jackiehanyang
Collaborator

jackiehanyang commented Jan 12, 2026

should we not allow grouping anomalies from the same detector if they overlap or are adjacent?

We don't allow grouping anomalies from the same entity (e.g., model id). But grouping anomalies from the same detector is allowed, since we may have high-cardinality detectors. Also, for single-stream detectors, if anomalies are adjacent, it may make sense to combine them.

I mean more like excluding groups that only have one entity in them. Results that contain only one entity are not presentable on the dashboard to customers. Should we stop generating those?

@kaituo
Collaborator Author

kaituo commented Jan 12, 2026

should we not allow grouping anomalies from the same detector if they overlap or are adjacent?

We don't allow grouping anomalies from the same entity (e.g., model id). But grouping anomalies from the same detector is allowed, since we may have high-cardinality detectors. Also, for single-stream detectors, if anomalies are adjacent, it may make sense to combine them.

I mean more like excluding groups that only have one entity in them. Results that contain only one entity are not presentable on the dashboard to customers. Should we stop generating those?

Added a parameter to exclude groups that only have one entity in them: https://github.com/opensearch-project/anomaly-detection/compare/82a03a2bf142310a23a8b9a1f662e739e3c35cd5..31ed352793d65e8bbcbe219da48bbe76ddce76f2
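The single-entity filter could look roughly like this; `Cluster`, `Member`, and `entityId` are illustrative stand-ins, not the names used in the actual change:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of a post-filter: drop clusters whose members all share one entity,
// since a single-entity group is not presentable on the dashboard.
public final class SingletonFilterSketch {
    record Member(String entityId) {}
    record Cluster(List<Member> members) {}

    static List<Cluster> dropSingleEntityClusters(List<Cluster> clusters) {
        return clusters.stream()
            // Keep a cluster only if its members span more than one distinct entity.
            .filter(c -> c.members().stream().map(Member::entityId).distinct().count() > 1)
            .collect(Collectors.toList());
    }
}
```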

LinkedHashMap<String, Anomaly> deduped = new LinkedHashMap<>();
for (Anomaly anomaly : anomalies) {
    Objects.requireNonNull(anomaly, "anomaly");
    String id = Objects.requireNonNull(anomaly.getId(), "anomaly.id");
Collaborator

Will the anomaly id always be unique? Should we dedupe the input list by detector id, entity, start, and end before calling AnomalyCorrelation?

Collaborator Author

The anomaly id is the model id, which is unique. I will rename id to model id to be explicit. Yes, I can dedupe by start and end; the model id uniquely determines the detector id and entity.

@jackiehanyang
Collaborator

jackiehanyang commented Jan 15, 2026

After running the algorithm on different data, I noticed the clustering results tend to be fragmented: multiple clusters, each with a small number (usually 3-5) of entities. The time gaps between clusters are usually 30-60 minutes. Is there any improvement we can make to bridge across short quiet periods?

@kaituo
Collaborator Author

kaituo commented Jan 21, 2026

After running the algorithm on different data, I noticed the clustering results tend to be fragmented: multiple clusters, each with a small number (usually 3-5) of entities. The time gaps between clusters are usually 30-60 minutes. Is there any improvement we can make to bridge across short quiet periods?

Added more dilation at the start of an anomaly so that two anomalies that don't overlap (even after the original dilation) still have a chance to be correlated: b3eb632
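The idea of an asymmetric dilation can be sketched as below; `startDilation` and `endDilation` are illustrative parameter names, not the committed ones:

```java
// Sketch: dilate the *start* of each interval by a larger amount than the end,
// so near-miss anomalies separated by a short quiet period can still overlap.
public final class StartDilationSketch {
    static boolean dilatedOverlap(long aStart, long aEnd, long bStart, long bEnd,
                                  long startDilation, long endDilation) {
        long as = aStart - startDilation, ae = aEnd + endDilation;
        long bs = bStart - startDilation, be = bEnd + endDilation;
        // Positive only when the dilated intervals intersect.
        return Math.min(ae, be) - Math.max(as, bs) > 0;
    }
}
```

For example, a 40-minute gap between two 10-minute anomalies is bridged by a 45-minute start dilation, but not by a symmetric 1-minute dilation.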

Signed-off-by: Kaituo Li <kaituo@amazon.com>
Signed-off-by: kaituo <kaituo@amazon.com>
@kaituo kaituo merged commit ab3d82f into opensearch-project:main Jan 22, 2026
30 checks passed

Labels

feature new feature
