@natea natea commented Aug 15, 2025

Summary

This PR fixes issue #107873 by adding retry logic to the checkForMissingDataIfNecessary method in DatafeedJob. The change improves resilience when checking for missing data encounters transient failures.

Problem

The datafeed's missing data check could fail due to transient issues (network problems, temporary unavailability of indices, etc.), causing the entire check to be skipped. This meant that delayed data detection would not work properly during these transient failures.

Solution

Implemented a retry mechanism with exponential backoff that:

  • Retries up to 3 times when detectMissingData() throws an exception
  • Uses exponential backoff delays (100ms, 200ms, 400ms) between retries
  • Logs warnings on each retry attempt for observability
  • Issues an audit warning if all retries fail
  • Allows the datafeed to continue operating even if the missing data check ultimately fails
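The behaviour listed above can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the code added to DatafeedJob.java: the class name `MissingDataRetrySketch`, the `MissingDataCheck` interface, the injectable `sleeper`, and the `warnings` list (standing in for the logger and auditor) are all assumptions made for the example. Only the retry count and the delays come from the description.

```java
import java.util.List;
import java.util.function.LongConsumer;

/** Hypothetical sketch of retry with exponential backoff; not the actual DatafeedJob code. */
final class MissingDataRetrySketch {
    static final int MAX_RETRIES = 3;            // "up to 3 times" per the description
    static final long INITIAL_BACKOFF_MS = 100L; // doubles each retry: 100ms, 200ms, 400ms

    /** Stand-in for the real detectMissingData() call (assumed shape). */
    interface MissingDataCheck {
        void run() throws Exception;
    }

    /**
     * Runs the check once, then retries up to MAX_RETRIES times with doubling delays.
     * Returns true if any attempt succeeded, false if every attempt failed.
     * Exceptions are never rethrown, so the caller can carry on either way.
     */
    static boolean checkWithRetry(MissingDataCheck check, LongConsumer sleeper, List<String> warnings) {
        long delayMs = INITIAL_BACKOFF_MS;
        for (int attempt = 0; attempt <= MAX_RETRIES; attempt++) {
            try {
                check.run();
                return true;
            } catch (Exception e) {
                if (attempt == MAX_RETRIES) {
                    // All retries used up: record one final (audit-style) warning and give up.
                    warnings.add("missing data check failed after " + MAX_RETRIES + " retries: " + e.getMessage());
                    return false;
                }
                // Record a warning for this retry, wait, then double the delay.
                warnings.add("retry " + (attempt + 1) + " of " + MAX_RETRIES + " in " + delayMs + "ms: " + e.getMessage());
                sleeper.accept(delayMs);
                delayMs *= 2;
            }
        }
        return false; // not reached in practice; required by the compiler
    }
}
```

Passing the sleeper as a parameter is a choice made for this sketch so that the delays can be observed without real waiting; a production caller would pass something that wraps `Thread.sleep` and restores the interrupt flag on `InterruptedException`.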

Changes

  • DatafeedJob.java: Added retry logic with exponential backoff to the checkForMissingDataIfNecessary() method
  • DatafeedJobRetryTests.java: Added comprehensive unit tests covering:
    • Successful retry after initial failures
    • Failure after exhausting all retries
    • Immediate success without retries
    • Exponential backoff delay verification

Testing

Added unit tests that verify:

  • ✅ Retry mechanism triggers on failure
  • ✅ Exponential backoff delays are applied correctly
  • ✅ Appropriate warnings are logged
  • ✅ Datafeed continues operation even if all retries fail

Related Issues

Fixes #107873

Checklist

  • I have signed the Contributor License Agreement
  • I have run the tests locally and they pass
  • I have added tests that prove my fix is effective
  • My changes follow the existing code style
  • I have added appropriate logging for observability

Notes

The retry mechanism is deliberately conservative: at most 3 retries with short backoff delays, so the datafeed operation is not significantly delayed. The retries add at most 700ms of backoff delay in total (100 + 200 + 400ms), not counting the time taken by the failed attempts themselves.
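As a quick check of that figure, the three delays form a geometric series with first term 100ms and ratio 2 (assuming, as the listed delays suggest, that the delay simply doubles on each retry):

$$
\sum_{k=0}^{2} 100 \cdot 2^{k} \;=\; 100\,(2^{3} - 1) \;=\; 700 \ \text{ms}
$$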

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

natea and others added 2 commits August 15, 2025 10:47
Exclude .claude-flow/, .swarm/ directories and temporary test/analysis files
from version control to keep the repository clean.

Implement retry mechanism with exponential backoff for the
checkForMissingDataIfNecessary method in DatafeedJob. This improves
resilience when checking for missing data encounters transient failures.

Changes:
- Add retry logic with up to 3 retries using exponential backoff
- Log warnings on retry attempts
- Issue audit warning if all retries fail
- Continue datafeed operation even if missing data check fails
- Add comprehensive unit tests for retry scenarios

The backoff delays are: 100ms, 200ms, 400ms for successive retries.

Fixes elastic#107873

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
@elasticsearchmachine added the external-contributor, needs:triage, and v9.2.0 labels Aug 15, 2025
@szybia added the :ml label and removed the needs:triage label Aug 18, 2025
@elasticsearchmachine added the Team:ML label Aug 18, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@davidkyle
Member

@elasticmachine test this please

Labels

>enhancement, external-contributor (Pull request authored by a developer outside the Elasticsearch team), :ml (Machine learning), Team:ML (Meta label for the ML team), v9.3.0

Development

Successfully merging this pull request may close these issues.

[ML] Datafeed should retry failures while checking for missing data

5 participants