@erayack erayack commented Oct 16, 2025

Summary

This PR fixes issue #599 by improving the backoff retry mechanism in the TAP agent. When a sender is denied, it aggregates too slowly: RAV requests are triggered only by UpdateReceiptFees messages, and those are sent only when a RAV request finishes or after a fixed retry_interval (~30 secs). As a result, aggregation lags even when the aggregator is up.

Problem

As described in issue #599:

  • When a sender is denied, it aggregates too slowly
  • RAV requests are only triggered by UpdateReceiptFees messages
  • Those messages are only sent when the RAV request finishes or after retry_interval (~30 secs)
  • This makes aggregation too slow when the aggregator is UP but the tap-agent was down for a while

Solution

Use backoff information to retry aggregating instead of a fixed retry_interval (as suggested in the issue):

Backoff System Improvements

  • Refactored BackoffInfo: Replaced in_backoff() with remaining() method for more precise timing control
  • Added min_remaining_backoff(): New method in SenderFeeTracker to get the minimum remaining backoff across all allocations
  • Dynamic retry scheduling: Implemented schedule_retry() method that considers both global and allocation-specific backoff states
  • Removed hardcoded intervals: Eliminated retry_interval from SenderAccountArgs in favor of dynamic backoff-based scheduling
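
To illustrate why a remaining() method gives finer control than a boolean in_backoff(), here is a minimal sketch. It is not the PR's actual code: the type name BackoffSketch, its fields, and the explicit `now` and `delay` parameters are assumptions chosen to keep the example deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the PR's `BackoffInfo`; the fields are assumptions.
#[derive(Debug, Clone)]
pub struct BackoffSketch {
    backoff_until: Instant,
}

impl BackoffSketch {
    /// Starts with no backoff pending.
    pub fn new(now: Instant) -> Self {
        Self { backoff_until: now }
    }

    /// Records a failure: the backoff window ends `delay` after `now`.
    pub fn fail(&mut self, now: Instant, delay: Duration) {
        self.backoff_until = now + delay;
    }

    /// Records a success: any pending backoff is cleared.
    pub fn ok(&mut self, now: Instant) {
        self.backoff_until = now;
    }

    /// Time left in the backoff window, or `Duration::ZERO` once it has expired.
    pub fn remaining(&self, now: Instant) -> Duration {
        self.backoff_until.saturating_duration_since(now)
    }

    /// The old boolean view can be derived from `remaining()`, but not the reverse.
    pub fn in_backoff(&self, now: Instant) -> bool {
        !self.remaining(now).is_zero()
    }
}
```

With only in_backoff(), a caller knows that it must wait but not for how long, so it has to poll or fall back to a fixed interval. With remaining(), the caller can schedule the next attempt for exactly when the window closes.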

Testing & Quality

  • Unit tests: Added retry_helpers_tests module with tests for:
    • BackoffInfo::remaining() reset behavior after successful operations
    • SenderFeeTracker::min_remaining_backoff() roundtrip functionality
  • Fixed syntax error: Resolved missing closing brace in tracker.rs
  • Code formatting: Applied rustfmt standards to all modified files

Technical Details

The retry mechanism now works as follows:

  1. When a RAV request fails, both global (BackoffInfo) and allocation-specific (SenderFeeTracker) backoff states are updated
  2. The next_retry_delay() method calculates the maximum of global and allocation-specific remaining backoff times
  3. Retries are scheduled dynamically based on actual backoff state rather than fixed intervals
  4. This lets a denied sender retry as soon as its backoff expires instead of waiting out a fixed interval, while still preventing rapid retries during the backoff window
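
The combination described in step 2 can be sketched as a pure function over durations. This is a hedged illustration, not the PR's implementation: the free-function signatures, the HashMap keyed by a String allocation id, and the choice to treat "no tracked allocation" as zero delay are all assumptions.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Smallest remaining backoff across all allocations, or `None` if none are tracked.
/// (Hypothetical free-function form of `SenderFeeTracker::min_remaining_backoff()`.)
pub fn min_remaining_backoff(per_allocation: &HashMap<String, Duration>) -> Option<Duration> {
    per_allocation.values().copied().min()
}

/// Delay before the next retry: the larger of the global remaining backoff and
/// the allocation-level remaining backoff.
pub fn next_retry_delay(
    global_remaining: Duration,
    per_allocation: &HashMap<String, Duration>,
) -> Duration {
    // Assumption: with no tracked allocations, only the global backoff applies.
    let allocation_remaining = min_remaining_backoff(per_allocation).unwrap_or(Duration::ZERO);
    global_remaining.max(allocation_remaining)
}
```

Taking the minimum across allocations means the retry fires as soon as at least one allocation is eligible again. Taking the maximum with the global value means the retry never fires while the sender-wide backoff is still active.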

Testing

The new functionality has been tested with standalone tests that verify:

  • Exponential backoff calculation (100ms → 200ms → 400ms, etc.)
  • 60-second backoff cap enforcement
  • Proper reset behavior after successful operations
  • Integration between global and allocation-specific backoff states
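
For reference, the doubling schedule and cap listed above can be expressed as a small function. This is a sketch under stated assumptions, not the crate's code: the constant names, the function name backoff_delay, and the convention that the first failure yields 100ms are illustrative.

```rust
use std::time::Duration;

/// Assumed base delay, matching the 100ms starting point mentioned above.
const BASE_DELAY: Duration = Duration::from_millis(100);
/// Assumed cap, matching the 60-second limit mentioned above.
const MAX_DELAY: Duration = Duration::from_secs(60);

/// Backoff delay after `failed_count` consecutive failures:
/// 100ms, 200ms, 400ms, ... doubling each time and capped at 60s.
pub fn backoff_delay(failed_count: u32) -> Duration {
    if failed_count == 0 {
        // No failures recorded, so no backoff (also models a reset after success).
        return Duration::ZERO;
    }
    // 2^(failed_count - 1), saturating instead of overflowing for large counts.
    let factor = 2u32.checked_pow(failed_count - 1).unwrap_or(u32::MAX);
    BASE_DELAY.saturating_mul(factor).min(MAX_DELAY)
}
```

Under these assumptions the cap is first reached on the eleventh consecutive failure, since 100ms × 2^10 = 102.4s exceeds 60s while 100ms × 2^9 = 51.2s does not.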

Files Modified

  • crates/tap-agent/src/agent/sender_account.rs - Main retry logic improvements
  • crates/tap-agent/src/agent/sender_accounts_manager.rs - Removed hardcoded retry interval
  • crates/tap-agent/src/backoff.rs - Enhanced backoff information tracking
  • crates/tap-agent/src/test.rs - Added comprehensive unit tests
  • crates/tap-agent/src/tracker.rs - Added min_remaining_backoff method and fixed syntax

Fixes

Closes #599

- Refactor BackoffInfo to use remaining() method instead of in_backoff()
- Add min_remaining_backoff() method to SenderFeeTracker
- Implement dynamic retry scheduling based on backoff state
- Remove hardcoded retry_interval from SenderAccountArgs
- Add comprehensive unit tests for backoff functionality
- Fix syntax error in tracker.rs (missing closing brace)
- Improve code formatting and documentation

This addresses issues with the retry mechanism by making it more
adaptive and removing hardcoded retry intervals in favor of
dynamic backoff-based scheduling.
@erayack erayack changed the title fix: improve backoff retry mechanism and add comprehensive tests fix: improve backoff retry mechanism Oct 16, 2025

Development

Successfully merging this pull request may close these issues.

Denied sender aggregates too slow
