@HyeockJinKim

Implement exponential backoff strategy in RedisConsumer to prevent CPU spinning during Redis connection failures.

Changes:

  • Add _BackoffState dataclass to track per-stream backoff state
  • Add backoff configuration to RedisConsumerArgs (initial_delay,
    max_delay, max_attempts)
  • Implement _handle_backoff() with exponential calculation and jitter
  • Implement _reset_backoff() to reset state on successful operations
  • Apply backoff in _read_messages_loop() and _auto_claim_loop() on GlideError and generic Exception

Backoff progression: 0.1s → 0.2s → 0.4s → ... → max 30s

Jitter: 50-100% of calculated delay to prevent thundering herd
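
The progression and jitter described above could be computed roughly as follows. This is a minimal sketch, not the code merged in this PR: `base_delay` and `jittered_delay` are hypothetical helper names, the defaults mirror the 0.1s and 30s values quoted above, and drawing the jitter factor with `random.uniform(0.5, 1.0)` is an assumption about how the 50-100% range might be produced.

```python
import random


def base_delay(attempt: int, initial_delay: float = 0.1, max_delay: float = 30.0) -> float:
    """Exponential delay for a 1-based attempt number, capped at max_delay."""
    return min(initial_delay * (2 ** (attempt - 1)), max_delay)


def jittered_delay(attempt: int, initial_delay: float = 0.1, max_delay: float = 30.0) -> float:
    """Scale the capped delay to 50-100% of its value (assumed jitter scheme)."""
    delay = base_delay(attempt, initial_delay, max_delay)
    return delay * random.uniform(0.5, 1.0)


if __name__ == "__main__":
    for attempt in range(1, 11):
        print(attempt, base_delay(attempt), round(jittered_delay(attempt), 3))
```

Because the jitter only shrinks the delay, the cap still bounds the actual sleep time, while consumers that fail at the same moment are unlikely to retry in lockstep.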

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

resolves #NNN (BA-MMM)

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 27, 2026 20:26
@github-actions github-actions bot added size:M 30~100 LoC comp:common Related to Common component labels Jan 27, 2026
@HyeockJinKim HyeockJinKim added this pull request to the merge queue Jan 27, 2026

Copilot AI left a comment

Pull request overview

Adds an exponential backoff strategy to RedisConsumer to avoid CPU spinning during Redis connection failures.

Changes:

  • Introduces per-stream backoff state tracking via _BackoffState.
  • Adds backoff configuration fields to RedisConsumerArgs.
  • Applies backoff delays on GlideError and generic exceptions in read and autoclaim loops.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ai/backend/common/message_queue/redis_queue/consumer.py | Adds backoff configuration/state and applies exponential backoff with jitter on reconnect/error paths. |
| changes/8387.feature.md | Adds a changelog entry for the new backoff behavior. |

Comment on lines +316 to 317
self._reset_backoff(stream_key)
continue
Copilot AI Jan 27, 2026

Backoff is only reset when claimed is truthy, but a successful autoclaim that returns claimed == False still indicates the operation succeeded. If there were prior errors, the consumer will keep using an inflated backoff even though the connection has recovered. Consider calling _reset_backoff(stream_key) after a successful autoclaim request regardless of whether any messages were claimed (i.e., outside the if claimed: block).

Suggested change
- self._reset_backoff(stream_key)
- continue
+ self._reset_backoff(stream_key)
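
To illustrate the point of this comment, here is a minimal sketch in which the reset happens as soon as the autoclaim request returns, whether or not anything was claimed. `FakeConsumer`, `_auto_claim`, and `run_once` are hypothetical stand-ins for illustration only and do not reflect the real `RedisConsumer` internals.

```python
import asyncio


class FakeConsumer:
    """Hypothetical stand-in that replays canned autoclaim results."""

    def __init__(self, results: list[list[str]]) -> None:
        self._results = list(results)
        self.attempt = 3  # pretend earlier errors inflated the backoff
        self.processed: list[str] = []

    async def _auto_claim(self) -> list[str]:
        return self._results.pop(0)

    def _reset_backoff(self) -> None:
        self.attempt = 0

    async def run_once(self) -> None:
        claimed = await self._auto_claim()
        # The request itself succeeded, so reset before looking at the result.
        self._reset_backoff()
        if claimed:
            self.processed.extend(claimed)


if __name__ == "__main__":
    consumer = FakeConsumer([[]])
    asyncio.run(consumer.run_once())
    print(consumer.attempt)
```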

stream_key, self._group_name, message.msg_id, message.payload
)

async def _handle_backoff(self, stream_key: str) -> None:
Copilot AI Jan 27, 2026

The PR adds backoff_max_attempts to RedisConsumerArgs, but _handle_backoff() never enforces it. This makes the configuration misleading and prevents callers from bounding retries. Consider checking self._backoff_max_attempts after state.increment() and, when exceeded, either raise (to stop the loop), close the consumer, or log and rethrow the last exception so the task fails deterministically.
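
One way the limit could be enforced is sketched below, under stated assumptions: `BackoffExhausted`, `BackoffState`, and `next_delay` are hypothetical names, and treating `max_attempts=None` as "retry forever" is an assumption rather than something the PR specifies.

```python
from dataclasses import dataclass
from typing import Optional


class BackoffExhausted(RuntimeError):
    """Hypothetical error raised once the retry budget is used up."""


@dataclass
class BackoffState:
    attempt: int = 0

    def increment(self) -> None:
        self.attempt += 1


def next_delay(
    state: BackoffState,
    initial_delay: float = 0.1,
    max_delay: float = 30.0,
    max_attempts: Optional[int] = None,
) -> float:
    """Advance the state and return the next delay, or raise when exhausted."""
    state.increment()
    # Check the bound right after incrementing, as the comment suggests.
    if max_attempts is not None and state.attempt > max_attempts:
        raise BackoffExhausted(f"gave up after {max_attempts} attempts")
    return min(initial_delay * (2 ** (state.attempt - 1)), max_delay)
```

Raising lets the caller decide whether to stop the loop, close the consumer, or log and re-raise, which keeps the failure deterministic.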

Comment on lines +395 to +405
        if stream_key not in self._backoff_state:
            self._backoff_state[stream_key] = _BackoffState()

        state = self._backoff_state[stream_key]
        state.increment()

        # Calculate delay with exponential backoff
        delay = min(
            self._backoff_initial_delay * (2 ** (state.attempt - 1)),
            self._backoff_max_delay,
        )
Copilot AI Jan 27, 2026

The PR adds backoff_max_attempts to RedisConsumerArgs, but _handle_backoff() never enforces it. This makes the configuration misleading and prevents callers from bounding retries. Consider checking self._backoff_max_attempts after state.increment() and, when exceeded, either raise (to stop the loop), close the consumer, or log and rethrow the last exception so the task fails deterministically.

Comment on lines +410 to +411
await asyncio.sleep(actual_delay)

Copilot AI Jan 27, 2026

Backoff sleep does not account for shutdown. If self._closed becomes true while sleeping, the task won’t observe it until the sleep completes, which can delay shutdown by up to backoff_max_delay. Consider short-circuiting before sleeping when closed, or sleeping in a way that can be interrupted (e.g., waiting on a shutdown event / task cancellation-aware wait).

Suggested change
- await asyncio.sleep(actual_delay)
+ # Sleep in small chunks so that shutdown (_closed) can be observed promptly.
+ if getattr(self, "_closed", False):
+     return
+ loop = asyncio.get_running_loop()
+ deadline = loop.time() + actual_delay
+ while True:
+     if getattr(self, "_closed", False):
+         return
+     remaining = deadline - loop.time()
+     if remaining <= 0:
+         break
+     # Sleep for at most 1 second at a time to re-check shutdown state.
+     try:
+         await asyncio.sleep(min(1.0, remaining))
+     except asyncio.CancelledError:
+         # Allow task cancellation to propagate for cooperative shutdown.
+         raise
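
The other option mentioned in the comment, waiting on a shutdown event, might look like the sketch below. It assumes the consumer exposes an `asyncio.Event` that is set on close, which the code shown in this PR does not confirm; `sleep_or_shutdown` is a hypothetical helper name.

```python
import asyncio


async def sleep_or_shutdown(shutdown: asyncio.Event, delay: float) -> bool:
    """Wait up to `delay` seconds; return True if shutdown was signalled first."""
    if shutdown.is_set():
        return True
    try:
        # Wakes up as soon as the event is set instead of polling in chunks.
        await asyncio.wait_for(shutdown.wait(), timeout=delay)
    except asyncio.TimeoutError:
        return False
    return True


async def main() -> None:
    shutdown = asyncio.Event()
    # Simulate a close() call arriving shortly after the backoff sleep starts.
    asyncio.get_running_loop().call_later(0.05, shutdown.set)
    print(await sleep_or_shutdown(shutdown, 30.0))


if __name__ == "__main__":
    asyncio.run(main())
```

Compared with sleeping in one-second chunks, this reacts to shutdown immediately, and task cancellation still propagates through `wait_for`.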

Comment on lines +38 to +48
    last_error_time: float = 0.0

    def increment(self) -> None:
        """Increment attempt counter and record error time."""
        self.attempt += 1
        self.last_error_time = time.monotonic()

    def reset(self) -> None:
        """Reset backoff state after successful operation."""
        self.attempt = 0
        self.last_error_time = 0.0
Copilot AI Jan 27, 2026

last_error_time is written but (in this diff) never read, which adds state and an extra dependency (time.monotonic()) without affecting behavior. Either remove last_error_time until it’s needed, or use it (e.g., for logging, metrics, or to avoid backoff increments within a small time window).

Suggested change
-     last_error_time: float = 0.0
-     def increment(self) -> None:
-         """Increment attempt counter and record error time."""
-         self.attempt += 1
-         self.last_error_time = time.monotonic()
-     def reset(self) -> None:
-         """Reset backoff state after successful operation."""
-         self.attempt = 0
-         self.last_error_time = 0.0
+     def increment(self) -> None:
+         """Increment attempt counter after an error."""
+         self.attempt += 1
+     def reset(self) -> None:
+         """Reset backoff state after successful operation."""
+         self.attempt = 0
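
If the field is kept rather than removed, the "small time window" idea from the comment could be sketched as follows. This is only an illustration: the `min_interval` parameter, its default value, and the injectable `now` argument are assumptions and are not part of the PR.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackoffState:
    attempt: int = 0
    last_error_time: float = 0.0

    def increment(self, now: Optional[float] = None, min_interval: float = 0.05) -> None:
        """Count an error unless it falls in the same burst as the last counted one."""
        now = time.monotonic() if now is None else now
        if self.attempt > 0 and now - self.last_error_time < min_interval:
            return  # same burst: do not inflate the backoff further
        self.attempt += 1
        self.last_error_time = now

    def reset(self) -> None:
        """Reset backoff state after a successful operation."""
        self.attempt = 0
        self.last_error_time = 0.0
```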

Merged via the queue into main with commit f88e265 Jan 27, 2026
30 checks passed
@HyeockJinKim HyeockJinKim deleted the feat/add-backoff-redis branch January 27, 2026 20:38