Skip to content
This repository was archived by the owner on Sep 11, 2025. It is now read-only.

Conversation

@danstarns
Copy link
Contributor

This PR adds retry logic with exponential backoff to ensureAgentReady to handle race conditions when agents are starting or resuming.

This fixes issues where SendMessage calls immediately after StartAgent would fail with "agent not found" errors due to asynchronous agent initialization.

@danstarns danstarns requested review from a team and Copilot June 11, 2025 12:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the ensureAgentReady function by adding retry logic with exponential backoff to handle asynchronous agent initialization.

  • Introduces constants for maximum retries, base delay, and maximum delay.
  • Wraps getAgentInfo in a retry loop, handling different agent statuses (starting, suspended, terminated) and resuming suspended agents.
  • Implements backoff delays and context cancellation between attempts, with a final error when all retries are exhausted.
Comments suppressed due to low confidence (3)

runtime/actors/agents.go:180

  • The error returned from getAgentInfo is not checked, which may hide underlying failures (e.g., database errors). Consider handling err before inspecting info or pid.
info, pid, err := getAgentInfo(ctx, agentId)

runtime/actors/agents.go:172

  • [nitpick] The function now includes retry and backoff logic; please update or add a doc comment to explain the retry strategy, delay parameters, and the conditions under which it returns errors.
func ensureAgentReady(ctx context.Context, agentId string) (*AgentInfo, *goakt.PID, error) {

runtime/actors/agents.go:179

  • The new retry and resume logic covers several code paths; consider adding unit tests that simulate different agent statuses and context cancellations to verify behavior across attempts.
for attempt := 0; attempt < maxRetries; attempt++ {

}

// Wait before retrying with exponential backoff
delay := time.Duration(float64(baseDelay) * (1.5*float64(attempt) + 1))
Copy link

Copilot AI Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current formula yields linear backoff rather than true exponential backoff. To match the PR description, consider using a power function (e.g., baseDelay * time.Duration(math.Pow(2, float64(attempt)))).

Copilot uses AI. Check for mistakes.
Comment on lines +202 to +206
go func() {
if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil {
logger.Err(context.Background(), err).Msgf("Failed to resume agent %s", agentId)
}
}()
Copy link

Copilot AI Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using context.Background() for logging drops metadata (tracing, cancellation) from the original context. Consider deriving a child context or using the existing ctx for richer diagnostics.

Suggested change
go func() {
if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil {
logger.Err(context.Background(), err).Msgf("Failed to resume agent %s", agentId)
}
}()
go func(parentCtx context.Context) {
derivedCtx := context.WithValue(parentCtx, "operation", "resumeAgent")
if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil {
logger.Err(derivedCtx, err).Msgf("Failed to resume agent %s", agentId)
}
}(ctx)

Copilot uses AI. Check for mistakes.
@mattjohnsonpint
Copy link
Contributor

Thanks for pointing out the issue, and for the PR. I decided instead to go with the approach in #890.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants