-
Notifications
You must be signed in to change notification settings - Fork 27
Add retry logic with exponential backoff ensureAgentReady #887
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the ensureAgentReady function by adding retry logic with exponential backoff to handle asynchronous agent initialization.
- Introduces constants for maximum retries, base delay, and maximum delay.
- Wraps
getAgentInfoin a retry loop, handling different agent statuses (starting, suspended, terminated) and resuming suspended agents. - Implements backoff delays and context cancellation between attempts, with a final error when all retries are exhausted.
Comments suppressed due to low confidence (3)
runtime/actors/agents.go:180
- The error returned from getAgentInfo is not checked, which may hide underlying failures (e.g., database errors). Consider handling
errbefore inspectinginfoorpid.
info, pid, err := getAgentInfo(ctx, agentId)
runtime/actors/agents.go:172
- [nitpick] The function now includes retry and backoff logic; please update or add a doc comment to explain the retry strategy, delay parameters, and the conditions under which it returns errors.
func ensureAgentReady(ctx context.Context, agentId string) (*AgentInfo, *goakt.PID, error) {
runtime/actors/agents.go:179
- The new retry and resume logic covers several code paths; consider adding unit tests that simulate different agent statuses and context cancellations to verify behavior across attempts.
for attempt := 0; attempt < maxRetries; attempt++ {
| } | ||
|
|
||
| // Wait before retrying with exponential backoff | ||
| delay := time.Duration(float64(baseDelay) * (1.5*float64(attempt) + 1)) |
Copilot
AI
Jun 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current formula yields linear backoff rather than true exponential backoff. To match the PR description, consider using a power function (e.g., baseDelay * time.Duration(math.Pow(2, float64(attempt)))).
| go func() { | ||
| if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil { | ||
| logger.Err(context.Background(), err).Msgf("Failed to resume agent %s", agentId) | ||
| } | ||
| }() |
Copilot
AI
Jun 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using context.Background() for logging drops metadata (tracing, cancellation) from the original context. Consider deriving a child context or using the existing ctx for richer diagnostics.
| go func() { | |
| if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil { | |
| logger.Err(context.Background(), err).Msgf("Failed to resume agent %s", agentId) | |
| } | |
| }() | |
| go func(parentCtx context.Context) { | |
| derivedCtx := context.WithValue(parentCtx, "operation", "resumeAgent") | |
| if _, err := spawnActorForAgent(host, plugin, agentId, info.Name, false); err != nil { | |
| logger.Err(derivedCtx, err).Msgf("Failed to resume agent %s", agentId) | |
| } | |
| }(ctx) |
|
Thanks for pointing out the issue, and for the PR. I decided instead to go with the approach in #890. |
This PR adds retry logic with exponential backoff to ensureAgentReady to handle race conditions when agents are starting or resuming.
This fixes issues where SendMessage calls immediately after StartAgent would fail with "agent not found" errors due to asynchronous agent initialization.