Add retry logic to SWE-ReX #204

Closed
joyliu-q wants to merge 10 commits into SWE-agent:main from joyliu-q:joy/add-retry-logic

@joyliu-q
Collaborator

@joyliu-q joyliu-q commented Jun 23, 2025

Overview

This PR adds HTTP retry logic which users often had to implement themselves.

It is built on top of #197, which makes the remote runtime fully async with aiohttp.

Retry logic for server failures (server side and remote side)

Transient network errors cause requests to fail because requests aren't retried.

  • Client makes Request X to the server
  • Request X succeeds, but the connection between server and client drops
  • To the client, it looks like Request X failed

The naive solution is to retry requests to the SWE-ReX server. However, simply adding retries can lead to idempotency errors, such as the following scenario:

  • Client makes Request X to the server
  • Request X succeeds, but the connection between server and client drops
  • Client retries the request as Request X-2.0
  • Request X-2.0 runs the command again, causing a double run

In this PR, each request carries a UUID (idempotency key) that is generated on the client side and sent in the request headers. The server caches Key + Response pairs in a ResponseManager. Retries of the same request reuse the same UUID.

This lets retries succeed while each command runs only once:

  • Client makes Request X with Key A to the server
  • Request X succeeds and the server caches Key A, but the connection between server and client drops
  • Client retries as Request X-2.0, again with Key A
  • Server sees that Key A matches the cached key and returns the cached response to the client
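
The flow above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `FakeServer` and `request_with_retry` are hypothetical names, the server is an in-process stand-in rather than a real HTTP endpoint, and the dropped connection is simulated with a `ConnectionError`.

```python
import asyncio
import uuid


class FakeServer:
    """Hypothetical stand-in for the server; counts how often a command really runs."""

    def __init__(self):
        self.runs = 0
        self.last_key = None
        self.last_response = None
        self.drop_next_reply = True  # simulate exactly one dropped connection

    async def handle(self, key, command):
        if key == self.last_key:
            # Same idempotency key: replay the cached response, do not re-run.
            return self.last_response
        self.runs += 1
        response = f"ran {command!r}"
        self.last_key, self.last_response = key, response
        if self.drop_next_reply:
            self.drop_next_reply = False
            # The command ran, but the reply never reaches the client.
            raise ConnectionError("connection dropped after execution")
        return response


async def request_with_retry(server, command, max_attempts=3):
    key = str(uuid.uuid4())  # generated once, reused for every retry
    for attempt in range(max_attempts):
        try:
            return await server.handle(key, command)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller


def demo():
    server = FakeServer()
    result = asyncio.run(request_with_retry(server, "echo hi"))
    return result, server.runs
```

In this sketch the first attempt executes the command and then fails on the way back; the second attempt hits the cache, so `server.runs` stays at 1 even though the client sent the request twice.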

Edge case: Concurrent Clients

This fix assumes that only one client issues requests to the server at a time. As such, our ResponseManager caches only the most recently executed Key + Response pair. If multiple concurrent clients are a use case SWE-ReX often sees, we can add a follow-up with a more complex ResponseManager that handles more general caching.
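
To make the limitation concrete, here is a rough sketch of a cache that keeps only the latest pair. `LatestOnlyResponseManager` and `set_response` are hypothetical names chosen for illustration; only `get_response(self, request_id)` appears in the review excerpt, and the real ResponseManager may differ.

```python
class LatestOnlyResponseManager:
    """Illustrative cache that remembers only the most recent key/response pair."""

    def __init__(self):
        self._key = None
        self._response = None

    def get_response(self, request_id):
        # Return the cached response only when the key matches the latest one.
        if request_id is not None and request_id == self._key:
            return self._response
        return None

    def set_response(self, request_id, response):
        # Overwrites whatever was cached before.
        self._key, self._response = request_id, response


def interleaved_clients():
    manager = LatestOnlyResponseManager()
    manager.set_response("key-a", "result-a")
    hit = manager.get_response("key-a")        # still cached
    manager.set_response("key-b", "result-b")  # a second client overwrites the slot
    miss = manager.get_response("key-a")       # key-a has been evicted
    return hit, miss
```

With a single client this is enough, because a retry always targets the latest request. With interleaved clients, a retry for `key-a` after `key-b` has been stored would miss the cache and run again; a dictionary keyed by request id (with some eviction policy) would be one way to generalize.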

Testing

To test this behavior, we manually injected random network errors and ensured that the implementation works.

@joyliu-q joyliu-q force-pushed the joy/add-retry-logic branch 3 times, most recently from 5745f3a to 019cc08 Compare June 23, 2025 15:46
@joyliu-q joyliu-q marked this pull request as ready for review June 24, 2025 19:51
@klieret klieret requested a review from Copilot June 25, 2025 13:54
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds client-side retry logic with idempotency keys and server-side caching of responses to prevent duplicate executions.

  • Introduces a ResponseManager and a handle_request_id middleware in server.py to replay cached responses for repeated requests.
  • Refactors the remote runtime in remote.py to use aiohttp with exponential backoff retry, UUID-based idempotency headers, and async I/O.
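
For readers unfamiliar with exponential backoff, the following is a generic sketch of the technique using only the standard library. The function names, the base delay, the growth factor, and the cap are all assumptions made for illustration; the values actually used in remote.py are not shown in this conversation.

```python
import asyncio


def backoff_delays(attempts, base=0.5, factor=2.0, cap=8.0):
    """Delays before each retry: base, base*factor, ... limited to cap."""
    return [min(cap, base * factor**i) for i in range(attempts)]


async def retry_with_backoff(operation, attempts=4, base=0.5, sleep=asyncio.sleep):
    """Call operation() until it succeeds or attempts are exhausted."""
    delays = backoff_delays(attempts - 1, base=base)
    for attempt in range(attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # no attempts left
            await sleep(delays[attempt])  # wait longer after each failure
```

The `sleep` parameter is injectable so that the waiting can be replaced in tests; production code would typically also add random jitter to the delays.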

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/swerex/server.py | Added ResponseManager and handle_request_id middleware to cache and replay request responses. |
| src/swerex/runtime/remote.py | Switched to aiohttp, implemented retry logic with backoff, idempotency headers, and async methods. |
Comments suppressed due to low confidence (2)

src/swerex/server.py:54

  • Accessing and modifying shared state in ResponseManager without synchronization can cause race conditions under concurrent requests. Consider using an asyncio.Lock or thread-safe structure.
    def get_response(self, request_id):

src/swerex/runtime/remote.py:5

  • The sys import is unused and can be removed to clean up the module.
import sys
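
Regarding the suppressed comment on ResponseManager synchronization, one possible shape for an asyncio.Lock-guarded cache is sketched below. `LockedResponseManager` and `get_or_run` are hypothetical names, not part of this PR, and the sketch deliberately holds the lock while the handler runs, which serializes requests and matches the one-client-at-a-time assumption stated in the PR description.

```python
import asyncio


class LockedResponseManager:
    """Illustrative latest-pair cache whose check-and-run step is atomic."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._key = None
        self._response = None

    async def get_or_run(self, request_id, handler):
        async with self._lock:
            if request_id == self._key:
                # A request with this key already ran: replay its response.
                return self._response
            response = await handler()
            self._key, self._response = request_id, response
            return response


async def concurrent_duplicates():
    manager = LockedResponseManager()
    runs = 0

    async def handler():
        nonlocal runs
        runs += 1
        await asyncio.sleep(0)  # yield so the duplicate request can interleave
        return "done"

    # Two requests with the same key arrive at (almost) the same time.
    results = await asyncio.gather(
        manager.get_or_run("key-a", handler),
        manager.get_or_run("key-a", handler),
    )
    return results, runs
```

Without the lock, both coroutines could see an empty cache and both run the handler; with it, the second waits and then receives the cached response.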

@klieret
Member

klieret commented Jun 26, 2025

This all looks good to me! Generally, we start one deployment per swe-agent instance, so the requests should be synchronous, so I don't think we necessarily need to generalize the caching (though it seems easy enough to do with what you've built so far already!).

I've looked through the changes and couldn't find an immediate problem, but this definitely doesn't run for me at the moment:

If I'm on main, I can very quickly do

 pytest tests/test_execution.py::test_server_alive

However, on this branch, it first fails, and then cleanup takes half a minute:

FAILED tests/test_execution.py::test_server_alive - AssertionError: assert IsAliveResponse(is_alive=False, message='Failed to connect to http://127.0.0.1\nTraceback (most recent call last):\n  ...er_exception\n    raise exception from None\nAttributeError: \'async_generator\' object has no attribute \'encode\'\n')
ERROR tests/test_execution.py::test_server_alive - RuntimeError: Event loop is closed

I can continue to investigate that later, but would also greatly appreciate it if you could take a look.

@klieret
Member

klieret commented Jun 26, 2025

OK, on second look, this might actually have to do more with problems in conftest.py.

@klieret
Member

klieret commented Jun 26, 2025

I added two commits here: #212 that seem to make this work, but it's very very slow on the CI runner. Not sure what's going on there...

@joyliu-q joyliu-q changed the base branch from main to rebased-async-remote-runtime July 14, 2025 16:35
@joyliu-q joyliu-q force-pushed the joy/add-retry-logic branch from 3a29a62 to fd6b595 Compare July 14, 2025 16:40
@klieret klieret deleted the branch SWE-agent:main August 6, 2025 00:58
@klieret klieret closed this Aug 6, 2025
@klieret
Member

klieret commented Aug 6, 2025

This seems to have auto-closed because it had the SWE-agent:rebased-async-remote-runtime branch as a target and I just merged that one in (and the merge head auto-delete is enabled in this repo).

@klieret klieret reopened this Aug 6, 2025
@klieret klieret changed the base branch from rebased-async-remote-runtime to main August 6, 2025 01:04
@klieret
Member

klieret commented Aug 6, 2025

@joyliu-q Merged it all via #233. There was still one problem that I fixed in a6151ec.

@klieret
Member

klieret commented Aug 6, 2025

Oh, and I just realized that we don't actually use _request_with_retry? That's strange, maybe something got lost here?

@klieret
Member

klieret commented Aug 6, 2025

#234 should bring the retry logic into the normal _request method.

@klieret
Member

klieret commented Aug 6, 2025

Closing this because all the commits should now be included :)

Thank you so much again ❤️ And sorry for all the delays!

@klieret klieret closed this Aug 6, 2025