Add retry logic to SWE-ReX by joyliu-q · Pull Request #204 · SWE-agent/SWE-ReX

joyliu-q · 2025-06-23T14:44:48Z

Overview

This PR adds HTTP retry logic, which users often had to implement themselves.

It is built on top of #197, which makes the remote runtime fully async with aiohttp.

Retry logic for server failures (server side and remote side)

Transient network errors cause requests to fail because requests aren't retried.

Client makes Request X to the server
Request X succeeds, but the connection between server and client drops
To the client, it looks like Request X failed

The naive solution is to retry requests to the SWE-ReX server. However, simply adding retries can lead to idempotency errors, such as the following scenario:

Client makes Request X to the server
Request X succeeds, but the connection between server and client drops
Client retries by sending Request X-2.0 to the server
Request X-2.0 runs again, causing a double run.

This PR implements retries in which each request carries a UUID (idempotency key) that is generated on the client side and sent in the request headers. Key + Response pairs are cached on the server side by a ResponseManager. Retries of the same request reuse the same UUID.

This allows retries to succeed while each command runs only once:

Client makes Request X with Key A to the server
Request X succeeds and the server caches Key A with its response, but the connection between server and client drops
Client retries by sending Request X-2.0 with Key A to the server
Server recognizes that Key A is the same and returns the cached response to the client
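
The flow above can be sketched as a small in-process simulation. This is only an illustration under stated assumptions: `FakeServer` and `send_with_retry` are hypothetical names, no real network or SWE-ReX code is involved, and the "dropped connection" is simulated by discarding the first response.

```python
import uuid


class FakeServer:
    """Hypothetical stand-in for the server; not the real SWE-ReX implementation."""

    def __init__(self):
        self.run_count = 0           # how many times a command actually executed
        self._cached_key = None      # idempotency key of the latest request
        self._cached_response = None

    def handle(self, key, command):
        # Replay the cached response if this key was already handled.
        if key == self._cached_key:
            return self._cached_response
        self.run_count += 1
        response = f"ran {command!r}"
        self._cached_key, self._cached_response = key, response
        return response


def send_with_retry(server, command, drop_first_n=1, max_attempts=3):
    """Send a command, reusing one idempotency key across all attempts."""
    key = str(uuid.uuid4())  # generated once, on the client side
    for attempt in range(max_attempts):
        response = server.handle(key, command)  # the server processes the request
        if attempt < drop_first_n:
            # Simulate the connection dropping after the server finished:
            # the client never sees this response and has to retry.
            continue
        return response
    raise ConnectionError("all attempts failed")


server = FakeServer()
result = send_with_retry(server, "echo hi")
```

Here the first response is lost, the retry carries the same key, and `server.run_count` stays at 1 because the second attempt is answered from the cache.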

Edge case: Concurrent Clients

This fix assumes that only one client issues requests to the server at a time. As such, our ResponseManager caches only the most recently executed Key + Response pair. If multiple concurrent clients are a use case SWE-ReX often sees, we can add a follow-up with a more elaborate ResponseManager that caches more than one pair.
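
To make the trade-off concrete, here is a minimal sketch contrasting a latest-only cache with one possible follow-up. Both classes are hypothetical illustrations, not the PR's ResponseManager; the bounded LRU variant is just one assumption about what "more elaborate" caching could look like.

```python
from collections import OrderedDict


class LatestOnlyCache:
    """Keeps only the most recent key/response pair (single-client assumption)."""

    def __init__(self):
        self._key = None
        self._response = None

    def get(self, key):
        return self._response if key is not None and key == self._key else None

    def put(self, key, response):
        self._key, self._response = key, response


class BoundedCache:
    """Hypothetical follow-up: remember the last `capacity` keys with LRU eviction."""

    def __init__(self, capacity=128):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, key, response):
        self._entries[key] = response
        self._entries.move_to_end(key)
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the oldest entry


latest, bounded = LatestOnlyCache(), BoundedCache(capacity=2)
for cache in (latest, bounded):
    cache.put("key-a", "response-a")  # client A's request
    cache.put("key-b", "response-b")  # client B's request arrives in between
```

With two interleaved clients, a retry for `key-a` misses in `LatestOnlyCache` (so the command would run again) but still hits in `BoundedCache`.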

Testing

To test this behavior, we manually injected random network errors and ensured that the implementation works.

Copilot

Pull Request Overview

This PR adds client-side retry logic with idempotency keys and server-side caching of responses to prevent duplicate executions.

Introduces a ResponseManager and a handle_request_id middleware in server.py to replay cached responses for repeated requests.
Refactors the remote runtime in remote.py to use aiohttp with exponential backoff retry, UUID-based idempotency headers, and async I/O.
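
A rough sketch of the client-side pattern summarized above, using only `asyncio` so it runs without a server. The helper name `request_with_retry`, the `X-Request-ID` header name, the delay values, and the `send` callback (standing in for an aiohttp call) are all assumptions for illustration; real code would catch aiohttp's own exception types rather than `ConnectionError`.

```python
import asyncio
import uuid


async def request_with_retry(send, *, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Call `send(headers)` until it succeeds, backing off exponentially.

    `send` is a hypothetical coroutine function standing in for an HTTP call.
    """
    # One key per logical request, reused on every attempt.
    headers = {"X-Request-ID": str(uuid.uuid4())}  # header name is an assumption
    for attempt in range(max_attempts):
        try:
            return await send(headers)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            await asyncio.sleep(min(max_delay, base_delay * 2 ** attempt))


seen_ids = []


async def flaky_send(headers):
    """Fails twice with a simulated network error, then succeeds."""
    seen_ids.append(headers["X-Request-ID"])
    if len(seen_ids) < 3:
        raise ConnectionError("simulated drop")
    return "ok"


result = asyncio.run(request_with_retry(flaky_send, base_delay=0.001))
```

All three attempts carry the same request ID, which is what lets the server recognize a retry and replay its cached response.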

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
src/swerex/server.py	Added `ResponseManager` and `handle_request_id` middleware to cache and replay request responses.
src/swerex/runtime/remote.py	Switched to `aiohttp`, implemented retry logic with backoff, idempotency headers, and async methods.

Comments suppressed due to low confidence (2)

src/swerex/server.py:54

Accessing and modifying shared state in ResponseManager without synchronization can cause race conditions under concurrent requests. Consider using an asyncio.Lock or thread-safe structure.

    def get_response(self, request_id):
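
One way the suggested synchronization could look is sketched below. `LockedResponseManager` and `get_or_run` are hypothetical names, not the code under review, and holding the lock while the command runs serializes all requests, which is only reasonable under the single-client assumption stated in the PR description.

```python
import asyncio


class LockedResponseManager:
    """Hypothetical lock-guarded variant of a latest-only response cache."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._key = None
        self._response = None

    async def get_or_run(self, request_id, run):
        # Hold the lock across lookup and execution so that a concurrent retry
        # with the same key waits and then receives the cached response.
        async with self._lock:
            if request_id == self._key:
                return self._response
            response = await run()
            self._key, self._response = request_id, response
            return response


async def main():
    manager = LockedResponseManager()  # created inside the running event loop
    calls = 0

    async def run():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # pretend the command takes a while
        return "done"

    results = await asyncio.gather(
        manager.get_or_run("key-a", run),
        manager.get_or_run("key-a", run),
    )
    return calls, results


calls, results = asyncio.run(main())
```

Both concurrent calls return the same response, and the command body executes only once.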

src/swerex/runtime/remote.py:5

The sys import is unused and can be removed to clean up the module.

import sys

klieret · 2025-06-26T16:11:12Z

This all looks good to me! Generally, we start one deployment per swe-agent instance, so the requests should be synchronous, so I don't think we necessarily need to generalize the caching (though it seems easy enough to do with what you've built so far already!).

I've looked through the changes and couldn't find an immediate problem, but this definitely doesn't run for me at the moment:

If I'm on main, I can very quickly do

 pytest tests/test_execution.py::test_server_alive

However, on this branch, it first fails, and then cleanup takes half a minute:

FAILED tests/test_execution.py::test_server_alive - AssertionError: assert IsAliveResponse(is_alive=False, message='Failed to connect to http://127.0.0.1\nTraceback (most recent call last):\n  ...er_exception\n    raise exception from None\nAttributeError: \'async_generator\' object has no attribute \'encode\'\n')
ERROR tests/test_execution.py::test_server_alive - RuntimeError: Event loop is closed

I can continue to investigate that later, but I would also greatly appreciate it if you could take a look.

klieret · 2025-06-26T16:15:43Z

~~OK, on second look, this might actually have to do more with problems in conftest.py.~~

klieret · 2025-06-26T16:41:02Z

I added two commits here: #212 that seem to make this work, but it's very very slow on the CI runner. Not sure what's going on there...

klieret · 2025-08-06T01:03:18Z

This seems to have auto-closed because it had the SWE-agent:rebased-async-remote-runtime branch as a target and I just merged that one in (and auto-delete of the merge head is enabled in this repo).

klieret · 2025-08-06T01:40:47Z

@joyliu-q Merged it all via #233. There was still one problem that I fixed in a6151ec

klieret · 2025-08-06T01:42:58Z

Oh, and I just realized that we don't actually use _request_with_retry? That's strange, maybe something got lost here?

klieret · 2025-08-06T02:04:25Z

#234 should bring the retry logic into the normal _request method

klieret · 2025-08-06T13:59:00Z

Closing this because all the commits should now be included :)

Thank you so much again ❤️ And sorry for all the delays!

joyliu-q force-pushed the joy/add-retry-logic branch 3 times, most recently from 5745f3a to 019cc08 Compare June 23, 2025 15:46

joyliu-q marked this pull request as ready for review June 24, 2025 19:51

klieret requested a review from Copilot June 25, 2025 13:54

Copilot AI reviewed Jun 25, 2025

View reviewed changes

This was referenced Jun 25, 2025

Make remote runtime fully async with aiohttp #197

Closed

Open files not closed in remote.py #209

Closed

saltzm and others added 6 commits July 1, 2025 14:30

Make remote runtime fully async with aiohttp

5f2d9e9

[pre-commit.ci] auto fixes from pre-commit.com hooks

b98c32a

for more information, see https://pre-commit.ci

CI: Properly cleanup sessions

de0766f

Recreate session object for every request

3b78fe4

This mimics the prior behavior of the requests library and makes it more resilient to the server closing client connections

Properly close session in tests

818d7d6

[pre-commit.ci] auto fixes from pre-commit.com hooks

676fb02

for more information, see https://pre-commit.ci

joyliu-q changed the base branch from main to rebased-async-remote-runtime July 14, 2025 16:35

joyliu-q added 4 commits July 14, 2025 16:38

🎉 Add retry logic to SWE-Rex

42b9eac

📝 Update wording

06ea2a6

📝 Address comments

82dc4e1

🎨 Address copilot comments

fd6b595

joyliu-q force-pushed the joy/add-retry-logic branch from 3a29a62 to fd6b595 Compare July 14, 2025 16:40

klieret deleted the branch SWE-agent:main August 6, 2025 00:58

klieret closed this Aug 6, 2025

klieret reopened this Aug 6, 2025

klieret changed the base branch from rebased-async-remote-runtime to main August 6, 2025 01:04

klieret closed this Aug 6, 2025

joyliu-q wants to merge 10 commits into SWE-agent:main from joyliu-q:joy/add-retry-logic
4 participants