Cancel button in sandbox; fixes #566 #568

Open
andrewmusselman wants to merge 1 commit into main from cancel-button-566

@andrewmusselman
Collaborator

Sandbox Cancel Button with Elapsed Timer

Fixes #566

Summary

Adds the ability to cancel a running agent from the Sandbox screen. Users can click a Cancel button to immediately stop waiting for results, and an elapsed timer shows how long the agent has been running.

Changes

Backend — routes.py

  • Thread-based execution: Agent code now runs in a separate thread via run_in_executor(), keeping the main asyncio event loop free to handle other HTTP requests (including cancel) while the agent runs.
  • Cancel endpoint: New POST /agents/cancel-run endpoint. Sets a threading.Event that the main event loop polls every 500ms. When detected, returns HTTP 499 immediately.
  • Disconnect detection: Main event loop also polls req.is_disconnected() every 500ms, so navigating away from the page triggers cancellation automatically.
  • Previous run cancellation: If a user starts a new run while one is already in progress, the previous run is cancelled automatically.
  • Module-level state: _cancel_events: Dict[str, threading.Event] keyed by user ID tracks active runs.

Frontend — agentService.js

  • Signal parameter: runCodeInSandbox() now accepts an AbortController signal, passed to the underlying fetch() call.
  • Cancel method: New cancelRun() fire-and-forget method calls the cancel endpoint with keepalive: true so the request survives page navigation and component unmount.

Frontend — SandboxScreen.jsx

  • Cancel button: Red outlined button with StopIcon, visible only while the agent is running.
  • Elapsed timer: Displays running time in M:SS format, updated every second.
  • AbortController lifecycle: Created on run, aborted on cancel. Passed to agentService.runCodeInSandbox().
  • Unmount cleanup: Component teardown aborts the fetch, calls cancelRun(), and clears the timer interval.
  • Error handling: AbortError caught and displayed as "Agent run was cancelled."
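
For reference, the M:SS format used by the elapsed timer can be sketched as below. The real component is JSX. Python is used here only to illustrate the formatting rule, and `format_elapsed` is a hypothetical name.

```python
def format_elapsed(total_seconds: int) -> str:
    """Format a non-negative second count as M:SS (minutes are not padded)."""
    minutes, seconds = divmod(max(0, int(total_seconds)), 60)
    return f"{minutes}:{seconds:02d}"
```

For example, `format_elapsed(65)` gives `"1:05"`.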

Architecture

User clicks Cancel
    │
    ├─► Frontend: AbortController.abort()  →  kills client-side fetch instantly
    │
    └─► Frontend: agentService.cancelRun()  →  POST /agents/cancel-run
                                                       │
                                                       ▼
                                              Sets threading.Event
                                                       │
                                                       ▼
                                              Main event loop detects it
                                              within 500ms, returns 499
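
The server side of this flow could look roughly like the sketch below. It uses only the standard library. `slow_agent` stands in for the agent code, and `is_disconnected` is any async callable standing in for `req.is_disconnected()`. Both names, and `run_with_cancel` itself, are assumptions rather than the actual routes.py code.

```python
import asyncio
import threading
import time

POLL_INTERVAL = 0.5  # seconds, matching the 500ms poll in the description


def slow_agent(duration: float) -> str:
    """Stand-in for blocking agent code running in a worker thread."""
    time.sleep(duration)
    return "agent finished"


async def run_with_cancel(duration, cancel_event, is_disconnected,
                          poll=POLL_INTERVAL):
    """Run the agent in a thread and poll for cancel or disconnect."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, slow_agent, duration)
    while True:
        # Wait up to `poll` seconds for the thread without blocking the loop.
        done, _ = await asyncio.wait({future}, timeout=poll)
        if done:
            return 200, future.result()
        if cancel_event.is_set() or await is_disconnected():
            # The worker thread keeps running; only the response returns early.
            return 499, None
```

In a standalone script, `asyncio.run()` waits for the default executor's threads when it shuts down, so the script does not exit until `slow_agent` finishes. A long-lived server loop does not have that shutdown step on each request.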

Files Modified

File                Changes
routes.py           Thread-based execution, cancel endpoint, disconnect detection
agentService.js     Signal parameter, cancelRun() method with keepalive: true
SandboxScreen.jsx   Cancel button, elapsed timer, AbortController lifecycle

Testing

  1. Run an agent, click Cancel → see "Agent run was cancelled", timer stops.
  2. Run an agent, navigate away → agent cancels via unmount cleanup.
  3. Run an agent, click Run again → previous run auto-cancelled.
  4. Cancel, then immediately start a new run → new run works normally.
  5. Verify docker compose logs -f api shows cancel log lines:
    • >>> cancel-run hit for user local-dev-user
    • Cancel detected for user local-dev-user, returning 499

Type of Change

  • New feature (non-breaking change that adds functionality)

Test Execution

  • All existing tests pass locally

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code where necessary (particularly complex areas)
  • My changes generate no new warnings
  • I have checked that there are no merge conflicts

Screenshots (if applicable)

(two screenshots, not reproduced here)

Known Limitation: Background Thread Continues

Cancellation is cooperative, not preemptive. When a user cancels:

  1. The server responds with HTTP 499 immediately — the user is unblocked.
  2. The event loop is freed — other requests are handled normally.
  3. However, the background thread running the agent continues to completion. Its result is discarded.

This means that in-flight LLM API calls and database operations will finish even after cancellation. The orphaned thread runs to completion and then dies naturally. This is a Python limitation — threads cannot be forcibly terminated.

Practical impact is low for typical usage:

  • The LLM API call has already been sent; cancelling can't reclaim that cost.
  • Database reads (namespace scanning) are lightweight.
  • The thread holds no locks and writes no results after the cancel event is set.
  • Starting a new run works immediately regardless of the orphaned thread.
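
One way a worker thread can honour "writes no results after the cancel event is set" is to check the event between steps and again before the final write. This is a sketch under that assumption. The PR does not show whether the agent code checks the event this way, and `agent_worker` and its parameters are hypothetical.

```python
import threading
import time


def agent_worker(cancel_event, results, steps=5, step_seconds=0.01):
    """Do work in steps, checking the cancel event at each checkpoint."""
    for _ in range(steps):
        if cancel_event.is_set():
            return  # stop early; a step already in flight still finishes
        time.sleep(step_seconds)  # stand-in for an LLM call or database read
    if not cancel_event.is_set():
        results.append("final answer")  # write only if still wanted
```

A step that is already in flight cannot be interrupted this way, which is the limitation described above. The checkpoints only bound how much further work is started.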

Future Work: True Process Termination via Multiprocessing

To actually halt in-flight work, the agent would need to run in a child process instead of a thread, since processes can be killed via process.terminate() / process.kill().

What would change

  1. Replace threading + run_in_executor with multiprocessing.Process

    • Spawn a child process for each agent run.
    • Cancel via process.terminate() (sends SIGTERM) or process.kill() (sends SIGKILL).
    • Return results via multiprocessing.Queue.
  2. Bootstrap connections in the child process

    • multiprocessing serializes (pickles) arguments to send to the child. Live objects like DatabaseService, HTTP clients, and closures cannot be pickled.
    • Instead of passing db, pass primitive config values (CouchDB URL, credentials) and have the child process create its own DatabaseService instance.
    • Same for httpx.AsyncClient, LLM API clients, etc.
  3. Make _execute_agent_code self-contained

    • Must import and initialize all dependencies from scratch — no shared state with the parent.
    • exec_globals would be built inside the child process.
  4. Cleanup and resource management

    • process.terminate() + process.join(timeout) on cancel, with process.kill() as fallback.
    • Handle zombie process cleanup.
    • Consider a process pool to limit concurrent agent runs.
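
The steps above can be sketched as follows. This is an illustration, not the planned implementation. `_agent_entry`, `run_agent_process`, and the `work_seconds` config key are hypothetical, and the `fork` start method is chosen only so the sketch runs as a single POSIX script. A real server would more likely use `spawn` with an importable entry point.

```python
import multiprocessing as mp
import queue
import time


def _agent_entry(config, results):
    """Child entry point: receives only primitive config, no live objects."""
    # A real child would build its own database and HTTP clients here.
    time.sleep(config["work_seconds"])  # stand-in for the agent run
    results.put({"status": "done"})


def run_agent_process(config, cancel_after=None, grace=1.0):
    """Run the agent in a child process; kill it if the deadline passes."""
    ctx = mp.get_context("fork")  # assumption: POSIX only, see lead-in
    results = ctx.Queue()
    proc = ctx.Process(target=_agent_entry, args=(config, results))
    proc.start()
    started = time.monotonic()
    while True:
        try:
            payload = results.get(timeout=0.05)
            proc.join()  # reap the child so it does not linger as a zombie
            return 200, payload
        except queue.Empty:
            pass
        if cancel_after is not None and time.monotonic() - started >= cancel_after:
            proc.terminate()      # SIGTERM first
            proc.join(grace)
            if proc.is_alive():
                proc.kill()       # SIGKILL as the fallback
                proc.join()
            return 499, None
        if not proc.is_alive():
            # The child exited; give a result still in the pipe one last chance.
            try:
                return 200, results.get(timeout=0.1)
            except queue.Empty:
                return 500, None
```

Here `cancel_after` simulates a cancel request arriving after a fixed delay. In the endpoint, that check would instead read the per-user cancel signal.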

Estimated scope

Medium effort (half-day to one day). The main risk is subtle bugs from having independent database connections in the child process, or missing dependencies that exec_globals injects. Testing should cover:

  • Normal run completes and returns results correctly.
  • Cancel terminates the process and returns 499.
  • Rapid cancel-and-rerun doesn't leak processes.
  • Database state is consistent after cancellation mid-write.

When to prioritize this

The current threading approach is sufficient for single-user / low-concurrency usage. Multiprocessing becomes important when:

  • Multiple users run agents concurrently and orphaned threads consume significant resources.
  • LLM API costs are high enough that reclaiming cancelled calls matters (would require upstream API cancellation support as well).
  • Agent runs involve long-running write operations that should be rolled back on cancel.

By submitting this PR, I confirm that I have read and agree to follow the project's Code of Conduct and Contributing Guidelines.

Signed-off-by: Andrew Musselman <akm@apache.org>
@rawkintrevo
Contributor

rawkintrevo commented Feb 8, 2026

Thanks for the contribution, @andrewmusselman.

This means that in-flight LLM API calls and database operations will finish even after cancellation. The orphaned thread runs to completion and then dies naturally. This is a Python limitation — threads cannot be forcibly terminated.

The description makes it sound like a request can't be cancelled, and this is untrue:

https://docs.litellm.ai/docs/response_api#cancel-a-response

The cleaner implementation would be to get a response ID for any query, save it, then fire the cancellation request, only falling back to orphaning if that doesn't work. I disagree with whoever said, 'meh, doesn't really matter anyway'.

@rawkintrevo
Contributor

This should also have tests with it.

@andrewmusselman
Collaborator Author

Okay, thanks. I'll see if we can get that implemented.

Development

Successfully merging this pull request may close these issues.

[Feature]: Add cancel button to sandbox page