
Feature request: Possibility of interrupting requests natively in llama.cpp #314

@DanielusG

Description

Hello, I often find myself testing models or sending requests to models that take a long time to reason and answer, and by then I may no longer need the answer. Since llama.cpp processes requests sequentially unless you explicitly enable parallelism (which consumes much more VRAM), to send a subsequent message I have to either:

  • Force-kill the server (losing the conversation cache and having to reload the model into RAM/VRAM), or
  • Wait for the model to finish responding (which may take a long time)

Therefore, in my llama-swap fork I added the ability to interrupt running requests. The interrupt works by using an AbortController, so that the proxy notices the stop request and terminates the response as soon as possible (see the sketch below).
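
Conceptually, the client side looks roughly like this. This is a minimal sketch of the mechanism, not the exact code from my fork: the port, endpoint, and model name are placeholders, and it assumes Node 18+ for the built-in fetch. The key point is that wiring `controller.signal` into the request means `controller.abort()` closes the HTTP connection, which is what lets the proxy notice the stop and cancel the upstream generation:

```ts
// Minimal sketch, not the actual llama-swap code. Port, endpoint, and model
// name are placeholders; the payload follows the OpenAI-compatible chat API
// that the llama.cpp server exposes.
const controller = new AbortController();

async function streamCompletion(prompt: string): Promise<void> {
  try {
    const response = await fetch("http://localhost:8080/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "placeholder-model",
        stream: true,
        messages: [{ role: "user", content: prompt }],
      }),
      // Tying the request to the AbortController: calling abort() closes the
      // connection, which is what the proxy can observe.
      signal: controller.signal,
    });

    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      process.stdout.write(decoder.decode(value));
    }
  } catch (err) {
    // Aborting mid-stream surfaces as an AbortError; anything else is a real failure.
    if ((err as Error).name !== "AbortError") throw err;
    console.log("\n[request interrupted]");
  }
}

streamCompletion("Write a long story").catch(console.error);
// Simulate the user pressing a "Stop" button after two seconds.
setTimeout(() => controller.abort(), 2000);
```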

Would it be useful to open a pull request to merge this feature, or is it outside the scope of this project?

Some images of the actual implementation:

[screenshot]
