
Feature request: Possibility of interrupting requests natively in llama.cpp #314

@DanielusG

Description

Hello, I often find myself testing models or sending requests to models that take a long time to reason and answer, and by then I may no longer need the answer. Since llama.cpp processes requests sequentially unless you explicitly enable parallelism (which consumes much more VRAM), to send a subsequent message I have to either:

  • Force-kill the server (losing the conversation cache and having to reload the model into RAM/VRAM), or
  • Wait for the model to finish responding (which may take a long time)

Therefore, in my llama-swap fork I added the ability to interrupt running requests. The interrupt works by using an AbortController, so that the proxy notices the stop request and terminates the response as soon as possible (see the sketch below).
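
Conceptually, the client side looks roughly like this. This is a minimal sketch of the mechanism, not the exact code from my fork: the port, endpoint, and model name are placeholders, and it assumes Node 18+ for the built-in fetch. The key point is that wiring `controller.signal` into the request means `controller.abort()` closes the HTTP connection, which is what lets the proxy notice the stop and cancel the upstream generation:

```ts
// Minimal sketch, not the actual llama-swap code. Port, endpoint, and model
// name are placeholders; the payload follows the OpenAI-compatible chat API
// that the llama.cpp server exposes.
const controller = new AbortController();

async function streamCompletion(prompt: string): Promise<void> {
  try {
    const response = await fetch("http://localhost:8080/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "placeholder-model",
        stream: true,
        messages: [{ role: "user", content: prompt }],
      }),
      // Tying the request to the AbortController: calling abort() closes the
      // connection, which is what the proxy can observe.
      signal: controller.signal,
    });

    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      process.stdout.write(decoder.decode(value));
    }
  } catch (err) {
    // Aborting mid-stream surfaces as an AbortError; anything else is a real failure.
    if ((err as Error).name !== "AbortError") throw err;
    console.log("\n[request interrupted]");
  }
}

streamCompletion("Write a long story").catch(console.error);
// Simulate the user pressing a "Stop" button after two seconds.
setTimeout(() => controller.abort(), 2000);
```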

Would it be useful to open a pull request to merge this feature, or is it outside the scope of this project?

Some images of the actual implementation:

[screenshot]
