Hello, I often find myself testing models, or sending requests to models that take a long time to reason and answer, and sometimes I no longer need the answer. Since llama.cpp processes requests sequentially unless you explicitly enable parallelism (which consumes much more VRAM), to send a subsequent message I have to either:

- Force-kill the server (losing the conversation cache and having to reload the model into RAM/VRAM), or
- Wait for the model to finish responding (which may take a long time).
Therefore, in my fork of llama-swap I added the ability to interrupt running requests. The interrupt uses an AbortController so that the proxy to llama.cpp notices the stop request and terminates the response as soon as possible.
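For context, here is a minimal TypeScript sketch of the AbortController pattern the interrupt relies on. This is not the actual fork code; the endpoint URL, port, and request body are placeholder assumptions. The idea is simply that calling `abort()` drops the in-flight connection to the upstream server, so the upstream side can notice and stop early instead of running to completion.

```ts
// Minimal sketch (not the fork's implementation): cancel an in-flight
// streaming request with an AbortController. Requires Node 18+ for
// global fetch. The URL and request body below are placeholders.

const upstream = "http://127.0.0.1:8080/v1/chat/completions"; // assumed llama.cpp address

async function streamCompletion(signal: AbortSignal): Promise<void> {
  const res = await fetch(upstream, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: "Explain quantum tunnelling." }],
      stream: true,
    }),
    signal, // aborting this signal tears down the upstream connection
  });

  // Read the streamed response until it finishes or the signal aborts.
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    process.stdout.write(decoder.decode(value));
  }
}

// Usage: start a request, then interrupt it after 5 seconds instead of
// waiting for the model to finish.
const controller = new AbortController();
streamCompletion(controller.signal).catch((err) => {
  if ((err as Error).name === "AbortError") {
    console.log("\nRequest aborted before completion.");
  } else {
    throw err;
  }
});
setTimeout(() => controller.abort(), 5_000);
```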
Would it be useful to open a pull request to merge this feature, or is it outside the scope of this project?