
Conversation

@kingbri1

Mirrored from the commit message:

Aborting a generation is required if a user wants to decode requests sequentially. Otherwise, the second request segfaults because the first request has not finished yet.

Fortunately, llama.cpp already has a callback to check whether the user has aborted during token decoding. However, this is only honored by the GGML CPU and Metal backends; other backends such as CUDA are out of luck.

Therefore, add a backend-agnostic check that runs once per batch. This allows users to cancel their requests without having to wait for the entire prompt-processing operation to finish.

An example test: decode an 8000-token prompt with a batch size of 2048 and abort part-way through. The abort takes effect sooner because the check runs after every 2048-token batch instead of only after all 8000 tokens have been processed.

Temporarily solves #10509 and may be related to #6421
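
For illustration, here is a minimal, hypothetical sketch of the idea (not the actual patch): a decode loop that splits the prompt into batches and consults an abort callback before submitting each batch. The names `decode_prompt`, `decode_context`, and `abort_callback_t` are placeholders invented for this example; in llama.cpp the callback corresponds to `ggml_abort_callback`, which returns true to request an abort.

```cpp
// Simplified, hypothetical sketch of a per-batch abort check.
// The identifiers below are placeholders, not llama.cpp internals.
#include <algorithm>
#include <cstdint>
#include <vector>

typedef bool (*abort_callback_t)(void * data); // return true to abort

struct decode_context {
    abort_callback_t abort_callback      = nullptr;
    void *           abort_callback_data = nullptr;
};

// Splits the prompt into batches of n_batch tokens and checks the abort
// callback once per batch, instead of only after the whole prompt is done.
static int decode_prompt(decode_context & ctx,
                         const std::vector<int32_t> & tokens,
                         size_t n_batch) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        // Backend-agnostic escape hatch: bail out before submitting the next batch.
        if (ctx.abort_callback && ctx.abort_callback(ctx.abort_callback_data)) {
            return 2; // non-zero status tells the caller the decode was aborted
        }

        const size_t n = std::min(n_batch, tokens.size() - i);
        // ... build the next batch of `n` tokens and submit it to the backend ...
        (void) n;
    }
    return 0;
}
```

With a batch size of 2048 and an 8000-token prompt, the callback is consulted roughly four times during prompt processing instead of only once the full prompt has been processed.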

@ggerganov
Member

Aborting a generation is required if a user wants to decode requests sequentially. Otherwise there is a segfault for the second request because the first request is not done yet.

llama_decode is not thread-safe, so your application should not call it on the same llama_context in parallel.

Apart from that, some backends such as CUDA are asynchronous, so this change would not work the way you expect it to: all the processing will already have been submitted by the time the abort callback is called.

@kingbri1
Author

@ggerganov There seems to be a bit of a misunderstanding: I'm not trying to call llama_decode in parallel here; in fact, I want to avoid doing that. This PR assumes there is some sort of locking or queueing mechanism in the calling function/API.

Here's an example of a situation that inspired this PR:

  1. Client 1 sends an 8000-token prompt
  2. An internal lock is enabled, which makes any incoming requests wait
  3. Client 2 sends a 150-token request
  4. Client 1 cancels its 8000-token request during the processing step
  5. Client 2 now has to wait for client 1's request to be fully processed

Ideally, client 1's request should be cancelled during the processing step rather than at the generation step. That way, client 2 doesn't have to spend extra time and resources waiting on a request that is cancelled anyway.
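
A hedged sketch of the locking/queueing setup assumed here: one mutex serializes decode work, and each request carries an atomic cancellation flag that the abort callback reads. `request_state`, `engine`, and `abort_if_cancelled` are names invented for this illustration; the actual wiring to llama.cpp (e.g. via `llama_set_abort_callback`) is indicated only as a comment.

```cpp
// Hypothetical serving pattern: requests are serialized with a mutex, and a
// per-request atomic flag lets a client cancel while its prompt is still
// being processed. The abort callback points at that flag, so a decode loop
// that checks it per batch can bail out early and release the lock sooner.
#include <atomic>
#include <mutex>

struct request_state {
    std::atomic<bool> cancelled{false};
};

// Same shape as ggml's abort callback: return true to request an abort.
static bool abort_if_cancelled(void * data) {
    return static_cast<request_state *>(data)->cancelled.load();
}

class engine {
    std::mutex decode_mutex; // only one decode in flight at a time

public:
    // Called from the request handler thread.
    void handle_request(request_state & req /*, prompt, llama_context, ... */) {
        std::lock_guard<std::mutex> lock(decode_mutex);
        (void) req; // this stub does no real work with the request
        // e.g. llama_set_abort_callback(ctx, abort_if_cancelled, &req);
        // ... run prompt processing / generation here; an early abort
        // releases decode_mutex sooner, so the next queued request starts earlier ...
    }

    // Called when the client disconnects or cancels its request.
    void cancel(request_state & req) {
        req.cancelled.store(true);
    }
};
```

In the scenario above, client 1's cancel flips the flag, the per-batch check aborts the decode, the lock is released, and client 2's 150-token request can start right away.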

When I looked at possible bottlenecks, the loop `while (lctx.sbatch.n_tokens > 0)` took the longest to complete since it iterates per batch, so adding an escape hatch there seemed to make sense.

If there is a better way to solve this, I'd be happy to discuss it in the linked issue and close this PR, since what you said makes sense as well.

@slaren
Member

slaren commented Nov 29, 2024

I don't think there is a good way to implement the abort callback with CUDA; as far as I can tell, there isn't any way to cancel pending operations in CUDA.
