
Conversation

@kingbri1

Mirrored from the commit message:

Aborting a generation is required if a user wants to decode requests sequentially. Otherwise, the second request segfaults because the first request has not finished yet.

Fortunately, llama.cpp already has a callback to check whether the user has aborted during token decoding. However, this is only honored by the GGML CPU and Metal backends; other backends such as CUDA are out of luck.

Therefore, add a backend-agnostic check that runs once per batch. This allows users to cancel their requests without having to wait for the entire prompt-processing operation to finish.

An example test: decode an 8000-token prompt with a batch size of 2048 and abort part-way through. The abort takes effect sooner because the check runs after every 2048-token batch instead of only after all 8000 tokens have been processed.

Temporarily solves #10509 and may be related to #6421
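
For illustration, here is a minimal, hypothetical sketch of the idea (not the actual patch): a decode loop that splits the prompt into batches and consults an abort callback before submitting each batch. The names `decode_prompt`, `decode_context`, and `abort_callback_t` are placeholders invented for this example; in llama.cpp the callback corresponds to `ggml_abort_callback`, which returns true to request an abort.

```cpp
// Simplified, hypothetical sketch of a per-batch abort check.
// The identifiers below are placeholders, not llama.cpp internals.
#include <algorithm>
#include <cstdint>
#include <vector>

typedef bool (*abort_callback_t)(void * data); // return true to abort

struct decode_context {
    abort_callback_t abort_callback      = nullptr;
    void *           abort_callback_data = nullptr;
};

// Splits the prompt into batches of n_batch tokens and checks the abort
// callback once per batch, instead of only after the whole prompt is done.
static int decode_prompt(decode_context & ctx,
                         const std::vector<int32_t> & tokens,
                         size_t n_batch) {
    for (size_t i = 0; i < tokens.size(); i += n_batch) {
        // Backend-agnostic escape hatch: bail out before submitting the next batch.
        if (ctx.abort_callback && ctx.abort_callback(ctx.abort_callback_data)) {
            return 2; // non-zero status tells the caller the decode was aborted
        }

        const size_t n = std::min(n_batch, tokens.size() - i);
        // ... build the next batch of `n` tokens and submit it to the backend ...
        (void) n;
    }
    return 0;
}
```

With a batch size of 2048 and an 8000-token prompt, the callback is consulted roughly four times during prompt processing instead of only once the full prompt has been processed.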

@ggerganov
Member

Aborting a generation is required if a user wants to decode requests sequentially. Otherwise there is a segfault for the second request because the first request is not done yet.

llama_decode is not thread-safe, so your application should not call it on the same llama_context in parallel.

Apart from that, some backends such as CUDA are asynchronous, so this change would not work the way you expect it to: all the processing will already have been submitted by the time the abort callback is called.

@kingbri1
Author

@ggerganov There seems to be a bit of a misunderstanding: I'm not trying to call llama_decode in parallel here; in fact, I want to avoid doing that. This PR assumes there is some sort of locking or queueing mechanism in the calling function/API.

Here's an example of a situation that inspired this PR:

  1. Client 1 sends an 8000-token prompt
  2. An internal lock is enabled, which makes any incoming requests wait
  3. Client 2 sends a 150-token request
  4. Client 1 cancels its 8000-token request during the processing step
  5. Client 2 now has to wait for client 1's request to be fully processed

Ideally, client 1's request should be cancelled during the processing step rather than at the generation step. That way, client 2 doesn't have to spend extra time and resources waiting on a request that is cancelled anyway.
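
A hedged sketch of the locking/queueing setup assumed here: one mutex serializes decode work, and each request carries an atomic cancellation flag that the abort callback reads. `request_state`, `engine`, and `abort_if_cancelled` are names invented for this illustration; the actual wiring to llama.cpp (e.g. via `llama_set_abort_callback`) is indicated only as a comment.

```cpp
// Hypothetical serving pattern: requests are serialized with a mutex, and a
// per-request atomic flag lets a client cancel while its prompt is still
// being processed. The abort callback points at that flag, so a decode loop
// that checks it per batch can bail out early and release the lock sooner.
#include <atomic>
#include <mutex>

struct request_state {
    std::atomic<bool> cancelled{false};
};

// Same shape as ggml's abort callback: return true to request an abort.
static bool abort_if_cancelled(void * data) {
    return static_cast<request_state *>(data)->cancelled.load();
}

class engine {
    std::mutex decode_mutex; // only one decode in flight at a time

public:
    // Called from the request handler thread.
    void handle_request(request_state & req /*, prompt, llama_context, ... */) {
        std::lock_guard<std::mutex> lock(decode_mutex);
        (void) req; // this stub does no real work with the request
        // e.g. llama_set_abort_callback(ctx, abort_if_cancelled, &req);
        // ... run prompt processing / generation here; an early abort
        // releases decode_mutex sooner, so the next queued request starts earlier ...
    }

    // Called when the client disconnects or cancels its request.
    void cancel(request_state & req) {
        req.cancelled.store(true);
    }
};
```

In the scenario above, client 1's cancel flips the flag, the per-batch check aborts the decode, the lock is released, and client 2's 150-token request can start right away.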

When I looked at possible bottlenecks, the loop `while (lctx.sbatch.n_tokens > 0)` took the longest to complete since it iterates per batch, so adding an escape hatch there seemed to make sense.

If there is a better way to solve this, I'd be happy to discuss it in the linked issue and close this PR, since what you said makes sense as well.

@slaren
Member

slaren commented Nov 29, 2024

I don't think there is a good way to implement the abort callback with CUDA; as far as I can tell, there isn't any way to cancel pending operations in CUDA.
