
Conversation

@WoosukKwon (Collaborator) commented Nov 21, 2025

This PR updates the input-preparation and bookkeeping logic for the V2 model runner. The primary goal is to shift as much work as possible onto the GPU, laying the groundwork for async scheduling + speculative decoding.

Key changes

  • num_computed_tokens is now GPU-only.
    The CPU no longer tracks this value. Instead, it maintains num_computed_prefill_tokens, which tracks the progress of chunked prefills.

  • Prefill and decode input preparation are now decoupled.

    • Prefill: inputs and their statuses (e.g., num_computed_prefill_tokens) are read from NumPy arrays.
    • Decode: uses last_sampled_tokens and num_computed_tokens directly from the GPU.
  • Preparation for speculative decoding.
    We now maintain num_sampled_tokens as a GPU tensor for upcoming spec-decode support (where the number of accepted tokens is dynamic); see the sketch after this list.
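
To make the GPU/CPU split concrete, here is a minimal sketch of how the bookkeeping described above could be laid out. Only the names num_computed_tokens, num_computed_prefill_tokens, last_sampled_tokens, num_sampled_tokens, and req_states come from this PR; the class shape, function names, and signatures are hypothetical illustrations, not the actual model-runner code.

import numpy as np
import torch


class RequestStates:
    """Per-request bookkeeping split between GPU and CPU (sketch)."""

    def __init__(self, max_num_reqs: int, device: torch.device):
        # GPU-only: updated in place on device each step and never read
        # back to the CPU, so the CPU can schedule the next step before
        # the GPU finishes the current one (async scheduling).
        self.num_computed_tokens = torch.zeros(
            max_num_reqs, dtype=torch.int32, device=device)
        # Last sampled token per request, consumed directly by the next
        # decode step without a device-to-host copy.
        self.last_sampled_tokens = torch.zeros(
            max_num_reqs, dtype=torch.int64, device=device)
        # Tokens sampled in the previous step. Constant 1 today; with
        # speculative decoding it becomes 1 + number of accepted drafts.
        self.num_sampled_tokens = torch.ones(
            max_num_reqs, dtype=torch.int32, device=device)
        # CPU-side: prefill progress only. Prompt lengths are known up
        # front, so chunked-prefill progress is deterministic on the CPU.
        self.num_computed_prefill_tokens = np.zeros(
            max_num_reqs, dtype=np.int32)


def prepare_decode_inputs(
    states: RequestStates,
    idx_mapping: torch.Tensor,  # GPU tensor of request indices
) -> tuple[torch.Tensor, torch.Tensor]:
    # Decode path: gather input ids and positions entirely on the GPU.
    input_ids = states.last_sampled_tokens[idx_mapping]
    positions = states.num_computed_tokens[idx_mapping].to(torch.int64)
    return input_ids, positions


def prepare_prefill_inputs(
    states: RequestStates,
    idx_mapping_np: np.ndarray,       # CPU array of request indices
    num_scheduled_tokens: np.ndarray  # chunk sizes for this step
) -> np.ndarray:
    # Prefill path: read chunk offsets from NumPy arrays and advance them.
    starts = states.num_computed_prefill_tokens[idx_mapping_np]
    states.num_computed_prefill_tokens[idx_mapping_np] = (
        starts + num_scheduled_tokens)
    return starts  # per-request offset of the chunk to embed this step

The point of the split is that the decode path never forces a device-to-host sync, while the prefill path stays on cheap CPU arrays because prompt lengths are static.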

Signed-off-by: Woosuk Kwon <[email protected]>
Comment on lines +473 to +480
# Get num_computed_tokens.
# HACK(woosuk): Here, we use num_computed_tokens on GPU instead of
# num_computed_tokens_cpu. This works for most cases.
num_computed_tokens = self.req_states.num_computed_tokens[idx_mapping]
# HACK(woosuk): Only GPU has the exact seq_lens because at this point
# CPU does not know how many draft tokens are accepted/rejected in the
# previous step. Therefore, we use max_model_len to be safe.
seq_lens_np = np.full(num_reqs, self.max_model_len, dtype=np.int32)
@WoosukKwon (Collaborator, Author) commented:
@LucasWilkinson This is my current hack, which is totally undesirable. I plan to use a tighter upper bound for seq_lens_np, but I'd like to keep this hack as-is if the refactoring is likely to happen in the near future.
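
For reference, one possible tighter bound, sketched under two assumptions: the CPU-side count (num_computed_tokens_cpu, mentioned in the snippet above) lags the GPU by at most one step, and a step accepts at most num_spec_tokens drafts plus one bonus token. All names other than seq_lens_np, max_model_len, and num_computed_tokens_cpu are hypothetical.

import numpy as np

def upper_bound_seq_lens(num_computed_tokens_cpu: np.ndarray,
                         num_scheduled_tokens: np.ndarray,
                         num_spec_tokens: int,
                         max_model_len: int) -> np.ndarray:
    # Worst case (assumed): every draft from the previous step was
    # accepted, plus the bonus token, plus everything scheduled now.
    bound = (num_computed_tokens_cpu
             + (num_spec_tokens + 1)
             + num_scheduled_tokens)
    # Never exceed the model's context window.
    return np.minimum(bound, max_model_len).astype(np.int32)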
