
Commit 6273e89

Add last_token_pos in llama_transformer (pytorch#11793)
Summary: Add last_valid_token_pos to the forward options. Purpose:
* The final norm and the lm-head output can be computed on the last valid token at prefill.
* If the input sequence length is fixed because the accelerator does not support dynamic shapes, the last position of the input is not guaranteed to hold a valid token.
* An additional pointer is therefore needed to select the last valid token when computing the final norm and output.

Reviewed By: JacobSzwejbka
Differential Revision: D76440105
1 parent 851b29b commit 6273e89
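
To make the motivation concrete, here is a minimal, self-contained sketch (not part of the commit) of the padded, fixed-length prefill case: indexing the last position of the input picks a padding slot, while indexing the last valid token position recovers the hidden state that the final norm and lm-head should actually consume. The sequence length, padding layout, and tensor values below are assumptions chosen for illustration.

```python
import torch

# Hidden states after the last transformer block for one prefill call.
# The accelerator requires a fixed sequence length, so the prompt is padded.
batch, max_seq_len, dim = 1, 8, 4
h = torch.randn(batch, max_seq_len, dim)

# Assume only positions 0..4 hold real prompt tokens; positions 5..7 are padding.
last_valid_token_pos = torch.tensor(4)  # 0-dim LongTensor: index of the last real token

h_fixed_last = h[:, -1, :]                    # hidden state at a padding position
h_valid_last = h[:, last_valid_token_pos, :]  # hidden state at the last real token

# Both have shape (batch, dim); only the second is meaningful input to the lm-head.
print(h_fixed_last.shape, h_valid_last.shape)
```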

File tree: 2 files changed, +5 -1 lines changed

examples/models/llama/attention.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ class ForwardOptions(TypedDict, total=False):
     freqs_sin_override: Optional[torch.Tensor]
     in_cache_state: Optional[Any]
     out_cache_state: Optional[Any]
+    last_valid_token_pos: Optional[torch.LongTensor]


 class Attention(nn.Module, ABC):
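
ForwardOptions is a TypedDict declared with total=False, so callers only set the keys they actually use. The sketch below is a hypothetical illustration of populating the new key; it redeclares a ForwardOptions with just the fields visible in this hunk instead of importing the real class from executorch.

```python
from typing import Any, Optional, TypedDict

import torch


class ForwardOptions(TypedDict, total=False):
    # Only the fields visible in the diff hunk above are reproduced here.
    freqs_sin_override: Optional[torch.Tensor]
    in_cache_state: Optional[Any]
    out_cache_state: Optional[Any]
    last_valid_token_pos: Optional[torch.LongTensor]


# total=False: unset keys are simply absent, so existing callers keep working.
options: ForwardOptions = {"last_valid_token_pos": torch.tensor(4)}
print(options.get("last_valid_token_pos", None))
```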

examples/models/llama/llama_transformer.py

Lines changed: 4 additions & 1 deletion
@@ -204,7 +204,10 @@ def forward(

         if not self.generate_full_logits:
             # Only the last logit is used for the new generated token
-            h = h[:, -1, :]
+            if attn_options.get("last_valid_token_pos", None):
+                h = h[:, attn_options.get("last_valid_token_pos"), :]
+            else:
+                h = h[:, -1, :]

         h = self.norm(h)
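
For reference, a standalone sketch that mirrors the selection logic above with plain tensors; `attn_options` here is an ordinary dict standing in for ForwardOptions, and the shapes are made up for illustration.

```python
import torch

h = torch.randn(1, 8, 16)  # (batch, fixed sequence length, hidden dim)
attn_options = {"last_valid_token_pos": torch.tensor(4)}

if attn_options.get("last_valid_token_pos", None):
    # Pick the hidden state at the caller-supplied last valid position.
    # Note: a 0-dim tensor is truthy only when its value is non-zero, so a
    # position of 0 would take the fallback branch below.
    h = h[:, attn_options.get("last_valid_token_pos"), :]
else:
    # Fallback: the last position, as before the change.
    h = h[:, -1, :]

print(h.shape)  # torch.Size([1, 16])
```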

0 commit comments
