docs/source/models/adding_model.rst (3 additions, 4 deletions)
@@ -58,11 +58,10 @@ Next, you need to rewrite the :code:`forward` methods of your model by following
 +    positions: torch.Tensor,
 +    kv_caches: List[KVCache],
 +    input_metadata: InputMetadata,
-+    cache_events: Optional[List[torch.cuda.Event]],
-+) -> SamplerOutput:
++) -> Optional[SamplerOutput]:
 
-3. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
-4. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
+1. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
+2. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi` depending on the model's architecture.
 
 .. note::
     Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
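As context for the hunk above, the sketch below shows roughly what a model-level :code:`forward` looks like with the updated signature and with :code:`input_ids` / :code:`positions` passed as flattened tensors (step 1). The class name, :code:`embed_tokens`, and the :code:`KVCache` alias are illustrative assumptions, not taken from the patched file; the actual definitions in the vLLM source are authoritative.

.. code-block:: python

    # Illustrative sketch only -- class and attribute names here are assumptions.
    from typing import List, Optional, Tuple

    import torch
    import torch.nn as nn

    # Assumption: a per-layer KV cache is a (key_cache, value_cache) pair of tensors.
    KVCache = Tuple[torch.Tensor, torch.Tensor]


    class YourModelForCausalLM(nn.Module):  # hypothetical model class
        def __init__(self, vocab_size: int, hidden_size: int) -> None:
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

        def forward(
            self,
            input_ids: torch.Tensor,    # flattened: shape [num_tokens], not [batch, seq_len]
            positions: torch.Tensor,    # flattened positions, aligned with input_ids
            kv_caches: List[KVCache],   # one (key, value) cache pair per decoder layer
            input_metadata: "InputMetadata",
        ) -> Optional["SamplerOutput"]:
            # input_ids and positions are 1-D across all sequences in the batch, so
            # per-sequence reshaping and attention-mask handling from the original
            # Hugging Face implementation should be removed here.
            hidden_states = self.embed_tokens(input_ids)
            # ... decoder layers would consume (positions, hidden_states, kv_caches,
            # input_metadata), and a sampler would produce the SamplerOutput ...
            ...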
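To illustrate step 2 of the hunk above, the sketch below replaces a standard Hugging Face attention block with :code:`PagedAttentionWithRoPE`. The import path and constructor arguments are assumptions made for illustration; check the attention layer definitions in the vLLM code base, and use :code:`PagedAttention` or :code:`PagedAttentionWithALiBi` instead for models without rotary embeddings.

.. code-block:: python

    # Illustrative sketch only -- the import path and constructor arguments below
    # are assumptions; the vLLM attention layer module is authoritative.
    import torch
    import torch.nn as nn

    from vllm.model_executor.layers.attention import PagedAttentionWithRoPE  # assumed path


    class YourModelAttention(nn.Module):  # hypothetical attention block
        def __init__(self, hidden_size: int, num_heads: int, rotary_dim: int) -> None:
            super().__init__()
            head_dim = hidden_size // num_heads
            self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
            self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
            # The original Hugging Face attention (explicit mask + softmax) is replaced
            # by vLLM's paged attention, which reads and writes the paged KV cache.
            self.attn = PagedAttentionWithRoPE(
                num_heads=num_heads,
                head_size=head_dim,
                scale=head_dim ** -0.5,
                rotary_dim=rotary_dim,  # argument names are assumptions, not verified
            )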