Commit 0fc280b

Update the adding-model doc according to the new refactor (#1692)
1 parent 20d0699 commit 0fc280b


docs/source/models/adding_model.rst

Lines changed: 13 additions & 10 deletions
@@ -62,31 +62,34 @@ Next, you need to rewrite the :code:`forward` methods of your model by following
 +) -> SamplerOutput:

 3. Update the code by considering that :code:`input_ids` and :code:`positions` are now flattened tensors.
-4. Replace the attention operation with either :code:`GPTPagedAttention` or :code:`GPTNeoXPagedAttention`, depending on the model's architecture.
+4. Replace the attention operation with either :code:`PagedAttention`, :code:`PagedAttentionWithRoPE`, or :code:`PagedAttentionWithALiBi`, depending on the model's architecture.

 .. note::
     Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
     If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
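
To make the two updated steps above concrete, here is a rough sketch (not part of this commit) of an attention module that receives a flattened :code:`positions` tensor and calls :code:`PagedAttentionWithRoPE`. The constructor and call signatures are assumptions based on vLLM's attention layers around this refactor; check them against :code:`vllm/model_executor/layers/attention.py` in your version.

.. code-block:: python

    # Hedged sketch: an attention module after the rewrite.
    # `input_ids` and `positions` are flattened 1-D tensors of shape
    # (num_tokens,); names and signatures below are assumptions to verify
    # against your vLLM checkout.
    from typing import Optional, Tuple

    import torch
    from torch import nn

    from vllm.model_executor.input_metadata import InputMetadata
    from vllm.model_executor.layers.attention import PagedAttentionWithRoPE

    KVCache = Tuple[torch.Tensor, torch.Tensor]


    class MyAttention(nn.Module):
        def __init__(self, num_heads: int, head_dim: int) -> None:
            super().__init__()
            # QKV and output projections are omitted here; their
            # tensor-parallel versions appear in the next section.
            self.attn = PagedAttentionWithRoPE(
                num_heads,
                head_dim,
                scale=head_dim**-0.5,
                rotary_dim=head_dim,
            )

        def forward(
            self,
            positions: torch.Tensor,  # flattened, shape (num_tokens,)
            q: torch.Tensor,
            k: torch.Tensor,
            v: torch.Tensor,
            kv_cache: KVCache,
            input_metadata: InputMetadata,
            cache_event: Optional[torch.cuda.Event],
        ) -> torch.Tensor:
            key_cache, value_cache = kv_cache
            # Rotary embeddings are applied inside this paged attention variant.
            return self.attn(positions, q, k, v, key_cache, value_cache,
                             input_metadata, cache_event)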

-3. (Optional) Implement tensor parallelism support
---------------------------------------------------
+3. (Optional) Implement tensor parallelism and quantization support
+-------------------------------------------------------------------

 If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
 To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
-For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`.
-When it comes to the linear layers, you should use either :code:`RowParallelLinear` or :code:`ColumnParallelLinear`.
-Typically, :code:`ColumnParallelLinear` is used for QKV linear layers and the first linear layers of the MLP blocks.
-For the remaining linear layers, :code:`RowParallelLinear` is used.
+For the embedding layer, you can simply replace :code:`nn.Embedding` with :code:`VocabParallelEmbedding`. For the output LM head, you can use :code:`ParallelLMHead`.
+When it comes to the linear layers, we provide the following options to parallelize them:

+* :code:`ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
+* :code:`RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
+* :code:`ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
+* :code:`MergedColumnParallelLinear`: Column-parallel linear that merges multiple :code:`ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
+* :code:`QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
+
+Note that all the linear layers above take :code:`linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
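
To illustrate how the layers listed above fit together, here is a sketch (not part of the commit) loosely following vLLM's Llama implementation after the refactor: :code:`MergedColumnParallelLinear` for the merged gate/up projection, :code:`RowParallelLinear` for the down and output projections, and :code:`QKVParallelLinear` for the fused QKV projection, each taking :code:`linear_method`. The exact constructor arguments are assumptions; verify them against :code:`vllm/model_executor/layers/linear.py`.

.. code-block:: python

    # Hedged sketch: tensor-parallel linear layers in an MLP and in the
    # attention projections. Constructor arguments are assumptions based on
    # vLLM's linear layers around this refactor.
    from typing import Optional

    import torch
    from torch import nn

    from vllm.model_executor.layers.activation import SiluAndMul
    from vllm.model_executor.layers.linear import (LinearMethodBase,
                                                   MergedColumnParallelLinear,
                                                   QKVParallelLinear,
                                                   RowParallelLinear)


    class MyMLP(nn.Module):
        def __init__(self, hidden_size: int, intermediate_size: int,
                     linear_method: Optional[LinearMethodBase] = None) -> None:
            super().__init__()
            # Gate and up projections merged into one column-parallel layer.
            self.gate_up_proj = MergedColumnParallelLinear(
                hidden_size, [intermediate_size] * 2, bias=False,
                linear_method=linear_method)
            # Down projection is row-parallel (all-reduce after the matmul).
            self.down_proj = RowParallelLinear(
                intermediate_size, hidden_size, bias=False,
                linear_method=linear_method)
            self.act_fn = SiluAndMul()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            gate_up, _ = self.gate_up_proj(x)  # layers return (output, bias)
            x = self.act_fn(gate_up)
            x, _ = self.down_proj(x)
            return x


    class MyAttentionProjections(nn.Module):
        """The parallel projections feeding the attention sketch above."""

        def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int,
                     head_dim: int,
                     linear_method: Optional[LinearMethodBase] = None) -> None:
            super().__init__()
            # Fused QKV projection; replicates KV heads when there are fewer
            # KV heads than the tensor-parallel world size.
            self.qkv_proj = QKVParallelLinear(
                hidden_size, head_dim, num_heads, num_kv_heads,
                bias=False, linear_method=linear_method)
            # Output projection is row-parallel.
            self.o_proj = RowParallelLinear(
                num_heads * head_dim, hidden_size,
                bias=False, linear_method=linear_method)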

 4. Implement the weight loading logic
 -------------------------------------

 You now need to implement the :code:`load_weights` method in your :code:`*ForCausalLM` class.
-This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model.
-While the process is straightforward for most layers, the tensor-parallel layers necessitate some additional care as their weights should be partitioned to multiple GPUs.
-
+This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. Specifically, for :code:`MergedColumnParallelLinear` and :code:`QKVParallelLinear` layers, if the original model has separate weight matrices, you need to load the different parts separately.
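
The new sentence above implies the shard-aware loading pattern used by vLLM's existing models around this refactor (e.g., Llama). The sketch below (not part of the commit) shows that pattern: checkpoint weights for the separate :code:`q_proj`/:code:`k_proj`/:code:`v_proj` and :code:`gate_proj`/:code:`up_proj` matrices are routed into the merged parameters through each parameter's :code:`weight_loader`. Helper names such as :code:`hf_model_weights_iterator` and :code:`default_weight_loader` are assumptions to check against :code:`vllm/model_executor/weight_utils.py`.

.. code-block:: python

    # Hedged sketch of load_weights for a model that uses QKVParallelLinear
    # and MergedColumnParallelLinear. Helper names and signatures are
    # assumptions to verify against your vLLM version.
    from typing import Optional

    from torch import nn

    from vllm.model_executor.weight_utils import (default_weight_loader,
                                                  hf_model_weights_iterator)


    class MyModelForCausalLM(nn.Module):
        # __init__ (defining qkv_proj, gate_up_proj, etc.) is omitted; see the
        # sketches above.

        def load_weights(self,
                         model_name_or_path: str,
                         cache_dir: Optional[str] = None,
                         load_format: str = "auto",
                         revision: Optional[str] = None) -> None:
            # (merged vLLM param, weight name in the HF checkpoint, shard id)
            stacked_params_mapping = [
                ("qkv_proj", "q_proj", "q"),
                ("qkv_proj", "k_proj", "k"),
                ("qkv_proj", "v_proj", "v"),
                ("gate_up_proj", "gate_proj", 0),
                ("gate_up_proj", "up_proj", 1),
            ]
            params_dict = dict(self.named_parameters())
            for name, loaded_weight in hf_model_weights_iterator(
                    model_name_or_path, cache_dir, load_format, revision):
                if "rotary_emb.inv_freq" in name:
                    # Rotary caches are recomputed by vLLM, not loaded.
                    continue
                for param_name, weight_name, shard_id in stacked_params_mapping:
                    if weight_name not in name:
                        continue
                    # Route the separate checkpoint matrix into its shard of
                    # the merged parameter via the layer's weight_loader.
                    param = params_dict[name.replace(weight_name, param_name)]
                    param.weight_loader(param, loaded_weight, shard_id)
                    break
                else:
                    # Everything else maps one-to-one.
                    param = params_dict[name]
                    weight_loader = getattr(param, "weight_loader",
                                            default_weight_loader)
                    weight_loader(param, loaded_weight)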

 5. Register your model
 ----------------------
