docs/source/models/vlm.rst: 2 additions & 1 deletion

@@ -20,7 +20,8 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
 Currently, the support for vision language models on vLLM has the following limitations:
 
 * Only single image input is supported per text prompt.
-* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the huggingface implementation.
+* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.
+
 We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
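For reference, constructing a VLM engine with these arguments might look like the sketch below. Only ``image_input_shape`` is named in the diff above; the model name and the remaining vision arguments are assumptions about the vLLM API of this era and may differ in other versions.

.. code-block:: python

    from vllm import LLM

    # Rough sketch, not taken from the diff: the input image is resized to the
    # static ``image_input_shape``; the other arguments are assumed values.
    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",    # illustrative model
        image_input_type="pixel_values",     # assumed
        image_token_id=32000,                # assumed
        image_input_shape="1,3,336,336",     # static shape referenced above
        image_feature_size=576,              # assumed
    )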

docs/source/quantization/fp8.rst: 10 additions & 6 deletions

@@ -13,7 +13,7 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
 
 Quick Start with Online Dynamic Quantization
--------------------------------------
+--------------------------------------------
 
 Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.
 
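To make the quick-start paragraph concrete, here is a minimal sketch of online dynamic FP8 quantization (the model name is only an example):

.. code-block:: python

    from vllm import LLM

    # No calibration data is needed; weights are quantized to FP8 at load time
    # and activation scales are computed dynamically at runtime.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

    outputs = llm.generate("Once upon a time,")
    print(outputs[0].outputs[0].text)
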
@@ -173,30 +173,34 @@ Here we detail the structure for the FP8 checkpoints.
 
 The following is necessary to be present in the model's ``config.json``:
 
-.. code-block:: yaml
+.. code-block:: text
+
     "quantization_config": {
         "quant_method": "fp8",
         "activation_scheme": "static" or "dynamic"
-    },
+    }
 
 
 Each quantized layer in the state_dict will have these tensors:
 
-* If the config has `"activation_scheme": "static"`:
+* If the config has ``"activation_scheme": "static"``:
 
 .. code-block:: text
+
     model.layers.0.mlp.down_proj.weight < F8_E4M3
     model.layers.0.mlp.down_proj.input_scale < F32
     model.layers.0.mlp.down_proj.weight_scale < F32
 
-* If the config has `"activation_scheme": "dynamic"`:
+* If the config has ``"activation_scheme": "dynamic"``:
 
 .. code-block:: text
+
     model.layers.0.mlp.down_proj.weight < F8_E4M3
     model.layers.0.mlp.down_proj.weight_scale < F32
 
 
 Additionally, there can be `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
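To relate these tensors to the arithmetic they imply, here is a rough sketch; it is not vLLM's internal code and assumes a PyTorch build that provides ``torch.float8_e4m3fn`` (2.1 or newer):

.. code-block:: python

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # A checkpoint stores x / scale as F8_E4M3, with the F32 scale kept alongside.
        return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

    weight = torch.randn(4096, 4096)
    weight_scale = weight.abs().max() / FP8_MAX      # per-tensor F32 weight_scale
    weight_fp8 = quantize_fp8(weight, weight_scale)  # stored as ...weight < F8_E4M3

    activation = torch.randn(8, 4096)
    # "static": input_scale is fixed offline and shipped in the checkpoint.
    # "dynamic": no input_scale tensor; a scale is recomputed from each batch.
    dynamic_input_scale = activation.abs().max() / FP8_MAX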