Commit 246598a

rkooo567 authored
[CI] docfix (#5410)
Co-authored-by: DarkLight1337 <[email protected]>
Co-authored-by: ywang96 <[email protected]>
1 parent 8bab495 commit 246598a

File tree

2 files changed: +12 -7 lines changed


docs/source/models/vlm.rst

Lines changed: 2 additions & 1 deletion
@@ -20,7 +20,8 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
 Currently, the support for vision language models on vLLM has the following limitations:
 
 * Only single image input is supported per text prompt.
-* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the huggingface implementation.
+* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means model output might not exactly match the HuggingFace implementation.
+
 We are continuously improving user & developer experience for VLMs. Please raise an issue on GitHub if you have any feedback or feature requests.
 
 Offline Batched Inference
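
The resize limitation above is easiest to see with a toy preprocessing sketch. This is not vLLM's actual preprocessing code; it only assumes Pillow and an illustrative static shape of ``1,3,336,336``:

.. code-block:: python

    # Illustrative only: force any input image to an assumed static shape,
    # mirroring the behaviour described by the ``image_input_shape`` note.
    from PIL import Image

    TARGET_H, TARGET_W = 336, 336  # assumed values, not taken from vLLM defaults

    def resize_to_static_shape(path: str) -> Image.Image:
        # Aspect ratio is not preserved, which is one reason outputs can
        # differ slightly from the HuggingFace implementation.
        return Image.open(path).convert("RGB").resize((TARGET_W, TARGET_H))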

docs/source/quantization/fp8.rst

Lines changed: 10 additions & 6 deletions
@@ -13,7 +13,7 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- ``inf``, and ``nan``. The tradeoff for the increased dynamic range is lower precision of the stored values.
 
 Quick Start with Online Dynamic Quantization
--------------------------------------
+--------------------------------------------
 
 Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying ``--quantization="fp8"`` in the command line or setting ``quantization="fp8"`` in the LLM constructor.
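
The quick-start path in this hunk amounts to a one-argument change when constructing the engine. A minimal offline sketch (the model name below is only an example; any supported BF16/FP16 checkpoint works the same way):

.. code-block:: python

    # Online dynamic FP8 quantization: weights are cast to FP8 at load time and
    # activation scales are computed on the fly, so no calibration data is needed.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
              quantization="fp8")

    params = SamplingParams(temperature=0.0, max_tokens=32)
    outputs = llm.generate(["San Francisco is a"], params)
    print(outputs[0].outputs[0].text)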

@@ -173,30 +173,34 @@ Here we detail the structure for the FP8 checkpoints.
 
 The following is necessary to be present in the model's ``config.json``:
 
-.. code-block:: yaml
+.. code-block:: text
+
     "quantization_config": {
         "quant_method": "fp8",
         "activation_scheme": "static" or "dynamic"
-    },
+    }
 
 
 Each quantized layer in the state_dict will have these tensors:
 
-* If the config has `"activation_scheme": "static"`:
+* If the config has ``"activation_scheme": "static"``:
 
 .. code-block:: text
+
     model.layers.0.mlp.down_proj.weight < F8_E4M3
     model.layers.0.mlp.down_proj.input_scale < F32
     model.layers.0.mlp.down_proj.weight_scale < F32
 
-* If the config has `"activation_scheme": "dynamic"`:
+* If the config has ``"activation_scheme": "dynamic"``:
 
 .. code-block:: text
+
     model.layers.0.mlp.down_proj.weight < F8_E4M3
     model.layers.0.mlp.down_proj.weight_scale < F32
 
 
 Additionally, there can be `FP8 kv-cache scaling factors <https://github.com/vllm-project/vllm/pull/4893>`_ contained within quantized checkpoints specified through the ``.kv_scale`` parameter present on the Attention Module, such as:
 
 .. code-block:: text
-    model.layers.0.self_attn.kv_scale < F32
+
+    model.layers.0.self_attn.kv_scale < F32
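
To complement the checkpoint layout documented in this hunk, here is a hedged inspection sketch. It assumes the checkpoint directory holds ``config.json`` plus a ``model.safetensors`` shard and that the installed PyTorch exposes ``torch.float8_e4m3fn``; file names are illustrative:

.. code-block:: python

    # Check the quantization_config and the per-layer weight/scale tensors
    # of an FP8 checkpoint on disk (paths are illustrative assumptions).
    import json
    from safetensors import safe_open

    with open("checkpoint/config.json") as f:
        cfg = json.load(f)["quantization_config"]
    print(cfg["quant_method"], cfg["activation_scheme"])  # e.g. "fp8", "static"

    with safe_open("checkpoint/model.safetensors", framework="pt") as sf:
        for name in sf.keys():
            if ".mlp.down_proj." in name or name.endswith(".kv_scale"):
                tensor = sf.get_tensor(name)
                # Weights should be torch.float8_e4m3fn; *_scale and kv_scale float32.
                print(f"{name:55s} {tensor.dtype}")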
