Commit 757fae5

hmellor authored and amd-xiaoyu12 committed
[Docs] Move quant supported hardware table to README (vllm-project#23663)
Signed-off-by: Harry Mellor <[email protected]>
Signed-off-by: Xiao Yu <[email protected]>
1 parent 8469fff commit 757fae5

File tree: 3 files changed, +48 -34 lines

- docs/features/quantization/README.md
- docs/features/quantization/bitblas.md
- docs/features/quantization/supported_hardware.md

docs/features/quantization/README.md

Lines changed: 47 additions & 1 deletion
@@ -4,7 +4,6 @@ Quantization trades off model precision for smaller memory footprint, allowing l
 
 Contents:
 
-- [Supported Hardware](supported_hardware.md)
 - [AutoAWQ](auto_awq.md)
 - [AutoRound](auto_round.md)
 - [BitsAndBytes](bnb.md)
@@ -19,3 +18,50 @@ Contents:
 - [AMD Quark](quark.md)
 - [Quantized KV Cache](quantized_kvcache.md)
 - [TorchAO](torchao.md)
+
+## Supported Hardware
+
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+
+<style>
+td:not(:first-child) {
+    text-align: center !important;
+}
+td {
+    padding: 0.5rem !important;
+    white-space: nowrap;
+}
+
+th {
+    padding: 0.5rem !important;
+    min-width: 0 !important;
+}
+
+th:not(:first-child) {
+    writing-mode: vertical-lr;
+    transform: rotate(180deg)
+}
+</style>
+
+| Implementation        | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU | AWS Neuron | Google TPU |
+|-----------------------|-------|--------|--------|-----|--------|---------|-----------|-------------|---------|------------|------------|
+| AWQ                   | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ❌          | ✅︎      | ❌         | ❌         |
+| GPTQ                  | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ✅︎        | ❌          | ✅︎      | ❌         | ❌         |
+| Marlin (GPTQ/AWQ/FP8) | ❌    | ❌     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| INT8 (W8A8)           | ❌    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ✅︎      | ✅︎         | ✅︎         |
+| FP8 (W8A8)            | ❌    | ❌     | ❌     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌          | ❌      | ✅︎         | ❌         |
+| BitBLAS               | ✅︎    | ❌     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| BitBLAS (GPTQ)        | ❌    | ❌     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| bitsandbytes          | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| DeepSpeedFP           | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ❌      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| GGUF                  | ✅︎    | ✅︎     | ✅︎     | ✅︎  | ✅︎     | ✅︎      | ❌        | ❌          | ❌      | ❌         | ❌         |
+| INC (W8A8)            | ❌    | ❌     | ❌     | ❌  | ❌     | ❌      | ❌        | ✅︎          | ❌      | ❌         | ❌         |
+
+- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
+- ✅︎ indicates that the quantization method is supported on the specified hardware.
+- ❌ indicates that the quantization method is not supported on the specified hardware.
+
+!!! note
+    This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+
+    For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
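The SM numbers in the bullet list above map directly onto what `torch.cuda.get_device_capability()` reports, so a reader can find their table column programmatically. The sketch below is illustrative only and not part of this commit or of vLLM; the `ARCH_BY_CAPABILITY` dict is a hypothetical helper, and PyTorch with CUDA is assumed to be installed.

```python
# Illustrative only (not part of this commit or of vLLM): map the local
# GPU's compute capability to an architecture column of the table above.
# ARCH_BY_CAPABILITY is a hypothetical helper; assumes PyTorch is installed.
import torch

ARCH_BY_CAPABILITY = {
    (7, 0): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada",
    (9, 0): "Hopper",
}

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    arch = ARCH_BY_CAPABILITY.get(cap, f"unknown (SM {cap[0]}.{cap[1]})")
    print(f"{torch.cuda.get_device_name()}: table column '{arch}'")
else:
    print("No CUDA GPU detected; see the CPU/Gaudi/Neuron/TPU columns.")
```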

docs/features/quantization/bitblas.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic
 !!! note
     Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
     Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
-    For details see [supported hardware](supported_hardware.md).
+    For details see [supported hardware](README.md#supported-hardware).
 
 Below are the steps to utilize BitBLAS with vLLM.
 
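Since the note in this file ties the BitBLAS `dtype` choice to GPU generation, a minimal sketch of that check may help. This is not from the commit: `pick_dtype` is a hypothetical helper, not vLLM API, and PyTorch with CUDA is assumed.

```python
# Minimal sketch, not from this commit: choose a BitBLAS-friendly dtype
# based on what the current GPU supports. pick_dtype is a hypothetical
# helper; assumes PyTorch is installed.
import torch

def pick_dtype() -> torch.dtype:
    # bfloat16 is broadly available from Ampere (SM 8.0) onward;
    # fall back to float16 on older GPUs.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(f"Selected dtype: {pick_dtype()}")
```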

docs/features/quantization/supported_hardware.md

Lines changed: 0 additions & 32 deletions
This file was deleted.
