2 files changed: +4 -2 lines changed

First file:

@@ -35,6 +35,7 @@ vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
+ - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
- Optimized CUDA kernels

vLLM is flexible and easy to use with:
@@ -44,7 +45,7 @@ vLLM is flexible and easy to use with:
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- - Support NVIDIA CUDA and AMD ROCm.
+ - Support NVIDIA GPUs and AMD GPUs.

vLLM seamlessly supports many Hugging Face models, including the following architectures:
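The quantization methods added to the throughput list above can be exercised through vLLM's offline `LLM` entry point. Below is a minimal sketch, not part of the change itself: the model name is purely illustrative, and it assumes the `quantization` argument is set to the backend that matches how the checkpoint was produced (`"awq"`, `"gptq"`, or `"squeezellm"`).

```python
# Minimal sketch: offline inference with a quantized checkpoint in vLLM.
# The model name below is illustrative only and not part of this change.
from vllm import LLM, SamplingParams

# Pick the quantization backend matching the checkpoint format
# ("awq", "gptq", or "squeezellm").
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

Loading weights in a quantized format shrinks their memory footprint, which is why quantization sits alongside PagedAttention and continuous batching in the performance-oriented list.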
Second file:

@@ -30,6 +30,7 @@ vLLM is fast with:
* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
+ * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
* Optimized CUDA kernels

vLLM is flexible and easy to use with:
@@ -39,7 +40,7 @@ vLLM is flexible and easy to use with:
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
- * Support NVIDIA CUDA and AMD ROCm.
+ * Support NVIDIA GPUs and AMD GPUs.

For more information, check out the following:
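The flexibility items touched by the second hunk of each file (tensor parallelism, streaming, the OpenAI-compatible server) use the same Python API. A rough sketch of distributed inference, assuming two visible GPUs and an illustrative model name:

```python
# Sketch: offline inference sharded across GPUs with tensor parallelism.
# Assumes two GPUs are visible to the process; the model name is
# illustrative only and not taken from this change.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # hypothetical checkpoint
    tensor_parallel_size=2,             # shard weights across 2 GPUs
)

params = SamplingParams(temperature=0.0, max_tokens=32)
for out in llm.generate(["Summarize what vLLM does."], params):
    print(out.outputs[0].text)
```

For online serving, the same engine was typically exposed at the time of this change through the OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server --model <model>`), which also provides the streaming outputs mentioned in the list.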