2 files changed (+4 −2 lines)

@@ -35,6 +35,7 @@ vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Continuous batching of incoming requests
+ - Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
- Optimized CUDA kernels
@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- - Support NVIDIA GPUs and AMD GPUs.
+ - Support NVIDIA GPUs and AMD GPUs

vLLM seamlessly supports many Hugging Face models, including the following architectures:

@@ -30,6 +30,7 @@ vLLM is fast with:
* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
+ * Fast model execution with CUDA/HIP graph
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
* Optimized CUDA kernels
@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
- * Support NVIDIA GPUs and AMD GPUs.
+ * Support NVIDIA GPUs and AMD GPUs

For more information, check out the following: