2 files changed, +4 −2 lines changed

@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels

@@ -45,7 +46,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA GPUs and AMD GPUs.
+- Support NVIDIA GPUs and AMD GPUs

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
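The **PagedAttention** bullet in the hunk above refers to managing the attention KV cache in fixed-size blocks addressed through a per-sequence block table, instead of one contiguous allocation per sequence. A minimal pure-Python sketch of that bookkeeping (the block size, class, and method names here are illustrative, not vLLM's actual internals):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


class BlockTable:
    """Maps a sequence's logical token positions to physical cache blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # physical blocks owned by this sequence
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence are ever wasted.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, logical_pos):
        # Translate a logical token position into (physical block, offset).
        return self.blocks[logical_pos // BLOCK_SIZE], logical_pos % BLOCK_SIZE


free = list(range(100))      # pool of 100 physical blocks
seq = BlockTable(free)
for _ in range(40):          # generate 40 tokens for one sequence
    seq.append_token()

print(len(seq.blocks))       # 40 tokens fit in ceil(40 / 16) = 3 blocks
```

Because blocks are allocated on demand from a shared pool, memory for a sequence grows with its actual length, which is what makes the continuous batching of many variable-length requests memory-efficient.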
@@ -30,6 +30,7 @@ vLLM is fast with:
 * State-of-the-art serving throughput
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
+* Fast model execution with CUDA/HIP graph
 * Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
 * Optimized CUDA kernels

@@ -40,7 +41,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
-* Support NVIDIA GPUs and AMD GPUs.
+* Support NVIDIA GPUs and AMD GPUs

 For more information, check out the following:
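The quantization bullet (GPTQ, AWQ, SqueezeLLM) covers schemes that store weights in roughly 4 bits with a shared scale per small group of weights. A toy round-trip of symmetric 4-bit group quantization in pure Python (group contents and helper names are illustrative; the real methods add calibration, activation-aware scaling, and other refinements):

```python
def quantize_group(weights, n_bits=4):
    """Symmetric quantization of one weight group to signed n-bit integers."""
    qmax = 2 ** (n_bits - 1) - 1                # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate float weights from integers and the group scale."""
    return [qi * scale for qi in q]


group = [0.12, -0.53, 0.31, 0.97, -0.08, 0.44, -0.77, 0.05]
q, scale = quantize_group(group)
recovered = dequantize_group(q, scale)
max_err = max(abs(a - b) for a, b in zip(group, recovered))
print(max_err <= scale / 2)  # rounding error is bounded by half a quantization step
```

Storing 4-bit integers plus one scale per group shrinks weight memory roughly 4x versus fp16, which is why these schemes pair well with the KV-cache savings listed above.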