
Commit 1af090b

Bump up version to v0.3.0 (#2656)
1 parent 3dad944 · commit 1af090b

File tree

README.md
docs/source/index.rst
vllm/__init__.py

3 files changed (+7, -3 lines)

README.md

Lines changed: 3 additions & 1 deletion
@@ -46,7 +46,7 @@ vLLM is fast with:
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629), FP8 KV Cache
 - Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -57,6 +57,8 @@ vLLM is flexible and easy to use with:
 - Streaming outputs
 - OpenAI-compatible API server
 - Support NVIDIA GPUs and AMD GPUs
+- (Experimental) Prefix caching support
+- (Experimental) Multi-lora support

 vLLM seamlessly supports many Hugging Face models, including the following architectures:
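
The quantization bullet added above now advertises an FP8 KV cache alongside GPTQ, AWQ, and SqueezeLLM. Below is a minimal, hedged sketch of how these options might be passed to the offline `LLM` entry point in v0.3.0; the argument names `quantization` and `kv_cache_dtype`, the `"fp8_e5m2"` value, and the model checkpoint are assumptions not taken from this commit, so check them against the released v0.3.0 docs.

```python
# Hypothetical offline-inference sketch for vLLM v0.3.0.
# The quantization / kv_cache_dtype argument names and the "fp8_e5m2" value
# are assumptions; verify against the v0.3.0 engine-argument docs before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # hypothetical AWQ-quantized checkpoint
    quantization="awq",               # weight quantization: "awq", "gptq", or "squeezellm"
    kv_cache_dtype="fp8_e5m2",        # assumed flag for the new FP8 KV cache
)

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```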

docs/source/index.rst

Lines changed: 3 additions & 1 deletion
@@ -31,7 +31,7 @@ vLLM is fast with:
 * Efficient management of attention key and value memory with **PagedAttention**
 * Continuous batching of incoming requests
 * Fast model execution with CUDA/HIP graph
-* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_
+* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
 * Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -42,6 +42,8 @@ vLLM is flexible and easy to use with:
 * Streaming outputs
 * OpenAI-compatible API server
 * Support NVIDIA GPUs and AMD GPUs
+* (Experimental) Prefix caching support
+* (Experimental) Multi-lora support

 For more information, check out the following:
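
Both changed files flag "(Experimental) Multi-lora support". The sketch below shows how a per-request LoRA adapter might be attached in this release; the `enable_lora` flag, the `LoRARequest` import path, and the adapter path are assumptions inferred from the feature name rather than from this commit.

```python
# Hypothetical multi-LoRA sketch for vLLM v0.3.0 (experimental feature).
# enable_lora, LoRARequest, and the adapter path are assumed; confirm against the docs.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest  # assumed import path

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Each request can carry its own adapter: a human-readable name, an integer id,
# and a local path to the LoRA weights.
sql_lora = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")  # hypothetical adapter

outputs = llm.generate(
    ["Write a SQL query that counts users by country."],
    SamplingParams(max_tokens=64),
    lora_request=sql_lora,
)
print(outputs[0].outputs[0].text)
```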

vllm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 from vllm.outputs import CompletionOutput, RequestOutput
 from vllm.sampling_params import SamplingParams

-__version__ = "0.2.7"
+__version__ = "0.3.0"

 __all__ = [
     "LLM",

0 commit comments
