Commit 2fd12ed

Remove unused warning and Add README
1 parent 2584797 commit 2fd12ed

4 files changed

+191
-144
lines changed

README.md

Lines changed: 26 additions & 119 deletions
@@ -1,132 +1,39 @@
1-
<p align="center">
2-
<picture>
3-
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
4-
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
5-
</picture>
6-
</p>
1+
## HyperDex-vLLM
72

8-
<h3 align="center">
9-
Easy, fast, and cheap LLM serving for everyone
10-
</h3>
3+
HyperDex supports running the vLLM framework on the LPU (LLM Processing Unit). vLLM officially supports a variety of hardware, including GPU, TPU, and XPU. HyperDex maintains its own branch of vLLM with a backend designed specifically for the LPU, which makes it easy to use. If your system already uses vLLM, you can switch hardware from GPU to LPU without changing any code. Let's jump into hyperdex-vllm!
114

12-
<p align="center">
13-
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> |
145

15-
</p>
16-
17-
18-
---
19-
20-
**vLLM & NVIDIA Triton User Meetup (Monday, September 9, 5pm-9pm PT) at Fort Mason, San Francisco**
21-
22-
We are excited to announce our sixth vLLM Meetup, in collaboration with NVIDIA Triton Team.
23-
Join us to hear the vLLM's recent update about performance.
24-
Register now [here](https://lu.ma/87q3nvnh) and be part of the event!
25-
26-
---
27-
28-
*Latest News* 🔥
29-
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
30-
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
31-
- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
32-
- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
33-
- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
34-
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
35-
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
36-
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
37-
38-
---
39-
## About
40-
vLLM is a fast and easy-to-use library for LLM inference and serving.
41-
42-
vLLM is fast with:
43-
44-
- State-of-the-art serving throughput
45-
- Efficient management of attention key and value memory with **PagedAttention**
46-
- Continuous batching of incoming requests
47-
- Fast model execution with CUDA/HIP graph
48-
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
49-
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
50-
- Speculative decoding
51-
- Chunked prefill
52-
53-
**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).
54-
55-
vLLM is flexible and easy to use with:
56-
57-
- Seamless integration with popular Hugging Face models
58-
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
59-
- Tensor parallelism and pipeline parallelism support for distributed inference
60-
- Streaming outputs
61-
- OpenAI-compatible API server
62-
- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
63-
- Prefix caching support
64-
- Multi-lora support
65-
66-
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
67-
- Transformer-like LLMs (e.g., Llama)
68-
- Mixture-of-Expert LLMs (e.g., Mixtral)
69-
- Embedding Models (e.g. E5-Mistral)
70-
- Multi-modal LLMs (e.g., LLaVA)
71-
72-
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
73-
74-
## Getting Started
75-
76-
Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
6+
Requirements
7+
- vLLM.0.5.5
8+
- libtorch.2.4.0
9+
- hyperdex.1.3.2
7710

11+
Installation
7812
```bash
79-
pip install vllm
13+
cd scripts
14+
./install_script.sh
8015
```
8116

82-
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
83-
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
84-
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
85-
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
86-
87-
## Contributing
88-
89-
We welcome and value any contributions and collaborations.
90-
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
17+
Simple Execution using vLLM API
18+
In our branch, you can run on the LPU by setting the options `device="fpga"` and `num_lpu_devices=1`. To test hybrid mode, also set `num_gpu_devices=1`.
19+
If you do not set the `device` option (default: `cuda`), vLLM behaves like the original vLLM.
9120

92-
## Sponsors
93-
94-
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
21+
```bash
22+
cd examples
23+
python lpu_inference.py
24+
```
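The device options described above can be sketched in Python as follows. This is a minimal illustration, not the contents of `lpu_inference.py`: `build_llm_kwargs` and `run` are hypothetical helpers, and the sketch assumes that `num_gpu_devices=0` means LPU-only execution, which matches the `--ngpu` default in `examples/lpu_inference_arg.py`. The `device`, `num_lpu_devices`, and `num_gpu_devices` keywords are taken from this commit and are only expected to exist in the HyperDex branch of vLLM.

```python
from typing import Any, Dict, Optional


def build_llm_kwargs(model: str,
                     device: Optional[str] = None,
                     num_lpu_devices: int = 1,
                     num_gpu_devices: int = 0) -> Dict[str, Any]:
    """Assemble keyword arguments for vllm.LLM (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"model": model}
    if device is None:
        # No device given: leave vLLM on its default (cuda) path.
        return kwargs
    kwargs["device"] = device
    if device == "fpga":
        if num_lpu_devices < 1:
            raise ValueError("the LPU backend needs at least one LPU device")
        kwargs["num_lpu_devices"] = num_lpu_devices
        # Assumption: 0 GPUs means LPU-only, a positive count means hybrid mode.
        kwargs["num_gpu_devices"] = num_gpu_devices
    return kwargs


def run(prompt: str, **llm_kwargs: Any) -> str:
    """Generate a completion; requires the HyperDex vLLM branch and hardware."""
    # Imported lazily so build_llm_kwargs works without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**llm_kwargs)
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

With the HyperDex branch installed, a hybrid run would then look like `run("Hello, my name is", **build_llm_kwargs("facebook/opt-1.3b", device="fpga", num_gpu_devices=1))`.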
9525

96-
<!-- Note: Please sort them in alphabetical order. -->
97-
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
9826

99-
- a16z
100-
- AMD
101-
- Anyscale
102-
- AWS
103-
- Crusoe Cloud
104-
- Databricks
105-
- DeepInfra
106-
- Dropbox
107-
- Google Cloud
108-
- Lambda Lab
109-
- NVIDIA
110-
- Replicate
111-
- Roblox
112-
- RunPod
113-
- Sequoia Capital
114-
- Skywork AI
115-
- Trainy
116-
- UC Berkeley
117-
- UC San Diego
118-
- ZhenFund
27+
Execution Serving API
28+
```bash
29+
# Start the serving system
30+
cd examples
31+
./vllm_serve.sh
11932

120-
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
33+
# Send requests to the serving system from another terminal
34+
cd examples
35+
python lpu_client.py
36+
```
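For readers who want to see what a client request might look like, here is a rough sketch. It is not the contents of `lpu_client.py`, which this commit does not show. It assumes that `vllm_serve.sh` starts vLLM's OpenAI-compatible server on `http://localhost:8000` and that the served model is `facebook/opt-1.3b`; both are assumptions, as are the helper names `build_completion_request` and `send`.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_request(prompt: str,
                             model: str = "facebook/opt-1.3b",
                             max_tokens: int = 32,
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint (assumed URL)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response; needs a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Once the server is up, `send(build_completion_request("Hello, my name is"))` would return the decoded JSON response.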
12137

122-
## Citation
38+
Visit our [website](https://docs.hyperaccel.ai) to learn more.
12339

124-
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
125-
```bibtex
126-
@inproceedings{kwon2023efficient,
127-
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
128-
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
129-
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
130-
year={2023}
131-
}
132-
```

_README.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
1+
<p align="center">
2+
<picture>
3+
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
4+
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
5+
</picture>
6+
</p>
7+
8+
<h3 align="center">
9+
Easy, fast, and cheap LLM serving for everyone
10+
</h3>
11+
12+
<p align="center">
13+
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> |
14+
15+
</p>
16+
17+
18+
---
19+
20+
**vLLM & NVIDIA Triton User Meetup (Monday, September 9, 5pm-9pm PT) at Fort Mason, San Francisco**
21+
22+
We are excited to announce our sixth vLLM Meetup, in collaboration with NVIDIA Triton Team.
23+
Join us to hear the vLLM's recent update about performance.
24+
Register now [here](https://lu.ma/87q3nvnh) and be part of the event!
25+
26+
---
27+
28+
*Latest News* 🔥
29+
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
30+
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
31+
- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
32+
- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
33+
- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
34+
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
35+
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
36+
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
37+
38+
---
39+
## About
40+
vLLM is a fast and easy-to-use library for LLM inference and serving.
41+
42+
vLLM is fast with:
43+
44+
- State-of-the-art serving throughput
45+
- Efficient management of attention key and value memory with **PagedAttention**
46+
- Continuous batching of incoming requests
47+
- Fast model execution with CUDA/HIP graph
48+
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
49+
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
50+
- Speculative decoding
51+
- Chunked prefill
52+
53+
**Performance benchmark**: We include a [performance benchmark](https://buildkite.com/vllm/performance-benchmark/builds/4068) that compares the performance of vLLM against other LLM serving engines ([TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [text-generation-inference](https://github.com/huggingface/text-generation-inference) and [lmdeploy](https://github.com/InternLM/lmdeploy)).
54+
55+
vLLM is flexible and easy to use with:
56+
57+
- Seamless integration with popular Hugging Face models
58+
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
59+
- Tensor parallelism and pipeline parallelism support for distributed inference
60+
- Streaming outputs
61+
- OpenAI-compatible API server
62+
- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
63+
- Prefix caching support
64+
- Multi-lora support
65+
66+
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
67+
- Transformer-like LLMs (e.g., Llama)
68+
- Mixture-of-Expert LLMs (e.g., Mixtral)
69+
- Embedding Models (e.g. E5-Mistral)
70+
- Multi-modal LLMs (e.g., LLaVA)
71+
72+
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
73+
74+
## Getting Started
75+
76+
Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
77+
78+
```bash
79+
pip install vllm
80+
```
81+
82+
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
83+
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
84+
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
85+
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
86+
87+
## Contributing
88+
89+
We welcome and value any contributions and collaborations.
90+
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
91+
92+
## Sponsors
93+
94+
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
95+
96+
<!-- Note: Please sort them in alphabetical order. -->
97+
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
98+
99+
- a16z
100+
- AMD
101+
- Anyscale
102+
- AWS
103+
- Crusoe Cloud
104+
- Databricks
105+
- DeepInfra
106+
- Dropbox
107+
- Google Cloud
108+
- Lambda Lab
109+
- NVIDIA
110+
- Replicate
111+
- Roblox
112+
- RunPod
113+
- Sequoia Capital
114+
- Skywork AI
115+
- Trainy
116+
- UC Berkeley
117+
- UC San Diego
118+
- ZhenFund
119+
120+
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
121+
122+
## Citation
123+
124+
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
125+
```bibtex
126+
@inproceedings{kwon2023efficient,
127+
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
128+
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
129+
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
130+
year={2023}
131+
}
132+
```

examples/lpu_inference_arg.py

Lines changed: 6 additions & 3 deletions
@@ -4,7 +4,8 @@
 # Get arguments
 parser = argparse.ArgumentParser(description='vLLM Inference Test Script')
 parser.add_argument("-m", "--model", default="facebook/opt-1.3b", type=str, help="name of the language model")
-parser.add_argument("-n", "--ncore", default=1, type=int, help="the number of the LPU")
+parser.add_argument("-l", "--nlpu", default=1, type=int, help="the number of the LPU")
+parser.add_argument("-g", "--ngpu", default=0, type=int, help="the number of the GPU")
 parser.add_argument("-i", "--i_token", default="Hello, my name is", type=str, help="input prompt")
 parser.add_argument("-o", "--o_token", default=32, type=int, help="the number of output")
 args = parser.parse_args()
@@ -13,9 +14,11 @@
 prompts = [args.i_token]
 
 # Create a sampling params object and LLM
+print(args.i_token)
+print(args.o_token)
+print(args.nlpu, args.ngpu, args.model)
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95, top_k=1, max_tokens=args.o_token)
-llm = LLM(model=args.model, device="fpga", tensor_parallel_size=args.ncore)
-
+llm = LLM(model=args.model, device="fpga", num_lpu_devices=args.nlpu, num_gpu_devices=args.ngpu)
 # Run and print the outputs.
 outputs = llm.generate(prompts, sampling_params)
 for output in outputs:
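
To illustrate how the renamed flags behave, the argument handling in this hunk can be exercised on its own, without vLLM or LPU hardware. The sketch below copies the `-m`, `-l`, and `-g` arguments from the diff. The `parse_args` wrapper and the range check are additions for illustration and are not part of the commit.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the device-count flags introduced in this commit."""
    parser = argparse.ArgumentParser(description="vLLM Inference Test Script")
    parser.add_argument("-m", "--model", default="facebook/opt-1.3b", type=str,
                        help="name of the language model")
    parser.add_argument("-l", "--nlpu", default=1, type=int,
                        help="the number of the LPU")
    parser.add_argument("-g", "--ngpu", default=0, type=int,
                        help="the number of the GPU")
    args = parser.parse_args(argv)
    # Illustrative check, not in the original script: reject impossible counts.
    if args.nlpu < 1 or args.ngpu < 0:
        parser.error("--nlpu must be >= 1 and --ngpu must be >= 0")
    return args


if __name__ == "__main__":
    cli_args = parse_args()
    print(cli_args.nlpu, cli_args.ngpu, cli_args.model)
```

Running the script with `-l 2 -g 1` sets `nlpu=2` and `ngpu=1`, which the commit then passes to `LLM` as `num_lpu_devices` and `num_gpu_devices`.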
