Commit d8b1ea4
[doc] Update Inference Readme (#5736)
* [doc] update inference readme * add contents * trivial
1 parent bdf9a00 commit d8b1ea4

File tree

1 file changed: +148, -93 lines changed

colossalai/inference/README.md

Lines changed: 148 additions & 93 deletions
@@ -5,75 +5,28 @@
- [⚡️ ColossalAI-Inference](#️-colossalai-inference)
- [📚 Table of Contents](#-table-of-contents)
- [📌 Introduction](#-introduction)
-- [🛠 Design and Implementation](#-design-and-implementation)
- [🕹 Usage](#-usage)
-- [🪅 Support Matrix](#-support-matrix)
- [🗺 Roadmap](#-roadmap)
+- [🪅 Support Matrix](#-support-matrix)
+- [🛠 Design and Components](#-design-and-components)
+  - [Overview](#overview)
+  - [Engine](#engine)
+  - [Blocked KV Cache Manager](#kv-cache)
+  - [Batching](#batching)
+  - [Modeling](#modeling)
- [🌟 Acknowledgement](#-acknowledgement)


## 📌 Introduction
ColossalAI-Inference is a module that accelerates the inference of Transformer models, especially LLMs. It leverages high-performance kernels, KV cache, paged attention, continuous batching, and other techniques to speed up LLM inference, while providing simple and unified APIs for user-friendliness.

-## 🛠 Design and Implementation
-
-### :book: Overview
-
-ColossalAI-Inference has **4** major components, namely namely `engine`,`request handler`,`cache manager`, and `modeling`.
-
-- **Engine**: It orchestrates the inference step. During inference, it recives a request, calls `request handler` to schedule a decoding batch, and executes the model forward pass to perform a iteration. It returns the inference results back to the user at the end.
-- **Request Handler**: It manages requests and schedules a proper batch from exisiting requests.
-- **Cache manager** It is bound within the `request handler`, updates cache blocks and logical block tables as scheduled by the `request handler`.
-- **Modelling**: We rewrite the model and layers of LLMs to simplify and optimize the forward pass for inference.
-
-
-A high-level view of the inter-component interaction is given below. We would also introduce more details in the next few sections.
-
-<p align="center">
-<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Introduction.png" width="600"/>
-<br/>
-</p>
-
-### :mailbox_closed: Engine
-Engine is designed as the entry point where the user kickstarts an inference loop. User can easily instantialize an inference engine with the inference configuration and execute requests. The engine object will expose the following APIs for inference:
-
-- `generate`: main function which handles inputs, performs inference and returns outputs
-- `add_request`: add request to the waiting list
-- `step`: perform one decoding iteration. The `request handler` first schedules a batch to do prefill/decoding. Then, it invokes a model to generate a batch of token and afterwards does logit processing and sampling, checks and decodes finished requests.
-
-### :game_die: Request Handler
-
-Request handler is responsible for managing requests and scheduling a proper batch from exisiting requests. According to the existing work and experiments, we do believe that it is beneficial to increase the length of decoding sequences. In our design, we partition requests into three priorities depending on their lengths, the longer sequences are first considered.
-
-<p align="center">
-<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/Request_handler.svg" width="800"/>
-<br/>
-</p>
-
-### :radio: KV cache and cache manager
-
-We design a unified block cache and cache manager to allocate and manage memory. The physical memory is allocated before decoding and represented by a logical block table. During decoding process, cache manager administrates the physical memory through `block table` and other components(i.e. engine) can focus on the lightweight `block table`. More details are given below.
-
-- `cache block`: We group physical memory into different memory blocks. A typical cache block is shaped `(num_kv_heads, head_size, block_size)`. We determine the block number beforehand. The memory allocation and computation are executed at the granularity of memory block.
-- `block table`: Block table is the logical representation of cache blocks. Concretely, a block table of a single sequence is a 1D tensor, with each element holding a block ID. Block ID of `-1` means "Not Allocated". In each iteration, we pass through a batch block table to the corresponding model.
-
-<figure>
-<p align="center">
-<img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/BlockTable.svg"/>
-<br/>
-<figcation>Example of Batch Block Table</figcation>
-</p>
-</figure>
-
-
-### :railway_car: Modeling
-
-Modeling contains models and layers, which are hand-crafted for better performance easier usage. Deeply integrated with `shardformer`, we also construct policy for our models. In order to minimize users' learning costs, our models are aligned with [Transformers](https://github.com/huggingface/transformers)

## 🕹 Usage

### :arrow_right: Quick Start

+The sample usage of the inference engine is given below:
+
```python
import torch
import transformers
@@ -95,7 +48,6 @@ inference_config = InferenceConfig(
    max_input_len=1024,
    max_output_len=512,
    use_cuda_kernel=True,
-    use_cuda_graph=False, # Turn on if you want to use CUDA Graph to accelerate inference
)

# Step 3: create an engine with model and config
@@ -107,63 +59,168 @@ response = engine.generate(prompts)
pprint(response)
```

+You can run the sample code with:
+```bash
+colossalai run --nproc_per_node 1 your_sample_name.py
+```
+
+For detailed examples, you might want to check [inference examples](../../examples/inference/llama/README.md).
+
### :bookmark: Customize your inference engine
-Besides the basic quick-start inference, you can also customize your inference engine via modifying config or upload your own model or decoding components (logit processors or sampling strategies).
+Besides the basic quick-start inference, you can also customize your inference engine by modifying the inference config or by plugging in your own models, policies, or decoding components (logits processors or sampling strategies).

#### Inference Config
-Inference Config is a unified api for generation process. You can define the value of args to control the generation, like `max_batch_size`,`max_output_len`,`dtype` to decide the how many sequences can be handled at a time, and how many tokens to output. Refer to the source code for more detail.
+Inference Config is a unified config for initializing the inference engine, controlling multi-GPU generation (tensor parallelism), and presetting generation configs. Some commonly used arguments of `InferenceConfig` are listed below; a usage sketch follows the list.
+
+- `max_batch_size`: The maximum batch size. Defaults to 8.
+- `max_input_len`: The maximum input length (number of tokens). Defaults to 256.
+- `max_output_len`: The maximum output length (number of tokens). Defaults to 256.
+- `dtype`: The data type of the model for inference. This can be one of `fp16`, `bf16`, or `fp32`. Defaults to `fp16`.
+- `kv_cache_dtype`: The data type used for the KV cache. Defaults to the same data type as the model (`dtype`). KV cache quantization is enabled automatically if it differs from the model's `dtype`.
+- `use_cuda_kernel`: Whether to use CUDA kernels. If disabled, Triton kernels are used instead. Defaults to False.
+- `tp_size`: Tensor-parallelism size. Defaults to 1 (tensor parallelism is turned off by default).
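As a quick illustration of these arguments, a config for single-GPU fp16 inference might look like the sketch below. This is a minimal sketch only: the import path is assumed from the quick-start example, and the concrete values are illustrative rather than recommended defaults.

```python
from colossalai.inference.config import InferenceConfig  # assumed import path

# Illustrative values only; every keyword comes from the argument list above.
inference_config = InferenceConfig(
    max_batch_size=8,
    max_input_len=1024,
    max_output_len=256,
    dtype="fp16",            # model dtype used for inference
    kv_cache_dtype="fp16",   # same as `dtype`, so KV cache quantization stays off
    use_cuda_kernel=True,    # set False to fall back to Triton kernels
    tp_size=1,               # >1 enables tensor parallelism across GPUs
)
```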

#### Generation Config
-In colossal-inference, Generation config api is inherited from [Transformers](https://github.com/huggingface/transformers). Usage is aligned. By default, it is automatically generated by our system and you don't bother to construct one. If you have such demand, you can also create your own and send it to your engine.
+Refer to the transformers [GenerationConfig](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig) documentation for the functionality and usage of specific configs. In ColossalAI-Inference, generation configs can be preset in `InferenceConfig` (see the sketch after this list). Supported generation configs include:
+
+- `do_sample`: Whether or not to use sampling. Defaults to False (greedy decoding).
+- `top_k`: The number of highest-probability vocabulary tokens to keep for top-k filtering. Defaults to 50.
+- `top_p`: If set to a float < 1, only the smallest set of most probable tokens with probabilities that add up to `top_p` or higher are kept for generation. Defaults to 1.0.
+- `temperature`: The value used to modulate the next-token probabilities. Defaults to 1.0.
+- `no_repeat_ngram_size`: If set to an int > 0, all ngrams of that size can only occur once. Defaults to 0.
+- `repetition_penalty`: The parameter for repetition penalty. 1.0 means no penalty. Defaults to 1.0.
+- `forced_eos_token_id`: The id of the token to force as the last generated token when `max_length` is reached. Defaults to `None`.
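For instance, switching from greedy decoding to sampling could be done by presetting these fields when the inference config is created; a minimal sketch, assuming `InferenceConfig` accepts the generation configs above as keyword arguments:

```python
from colossalai.inference.config import InferenceConfig  # assumed import path

# Sketch: preset sampling-related generation configs directly in InferenceConfig.
# Keyword names follow the list above; the values are illustrative.
inference_config = InferenceConfig(
    do_sample=True,          # enable sampling instead of greedy decoding
    temperature=0.7,
    top_k=50,
    top_p=0.9,
    repetition_penalty=1.1,  # mildly discourage repeated tokens
)
```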

-#### Logit Processors
-The `Logit Processosr` receives logits and return processed results. You can take the following step to make your own.
+Users can also create a transformers [GenerationConfig](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig) and pass it as an input argument to the `InferenceEngine.generate` API. For example:

```python
-@register_logit_processor("name")
-def xx_logit_processor(logits, args):
-    logits = do_some_process(logits)
-    return logits
+generation_config = GenerationConfig(
+    max_length=128,
+    do_sample=True,
+    temperature=0.7,
+    top_k=50,
+    top_p=1.0,
+)
+response = engine.generate(prompts=prompts, generation_config=generation_config)
```

-#### Sampling Strategies
-We offer 3 main sampling strategies now (i.e. `greedy sample`, `multinomial sample`, `beam_search sample`), you can refer to [sampler](/ColossalAI/colossalai/inference/sampler.py) for more details. We would strongly appreciate if you can contribute your varities.
+## 🗺 Roadmap
+
+We will follow this roadmap to develop the major features of ColossalAI-Inference:
+
+- [x] Blocked KV Cache
+- [x] Paged Attention
+- 🟩 Fused Kernels
+- [x] Speculative Decoding
+- [x] Continuous Batching
+- 🟩 Tensor Parallelism
+- [ ] Online Inference
+- [ ] Beam Search
+- [ ] SplitFuse
+
+Notations:
+- [x] Completed
+- 🟩 Model-specific and still in progress

## 🪅 Support Matrix

-| Model | KV Cache | Paged Attention | Kernels | Tensor Parallelism | Speculative Decoding |
-| - | - | - | - | - | - |
-| Llama | ✅ | ✅ | ✅ | 🔜 | ✅ |
+| Model | Model Card | Tensor Parallel | Lazy Initialization | Paged Attention | Fused Kernels | Speculative Decoding |
+|-----------|------------------------------------------------------------------------------------------------|-----------------|---------------------|-----------------|---------------|----------------------|
+| Baichuan | `baichuan-inc/Baichuan2-7B-Base`,<br> `baichuan-inc/Baichuan2-13B-Base`, etc | ✅ | [ ] | ✅ | ✅ | [ ] |
+| ChatGLM | | [ ] | [ ] | [ ] | [ ] | [ ] |
+| DeepSeek | | [ ] | [ ] | [ ] | [ ] | [ ] |
+| Llama | `meta-llama/Llama-2-7b`,<br> `meta-llama/Llama-2-13b`,<br> `meta-llama/Meta-Llama-3-8B`,<br> `meta-llama/Meta-Llama-3-70B`, etc | ✅ | [ ] | ✅ | ✅ | ✅ |
+| Mixtral | | [ ] | [ ] | [ ] | [ ] | [ ] |
+| Qwen | | [ ] | [ ] | [ ] | [ ] | [ ] |
+| Vicuna | `lmsys/vicuna-13b-v1.3`,<br> `lmsys/vicuna-7b-v1.5` | ✅ | [ ] | ✅ | ✅ | ✅ |
+| Yi | `01-ai/Yi-34B`, etc | ✅ | [ ] | ✅ | ✅ | ✅ |


-Notations:
-- ✅: supported
-- ❌: not supported
-- 🔜: still developing, will support soon
+## 🛠 Design and Components

-## 🗺 Roadmap
+### Overview
+
+ColossalAI-Inference has **4** major components, namely `engine`, `request handler`, `kv cache manager`, and `modeling`.
+
+<p align="center">
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/colossalai-inference-overview-abstract.png" alt="colossalai-inference-components-overview" width="600" />
+   <br/>
+</p>
+
+- **Engine**: It orchestrates the inference step. During inference, it receives a request, calls the `request handler` to schedule a decoding batch, and executes the model forward pass to perform an iteration. At the end, it returns the inference results back to the user.
+- **Request Handler**: It manages requests and schedules a proper batch from existing requests.
+- **KV Cache Manager**: It is bound to the `request handler` and updates cache blocks and logical block tables as scheduled by the `request handler`.
+- **Modeling**: We rewrite the models and layers of LLMs to simplify and optimize the forward pass for inference.
+
+
+An overview of the inter-component interaction is given below (RPC version). More details are introduced in the next few sections.
+
+<p align="center">
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/colossalai-inference-framework.png" alt="colossalai-inference-framework-rpc" width="600"/>
+   <br/>
+</p>
+
+### Engine
+
+Engine is designed as the entry point where the user kickstarts an inference loop. Users can easily initialize an inference engine with the inference configurations and execute their requests. We provide several versions of inference engines, namely `InferenceEngine`, `RPCInferenceEngine`, and `AsyncInferenceEngine`, which are used for different conditions and purposes.
+
+For `InferenceEngine` and `RPCInferenceEngine`, we expose the following APIs for inference (a usage sketch follows this list):
+
+- `generate`: main function which handles inputs, performs inference and returns outputs.
+- `add_request`: add a single or multiple requests to the inference engine.
+- `step`: perform one decoding iteration. The `request handler` first schedules a batch to do prefill/decoding. Then, it invokes the model to generate a batch of tokens, and afterwards performs logits processing and sampling, then checks and decodes finished requests.
+- `enable_spec_dec`: used for speculative decoding. Enable speculative decoding for subsequent generations.
+- `disable_spec_dec`: used for speculative decoding. Disable speculative decoding for subsequent generations.
+- `clear_spec_dec`: clear structures and models related to speculative decoding, if they exist.
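The sketch below strings these APIs together. It is illustrative only: `engine` is an already-initialized `InferenceEngine`, the `drafter_model` object and the keyword names passed to `add_request` and `enable_spec_dec` are assumptions, and the engine source remains the reference for exact signatures.

```python
# Illustrative sketch of the engine APIs listed above (argument names are assumptions).
engine.add_request(prompts=["What is deep learning?"])   # queue one or more requests
engine.step()                                            # run a single prefill/decoding iteration

engine.enable_spec_dec(drafter_model)                    # speculative decoding for later generations
output = engine.generate(prompts=["Explain the KV cache in one sentence."])
engine.disable_spec_dec()                                # fall back to normal decoding
engine.clear_spec_dec()                                  # release drafter-related structures
```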
+
+For `AsyncInferenceEngine`, we expose the following APIs for inference (a sketch of driving it from asyncio follows this list):
+- `add_request`: async method. Add a request to the inference engine, as well as to the waiting queue of the background tracker.
+- `generate`: async method. Perform inference from a request.
+- `step`: async method. Perform one decoding iteration, if there is any request in the waiting queue.
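A rough sketch of serving several requests concurrently with the async engine is given below. It assumes an `AsyncInferenceEngine` instance named `async_engine` has already been constructed elsewhere, and that `generate` accepts a prompt string; both are assumptions, so treat this as the shape of the usage rather than the exact API.

```python
import asyncio

async def serve_one(async_engine, prompt: str):
    # `generate` is an async method, so several requests can be awaited
    # concurrently while the background tracker schedules decoding steps.
    return await async_engine.generate(prompt)

async def main(async_engine):
    prompts = ["What is AI?", "Summarize this README.", "Translate 'hello' to French."]
    results = await asyncio.gather(*(serve_one(async_engine, p) for p in prompts))
    for response in results:
        print(response)

# asyncio.run(main(async_engine))  # hypothetical: async_engine is built elsewhere
```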
+
+For now, `InferenceEngine` is used for offline generation; `AsyncInferenceEngine` is used for online serving with a single card; and `RPCInferenceEngine` is used for online serving with multiple cards. In the future, we will focus on `RPCInferenceEngine` and on improving the user experience of LLM serving.
+
+
+### KV cache
+
+Learnt from [PagedAttention](https://arxiv.org/abs/2309.06180) by the [vLLM](https://github.com/vllm-project/vllm) team, we use a unified blocked KV cache and a cache manager to allocate and manage memory. The physical memory is pre-allocated during initialization and represented by a logical block table. During the decoding process, the cache manager administrates the physical memory through the `block table` of a batch, so that other components (i.e. the engine) can focus on the lightweight `block table`. More details are given below, and a toy sketch of the block-table layout follows the figure.
+
+- `logical cache block`: We group physical memory into different memory blocks. A typical cache block is shaped `(num_kv_heads, block_size, head_size)`. We determine the block number beforehand. The memory allocation and computation are executed at the granularity of a memory block.
+- `block table`: Block table is the logical representation of cache blocks. Concretely, a block table of a single sequence is a 1D tensor, with each element holding a block ID. A block ID of `-1` means "Not Allocated". In each iteration, we pass a batch block table to the corresponding model.
+
+<p align="center">
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/Structure/BlockTable.svg"/>
+   <br/>
+   <em>Example of block table for a batch</em>
+</p>
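To make the block-table layout concrete, here is a small self-contained toy sketch (not the actual cache-manager code): each sequence is assigned consecutive block IDs at block granularity, and the remaining slots stay `-1`, i.e. "Not Allocated".

```python
import torch

BLOCK_SIZE = 16          # tokens per cache block (illustrative value)
MAX_BLOCKS_PER_SEQ = 8   # width of the block table (illustrative value)

def build_batch_block_table(seq_lens: list[int]) -> torch.Tensor:
    """Toy illustration of a batch block table: one row per sequence."""
    table = torch.full((len(seq_lens), MAX_BLOCKS_PER_SEQ), -1, dtype=torch.int32)
    next_free_block = 0
    for i, seq_len in enumerate(seq_lens):
        num_blocks = -(-seq_len // BLOCK_SIZE)  # ceil(seq_len / BLOCK_SIZE)
        table[i, :num_blocks] = torch.arange(
            next_free_block, next_free_block + num_blocks, dtype=torch.int32
        )
        next_free_block += num_blocks
    return table

print(build_batch_block_table([5, 40, 17]))
# tensor([[ 0, -1, -1, -1, -1, -1, -1, -1],
#         [ 1,  2,  3, -1, -1, -1, -1, -1],
#         [ 4,  5, -1, -1, -1, -1, -1, -1]], dtype=torch.int32)
```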
+
+
+### Batching
+
+Request handler is responsible for managing requests and scheduling a proper batch from existing requests. Based on [Orca's](https://www.usenix.org/conference/osdi22/presentation/yu) and [vLLM's](https://github.com/vllm-project/vllm) research and work on batching requests, we apply continuous batching with unpadded sequences: it lets a varying number of sequences pass through the projections (i.e. Q, K, and V) together at different steps by hiding the sequence-count dimension, and it reduces the latency of incoming sequences by inserting a prefill batch during a decoding step and then decoding them together. A toy scheduling sketch is given after the two figures below.
+
+<p align="center">
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/continuous_batching.png" width="800"/>
+   <br/>
+   <em>Naive Batching: decode until each sequence encounters eos in a batch</em>
+</p>
+
+<p align="center">
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/naive_batching.png" width="800"/>
+   <br/>
+   <em>Continuous Batching: dynamically adjust the batch size by popping out finished sequences and inserting prefill batch</em>
+</p>
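The toy loop below (a sketch, not the actual request handler) illustrates the scheduling idea: after every iteration, finished sequences are popped out of the running batch, and waiting requests are admitted as a prefill batch instead of waiting for the whole batch to hit eos.

```python
from collections import deque

def continuous_batching_loop(waiting: deque, max_batch_size: int, run_step):
    """Toy continuous-batching scheduler.

    `waiting` holds incoming requests; `run_step(batch)` performs one
    prefill/decoding iteration and returns the set of requests that
    finished (emitted eos or hit their length limit) in that iteration.
    """
    running = []
    while waiting or running:
        # Admit new requests (a prefill batch) into the running batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        finished = run_step(running)  # one prefill/decoding iteration

        # Pop out finished sequences so their slots are reused immediately,
        # instead of decoding until every sequence in the batch is done.
        running = [req for req in running if req not in finished]
```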
+
+### Modeling
+
+Modeling contains models, layers, and policies, which are hand-crafted for better performance and easier usage. Integrated with `shardformer`, users can define their own policies or use our preset policies for specific models. Our modeling files are aligned with [Transformers](https://github.com/huggingface/transformers). For more details about the usage of modeling and policies, please check `colossalai/shardformer`.

-- [x] KV Cache
-- [x] Paged Attention
-- [x] High-Performance Kernels
-- [x] Llama Modelling
-- [x] User Documentation
-- [x] Speculative Decoding
-- [ ] Tensor Parallelism
-- [ ] Beam Search
-- [ ] Early stopping
-- [ ] Logger system
-- [ ] SplitFuse
-- [ ] Continuous Batching
-- [ ] Online Inference
-- [ ] Benchmarking

## 🌟 Acknowledgement

This project was written from scratch, but we learned a lot from several other great open-source projects during development. Therefore, we wish to fully acknowledge their contribution to the open-source community. These projects include

- [vLLM](https://github.com/vllm-project/vllm)
-- [LightLLM](https://github.com/ModelTC/lightllm)
- [flash-attention](https://github.com/Dao-AILab/flash-attention)

If you wish to cite relevant research papers, you can find the references below.
@@ -189,6 +246,4 @@ If you wish to cite relevant research papars, you can find the reference below.
  author={Dao, Tri},
  year={2023}
}
-
-# we do not find any research work related to lightllm
```
