Commit 85946d4

[Inference] Fix readme and example for API server (#5742)

* fix chatapi readme and example
* updating doc
* add an api and change the doc
* remove
* add credits and del 'API' heading
* readme
* readme
1 parent 4647ec2 commit 85946d4

5 files changed: +73 -40 lines
colossalai/inference/README.md

Lines changed: 51 additions & 3 deletions
@@ -207,13 +207,13 @@ Learnt from [PagedAttention](https://arxiv.org/abs/2309.06180) by [vLLM](https:/
 The request handler is responsible for managing requests and scheduling a proper batch from existing requests. Based on [Orca's](https://www.usenix.org/conference/osdi22/presentation/yu) and [vLLM's](https://github.com/vllm-project/vllm) research and work on batching requests, we applied continuous batching with unpadded sequences, which enables varying numbers of sequences to pass projections (i.e. Q, K, and V) together in different steps by hiding the dimension of the number of sequences, and reduces the latency of incoming sequences by inserting a prefill batch during a decoding step and then decoding together.

 <p align="center">
-   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/continuous_batching.png" width="800"/>
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/naive_batching.png" width="800"/>
 <br/>
 <em>Naive Batching: decode until each sequence encounters eos in a batch</em>
 </p>

 <p align="center">
-   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/naive_batching.png" width="800"/>
+   <img src="https://raw.githubusercontent.com/hpcaitech/public_assets/main/colossalai/img/inference/continuous_batching.png" width="800"/>
 <br/>
 <em>Continuous Batching: dynamically adjust the batch size by popping out finished sequences and inserting prefill batch</em>
 </p>
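To make the scheduling behaviour described in this hunk concrete, here is a minimal sketch of one continuous-batching step. All names (`engine`, `seq.finished`, `MAX_BATCH_SIZE`) are hypothetical placeholders, not ColossalAI's actual request-handler API:

```python
MAX_BATCH_SIZE = 8  # hypothetical cap, analogous to --max_batch_size below


def schedule_step(running, waiting, engine):
    """One continuous-batching step as described above (illustrative sketch only).

    `running` is the list of in-flight sequences, `waiting` is a queue-like
    object of new requests (supporting popleft()), and `engine` is a
    hypothetical object exposing prefill() and decode() methods.
    """
    # Pop out finished sequences so their batch slots free up immediately.
    running = [seq for seq in running if not seq.finished]

    # Insert a prefill batch for newly arrived requests during a decoding step.
    prefill_batch = []
    while waiting and len(running) + len(prefill_batch) < MAX_BATCH_SIZE:
        prefill_batch.append(waiting.popleft())
    if prefill_batch:
        engine.prefill(prefill_batch)  # build the KV cache for the new prompts
        running += prefill_batch

    # Decode one token for however many sequences are currently alive.
    if running:
        engine.decode(running)
    return running
```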
@@ -222,14 +222,62 @@ Request handler is responsible for managing requests and scheduling a proper bat

 Modeling contains models, layers, and policy, which are hand-crafted for better performance and easier usage. Integrated with `shardformer`, users can define their own policy or use our preset policies for specific models. Our modeling files are aligned with [Transformers](https://github.com/huggingface/transformers). For more details about the usage of modeling and policy, please check `colossalai/shardformer`.

+## Online Service
+Colossal-Inference supports a FastAPI-based online service. Both simple completion and chat are supported. Follow the commands below to set up a server with both completion and chat functionality. For now we support the `Llama2`, `Llama3` and `Baichuan2` models; more models will be supported soon.
+
+### API
+
+- GET '/ping':
+Ping is used to check whether the server can receive and send information.
+- GET '/engine_check':
+Check whether the background engine is working.
+- POST '/completion':
+The completion API is used for single-sequence requests, such as answering a question or completing text.
+- POST '/chat':
+The chat API is used for conversation-style requests, which often include dialogue participants (i.e. roles) and their corresponding utterances. Since this input data is very different from normal inputs, we introduce a chat template to match the data format expected by chat models.
+#### chat-template
+Following `transformers`, we add the chat-template argument. Chat models have been trained with very different formats for converting conversations into a single tokenizable string, so using a format that matches the training data is extremely important. This attribute (chat_template) is included in HuggingFace tokenizers as a Jinja template that converts conversation histories into a correctly formatted string. You can refer to the [HuggingFace blog](https://huggingface.co/blog/chat-templates) for more information. We also provide a simple example template below. Both string and file chat templates are supported.
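For illustration, a minimal sketch of how such a Jinja chat template renders a conversation through the `transformers` tokenizer API; the model name is only a placeholder, and the template string mirrors the one used in the example launch command below:

```python
from transformers import AutoTokenizer

# Placeholder model name; any chat model tokenizer you have access to works here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# The same Jinja template string that is passed to --chat-template in the example below.
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{'<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n'}}"
    "{% endfor %}"
)

messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": "what is 1+1?"},
]

# Render the conversation into the single formatted string the chat model is fed.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```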
+### Usage
+#### Args for customizing your server
+The configuration for the API server covers both the serving interface and the engine backend.
+For the interface:
+- `--host`: The host URL on your device for the server.
+- `--port`: The port for the service.
+- `--model`: The model used by the backend engine; both a local path and a Transformers model card are supported.
+- `--chat-template`: The file path of the chat template, or the template string itself.
+- `--response-role`: The role that Colossal-Inference plays in the response.
+For the engine backend:
+- `--block_size`: The token block size, i.e. the number of tokens held in each KV-cache block.
+- `--max_batch_size`: The maximum batch size for the engine to infer; this affects inference speed.
+- `--max_input_len`: The maximum input length of a request.
+- `--max_output_len`: The maximum output length of a response.
+- `--dtype` and `--use_cuda_kernel`: Decide the inference precision and whether the CUDA kernels are used.
+For more detailed arguments, please refer to the source code.
+
+### Examples
+```bash
+# First, launch an API server locally.
+python3 -m colossalai.inference.server.api_server --model path of your model --chat-template "{% for message in messages %}{{'<|im_start|>'+message['role']+'\n'+message['content']+'<|im_end|>'+'\n'}}{% endfor %}"
+
+# Second, open the page `http://127.0.0.1:8000/docs` to check the API.
+
+# For the completion service, you can invoke it like this:
+curl -X POST http://127.0.0.1:8000/completion -H 'Content-Type: application/json' -d '{"prompt":"hello, who are you? "}'
+
+# For the chat service, you can invoke it like this:
+curl -X POST http://127.0.0.1:8000/chat -H 'Content-Type: application/json' -d '{"messages":[{"role":"system","content":"you are a helpful assistant"},{"role":"user","content":"what is 1+1?"}]}'
+
+# You can check the engine status now:
+curl http://localhost:8000/engine_check
+```

 ## 🌟 Acknowledgement

 This project was written from scratch, but we learned a lot from several other great open-source projects during development. Therefore, we wish to fully acknowledge their contribution to the open-source community. These projects include

 - [vLLM](https://github.com/vllm-project/vllm)
 - [flash-attention](https://github.com/Dao-AILab/flash-attention)
-
+- [HuggingFace](https://huggingface.co)
 If you wish to cite relevant research papers, you can find the reference below.

 ```bibtex
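For reference, the curl calls in the examples above map one-to-one onto plain HTTP requests. A minimal Python client sketch, assuming the API server from this commit is running locally on port 8000 and the `requests` package is installed (the exact response schema for completion/chat depends on the server, so the raw text is printed):

```python
import requests  # assumed to be installed; any HTTP client works

BASE_URL = "http://127.0.0.1:8000"  # assumes the API server above is running locally

# Liveness and engine checks (GET /ping, GET /engine_check).
print(requests.get(f"{BASE_URL}/ping").json())
print(requests.get(f"{BASE_URL}/engine_check").json())

# Single-sequence completion (POST /completion), mirroring the curl example above.
completion = requests.post(f"{BASE_URL}/completion", json={"prompt": "hello, who are you? "})
print(completion.text)

# Conversation-style chat (POST /chat); the payload mirrors the curl example above.
chat = requests.post(
    f"{BASE_URL}/chat",
    json={
        "messages": [
            {"role": "system", "content": "you are a helpful assistant"},
            {"role": "user", "content": "what is 1+1?"},
        ]
    },
)
print(chat.text)
```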

colossalai/inference/server/README.md

Lines changed: 0 additions & 27 deletions
This file was deleted.

colossalai/inference/server/api_server.py

Lines changed: 15 additions & 6 deletions
@@ -30,7 +30,6 @@
 from colossalai.inference.core.async_engine import AsyncInferenceEngine, InferenceEngine  # noqa

 TIMEOUT_KEEP_ALIVE = 5  # seconds.
-supported_models_dict = {"Llama_Models": ("llama2-7b",)}
 prompt_template_choices = ["llama", "vicuna"]
 async_engine = None
 chat_serving = None
@@ -39,15 +38,25 @@
 app = FastAPI()


-# NOTE: (CjhHa1) models are still under development, need to be updated
-@app.get("/models")
-def get_available_models() -> Response:
-    return JSONResponse(supported_models_dict)
+@app.get("/ping")
+def health_check() -> JSONResponse:
+    """Health check for the server."""
+    return JSONResponse({"status": "Healthy"})
+
+
+@app.get("/engine_check")
+def engine_check() -> JSONResponse:
+    """Check whether the background loop is running."""
+    loop_status = async_engine.background_loop_status
+    if not loop_status:
+        return JSONResponse({"status": "Error"})
+    return JSONResponse({"status": "Running"})


 @app.post("/generate")
 async def generate(request: Request) -> Response:
     """Generate completion for the request.
+    NOTE: This API is intended only for testing; do not use it in an actual application.

     A request should be a JSON object with the following fields:
     - prompts: the prompts to use for the generation.
@@ -133,7 +142,7 @@ def add_engine_config(parser):
     # Parallel arguments not supported now

     # KV cache arguments
-    parser.add_argument("--block-size", type=int, default=16, choices=[8, 16, 32], help="token block size")
+    parser.add_argument("--block_size", type=int, default=16, choices=[16, 32], help="token block size")

     parser.add_argument("--max_batch_size", type=int, default=8, help="maximum number of batch size")
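As a standalone sketch of the health-check pattern this file now exposes (and of why `fastapi` and `uvicorn` are added to the requirements below), the following runnable snippet mirrors the two new endpoints; the dummy engine object is a placeholder, not ColossalAI's AsyncInferenceEngine:

```python
from types import SimpleNamespace

import uvicorn
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# Placeholder for the real async engine; only the flag the endpoints read is modelled.
async_engine = SimpleNamespace(background_loop_status=True)


@app.get("/ping")
def health_check() -> JSONResponse:
    """Liveness probe: the server can receive and answer requests."""
    return JSONResponse({"status": "Healthy"})


@app.get("/engine_check")
def engine_check() -> JSONResponse:
    """Readiness probe: report whether the background generation loop is running."""
    if not async_engine.background_loop_status:
        return JSONResponse({"status": "Error"})
    return JSONResponse({"status": "Running"})


if __name__ == "__main__":
    # uvicorn (added to requirements.txt in this commit) serves the FastAPI app.
    uvicorn.run(app, host="127.0.0.1", port=8000)
```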

examples/inference/client/locustfile.py

Lines changed: 5 additions & 4 deletions
@@ -20,7 +20,7 @@ def chat(self):
         self.client.post(
             "/chat",
             json={
-                "converation": [
+                "messages": [
                     {"role": "system", "content": "you are a helpful assistant"},
                     {"role": "user", "content": "what is 1+1?"},
                 ],
@@ -34,14 +34,15 @@ def chat_streaming(self):
         self.client.post(
             "/chat",
             json={
-                "converation": [
+                "messages": [
                     {"role": "system", "content": "you are a helpful assistant"},
                     {"role": "user", "content": "what is 1+1?"},
                 ],
                 "stream": "True",
             },
         )

+    # offline-generation is only for showing the usage; it will never be used in actual serving.
     @tag("offline-generation")
     @task(5)
     def generate_streaming(self):
@@ -54,5 +55,5 @@ def generate(self):

     @tag("online-generation", "offline-generation")
     @task
-    def get_models(self):
-        self.client.get("/models")
+    def health_check(self):
+        self.client.get("/ping")

requirements/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -20,4 +20,6 @@ transformers==4.36.2
 peft>=0.7.1
 bitsandbytes>=0.39.0
 rpyc==6.0.0
+fastapi
+uvicorn==0.29.0
 galore_torch
