Commit d4e3a20
[Docs] Release 2.1 docs and fix some description (#3424)
1 parent: fbb6dcb

14 files changed: +73 -29 lines

README.md (7 additions, 6 deletions)

````diff
@@ -1,4 +1,4 @@
-English | [简体中文](README_CN.md)
+English | [简体中文](README_CN.md)
 <p align="center">
 <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
 </p>
@@ -23,9 +23,10 @@ English | [简体中文](README_CN.md)
 </p>
 
 --------------------------------------------------------------------------------
-# FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
+# FastDeploy 2.1: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
 
 ## News
+**[2025-08] 🔥 Released FastDeploy v2.1:** A brand-new KV Cache scheduling strategy has been introduced, and support for PD separation and CUDA Graph has been expanded to more models. Hardware support has been enhanced for platforms such as Kunlun and Hygon, along with comprehensive optimizations to the performance of both the service and the inference engine.
 
 **[2025-07] 《FastDeploy2.0推理部署实测》专题活动已上线!** 完成文心4.5系列开源模型的推理部署等任务,即可获得骨瓷马克杯等FastDeploy2.0官方周边及丰富奖金!🎁 欢迎大家体验反馈~ 📌[报名地址](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[活动详情](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
 
@@ -75,13 +76,13 @@ Learn how to use FastDeploy through our documentation:
 
 | Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
 |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 ||||| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||| |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 ||||| | 128K |
 |ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
 |ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
 |ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | |||||128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ||||| 128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ||| ||128K |
+|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ||||| 128K |
 
 ## Advanced Usage
 
````

README_CN.md (6 additions, 7 deletions)

````diff
@@ -1,5 +1,4 @@
 [English](README.md) | 简体中文
-[English](README.md) | 简体中文
 <p align="center">
 <a href="https://github.com/PaddlePaddle/FastDeploy/releases"><img src="https://github.com/user-attachments/assets/42b0039f-39e3-4279-afda-6d1865dfbffb" width="500"></a>
 </p>
@@ -24,9 +23,10 @@
 </p>
 
 --------------------------------------------------------------------------------
-# FastDeploy 2.0:基于飞桨的大语言模型与视觉语言模型推理部署工具包
+# FastDeploy 2.1:基于飞桨的大语言模型与视觉语言模型推理部署工具包
 
 ## 最新活动
+**[2025-08] 🔥 FastDeploy v2.1 全新发布:** 全新的KV Cache调度策略,更多模型支持PD分离和CUDA Graph,昆仑、海光等更多硬件支持增强,全方面优化服务和推理引擎的性能。
 
 **[2025-07] 《FastDeploy2.0推理部署实测》专题活动已上线!** 完成文心4.5系列开源模型的推理部署等任务,即可获得骨瓷马克杯等FastDeploy2.0官方周边及丰富奖金!🎁 欢迎大家体验反馈~ 📌[报名地址](https://www.wjx.top/vm/meSsp3L.aspx#) 📌[活动详情](https://github.com/PaddlePaddle/FastDeploy/discussions/2728)
 
@@ -41,7 +41,6 @@
 -**高级加速技术**:推测解码、多令牌预测(MTP)及分块预填充
 - 🖥️ **多硬件支持**:NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU等
 
-
 ## 要求
 
 - 操作系统: Linux
@@ -73,13 +72,13 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU
 
 | Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
 |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 ||||| WIP | 128K |
+|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 ||||| |128K |
+|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 ||||| | 128K |
 |ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP || WIP || WIP |128K |
 |ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 ||| WIP || WIP |128K |
 |ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 ||||||128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | |||||128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ||||| 128K |
+|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ||| ||128K |
+|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ||||| 128K |
 
 ## 进阶用法
 
````

docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md (12 additions, 2 deletions)

````diff
@@ -27,7 +27,6 @@ Installation process reference documentation [FastDeploy GPU Install](../get_sta
 **Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
 --port 8180 \
@@ -47,7 +46,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
 **Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
 --port 8180 \
@@ -64,6 +62,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --quantization wint4 \
 --enable-mm
 ```
+
+> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
+
 These examples are configurations that run stably while delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
 ### 2.2 Advanced: How to Achieve Better Performance
 
@@ -109,6 +110,15 @@ These examples are configurations that run stably while delivering relatively good performance.
 - If slightly higher precision is required, you may try WINT8.
 - Only consider using BFLOAT16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
 
+#### 2.2.4 **Adjustable environment variables**
+> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
+- **Description:** Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
+- **Recommendation:** This is a relatively aggressive optimization that affects output quality, and we are still validating its impact comprehensively. Consider enabling it if you have high performance requirements and can accept potential compromises in results.
+
+> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
+- **Description:** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
+- **Recommendation:** This will be replaced by an automatic adjustment mechanism in the future. Consider enabling it if you have high performance requirements.
+
 ## 3. FAQ
 **Note:** Deploying multimodal services requires adding the parameter `--enable-mm` to the configuration.
 
````
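For orientation, the scheduler flag and the two tunables documented in this diff can be combined in one launch script. The following is a minimal sketch rather than an official recipe; every variable and server flag is taken from the page above, and which optional knobs you enable should follow the recommendations in section 2.2.4.

```shell
# Sketch: FastDeploy 2.1+ launch combining the variables documented above.
# Only ENABLE_V1_KVCACHE_SCHEDULER is required on 2.1+; the other two are
# optional performance knobs with the caveats noted in section 2.2.4.
export ENABLE_V1_KVCACHE_SCHEDULER=1   # required: enables the new KV cache scheduler
export FD_SAMPLING_CLASS=rejection     # optional, aggressive: may affect output quality
export FLAGS_max_partition_size=1024   # optional: faster decoding on long text

python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --quantization wint4 \
  --enable-mm
```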

docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md (11 additions, 1 deletion)

````diff
@@ -24,7 +24,6 @@ Installation process reference documentation [FastDeploy GPU Install](../get_sta
 **Example 1:** Deploying a 128K context service on 8x H800 GPUs.
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
 --port 8180 \
@@ -42,6 +41,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --enable-mm
 ```
 
+> ⚠️ For versions 2.1 and above, the new scheduler must be enabled via the environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or may return empty results.
+
 These examples are configurations that run stably while delivering relatively good performance. If you have further requirements for precision or performance, please continue reading below.
 ### 2.2 Advanced: How to Achieve Better Performance
 
@@ -87,6 +88,15 @@ These examples are configurations that run stably while delivering relatively good performance.
 - If slightly higher precision is required, you may try wint8.
 - Only consider using bfloat16 if your application scenario demands extreme precision, as it requires significantly more GPU memory.
 
+#### 2.2.4 **Adjustable environment variables**
+> **Rejection sampling:** `FD_SAMPLING_CLASS=rejection`
+- **Description:** Rejection sampling generates samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby improving sampling speed, which can enhance inference performance.
+- **Recommendation:** This is a relatively aggressive optimization that affects output quality, and we are still validating its impact comprehensively. Consider enabling it if you have high performance requirements and can accept potential compromises in results.
+
+> **Attention hyperparameter:** `FLAGS_max_partition_size=1024`
+- **Description:** This hyperparameter of the Append Attention (default) backend has been tested on commonly used datasets; our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
+- **Recommendation:** This will be replaced by an automatic adjustment mechanism in the future. Consider enabling it if you have high performance requirements.
+
 ## 3. FAQ
 **Note:** Deploying multimodal services requires adding the parameter `--enable-mm` to the configuration.
 
````
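Note that the ⚠️ requirement can also be satisfied per invocation instead of with `export`. This is plain POSIX shell scoping, shown here as a sketch, not a FastDeploy-specific feature; the flags are abbreviated from the example above.

```shell
# Sketch: scope the scheduler variable to a single server process
# (standard shell env-prefix syntax; same effect as export for this process only).
ENABLE_V1_KVCACHE_SCHEDULER=1 python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
  --port 8180 \
  --quantization wint4 \
  --enable-mm
```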

docs/get_started/ernie-4.5-vl.md (1 addition, 0 deletions)

````diff
@@ -23,6 +23,7 @@ Execute the following command to start the service. For parameter configurations
 >💡 **Note**: Since the model parameter size is 424B-A47B, on an 80G * 8 GPU machine, specify ```--quantization wint4``` (wint8 is also supported).
 
 ```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
 --port 8180 --engine-worker-queue-port 8181 \
````

docs/get_started/installation/README.md (1 addition, 0 deletions)

````diff
@@ -3,6 +3,7 @@
 FastDeploy currently supports installation on the following hardware platforms:
 
 - [NVIDIA GPU Installation](nvidia_gpu.md)
+- [Hygon DCU Installation](hygon_dcu.md)
 - [Kunlun XPU Installation](kunlunxin_xpu.md)
 - [Enflame S60 GCU Installation](Enflame_gcu.md)
 - [Iluvatar GPU Installation](iluvatar_gpu.md)
````

docs/get_started/installation/nvidia_gpu.md (2 additions, 2 deletions)

````diff
@@ -20,7 +20,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
 
 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 ```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
 
 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 ```shell
-python -m pip install paddlepaddle-gpu==3.1.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 Then clone the source code and build:
````
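Either install path can be sanity-checked before moving on. `paddle.utils.run_check()` is PaddlePaddle's own documented self-test; the FastDeploy probe below assumes the package exposes a `__version__` attribute, which is a common Python convention rather than something this page documents, hence the fallback.

```shell
# Verify the GPU build of PaddlePaddle (PaddlePaddle's documented self-test).
python -c "import paddle; paddle.utils.run_check()"

# Check that FastDeploy imports; __version__ is assumed, so fall back gracefully.
python -c "import fastdeploy; print(getattr(fastdeploy, '__version__', 'installed'))"
```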

docs/get_started/quick_start_vl.md (1 addition, 0 deletions)

````diff
@@ -19,6 +19,7 @@ For more information about how to install FastDeploy, refer to the [installation
 After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
 
 ```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
 --port 8180 \
````
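Once the server is up, it can be exercised through its OpenAI-compatible API. The sketch below assumes the standard `/v1/chat/completions` route and OpenAI-style multimodal message schema, neither of which is spelled out in this diff, and the image URL is a placeholder.

```shell
# Sketch: query the service started above (port 8180) via the
# OpenAI-compatible chat endpoint. Route and JSON schema follow the
# OpenAI convention (assumed); the image URL is a stand-in.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."}
          ]
        }]
      }'
```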

docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md (14 additions, 5 deletions)

````diff
@@ -9,9 +9,9 @@
 |:----------:|:----------:|:------:| :------:|
 | A30 [24G] | 2 | 2 | 4 |
 | L20 [48G] | 1 | 1 | 2 |
-| H20 [144G] | 1 | 1 | 1 |
-| A100 [80G] | 1 | 1 | 1 |
-| H800 [80G] | 1 | 1 | 1 |
+| H20 [144G] | 1 | 1 | 1 |
+| A100 [80G] | 1 | 1 | 1 |
+| H800 [80G] | 1 | 1 | 1 |
 
 ### 1.2 安装fastdeploy
 
@@ -26,7 +26,6 @@
 **示例1:** 4090上单卡部署32K上下文的服务
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
 --port 8180 \
@@ -46,7 +45,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
 **示例2:** H800上双卡部署128K上下文的服务
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
 --port 8180 \
@@ -63,6 +61,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --quantization wint4 \
 --enable-mm
 ```
+> ⚠️ 2.1及以上版本需要通过环境变量开启新调度器 `ENABLE_V1_KVCACHE_SCHEDULER=1`,否则可能会有部分请求在达到最大长度前被截断或返回空结果。
+
 示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
 如果对精度、性能有进一步的要求,请继续阅读下面的内容。
 ### 2.2 进阶:如何获取更优性能
@@ -110,6 +110,15 @@ python -m fastdeploy.entrypoints.openai.api_server \
 - 若需要稍高的精度,可尝试WINT8。
 - 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16,因为它需要更多显存。
 
+#### 2.2.4 **可调整的环境变量**
+> **拒绝采样:**`FD_SAMPLING_CLASS=rejection`
+- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
+- **推荐**:这是一种影响效果的较为激进的优化策略,我们还在全面验证影响。如果对性能有较高要求,也可以接受对效果的影响时可以尝试开启。
+
+> **Attention超参:**`FLAGS_max_partition_size=1024`
+- **描述**:Append Attention(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
+- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。
+
 ## 三、常见问题FAQ
 **注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`
 
````

docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md (11 additions, 2 deletions)

````diff
@@ -23,8 +23,6 @@
 ### 2.1 基础:启动服务
 **示例1:** H800上8卡部署128K上下文的服务
 ```shell
-export ENABLE_V1_KVCACHE_SCHEDULER=1
-
 python -m fastdeploy.entrypoints.openai.api_server \
 --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
 --port 8180 \
@@ -41,6 +39,8 @@ python -m fastdeploy.entrypoints.openai.api_server \
 --quantization wint4 \
 --enable-mm
 ```
+> ⚠️ 2.1及以上版本需要通过环境变量开启新调度器 `ENABLE_V1_KVCACHE_SCHEDULER=1`,否则可能会有部分请求在达到最大长度前被截断或返回空结果。
+
 示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
 如果对精度、性能有进一步的要求,请继续阅读下面的内容。
 ### 2.2 进阶:如何获取更优性能
@@ -87,6 +87,15 @@ python -m fastdeploy.entrypoints.openai.api_server \
 - 若需要稍高的精度,可尝试WINT8。
 - 仅当您的应用场景对精度有极致要求时候才尝试使用BFLOAT16,因为它需要更多显存。
 
+#### 2.2.4 **可调整的环境变量**
+> **拒绝采样:**`FD_SAMPLING_CLASS=rejection`
+- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
+- **推荐**:这是一种影响效果的较为激进的优化策略,我们还在全面验证影响。如果对性能有较高要求,也可以接受对效果的影响时可以尝试开启。
+
+> **Attention超参:**`FLAGS_max_partition_size=1024`
+- **描述**:Append Attention(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
+- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。
+
 ## 三、常见问题FAQ
 **注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`
 
````
